SVM, Optimization and Kernels

Here again with notes that I can get back to when my memory fails! This is one of those posts that should clear doubts on optimization techniques, the math behind kernels, et al.

SVM uses the same basic vector inner product that we had already seen here while getting to the bottom of PCA: \(a_{1}=a\cdot\hat{b}\), where \(\hat{b}\) is the unit vector of \(b\).

Vector Inner Product
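As a tiny refresher, here is that projection in numpy (the vectors a and b below are just illustrative values I picked):

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([2.0, 0.0])
#unit vector of b
b_hat = b / np.linalg.norm(b)
#scalar projection of a onto b: a1 = a . b_hat, here 3.0
a1 = a.dot(b_hat)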

We use SVM to classify data points by drawing a plane that separates the data. We want to draw the plane in such a way that the distance between the plane and the nearest points, also called the margin, is maximum.

We can see in the figure above that the best fit is where \(\frac{2}{||w||}\) is maximum. That can be written as a minimization problem, \(\min ||w||\), or for the sake of convenience \(\min ||w||^2\). More on this under the Hard-margin section of the wiki and in a very good MIT OpenCourseWare video.

So we can write \[\min_{w} ||w||^2 \]

\[\text{subject to}\quad y_i(\mathbf{w}\cdot\mathbf{x_i}+b) \geq 1\]
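A quick note on where the \(\frac{2}{||w||}\) comes from: the margin is the distance between the two supporting hyperplanes \(\mathbf{w}\cdot\mathbf{x}+b=+1\) and \(\mathbf{w}\cdot\mathbf{x}+b=-1\),

\[ \text{margin} = \frac{|(+1)-(-1)|}{||\mathbf{w}||} = \frac{2}{||\mathbf{w}||} \]

so maximizing the margin is the same as minimizing \(||\mathbf{w}||\), and squaring it keeps the problem nicely convex and differentiable.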

With regularization

\[ \min_{w} || w ||^2 + C \sum_{i} ^N \max (0, 1 − y_{i} f(x_{i})) \]

\(f(x_{i})\) is \(\mathbf{w}\cdot\mathbf{x_{i}}+b\), and \( \max (0, 1 - y_{i} f(x_{i})) \) is the hinge loss function.
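To make the objective concrete, here is a minimal numpy sketch that evaluates it for a given \(w\) and \(b\) (the function name hinge_objective and the default C=1.0 are just my choices):

import numpy as np

def hinge_objective(w, b, X, y, C=1.0):
    #functional margins y_i * f(x_i), where f(x_i) = w.x_i + b
    margins = y * (X.dot(w) + b)
    #hinge loss per point: max(0, 1 - y_i * f(x_i))
    hinge = np.maximum(0, 1 - margins)
    #regularized objective: ||w||^2 + C * sum of hinge losses
    return w.dot(w) + C * np.sum(hinge)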

Optimization:

Now that we know the optimization problem and its constraints, there are popular techniques for solving it, mainly Lagrange multipliers, the KKT conditions and others, but the recent approach in machine learning is to solve it online using gradient descent algorithms such as sub-gradient descent, SGD and others.

Update rule for SGD 

if \( y_{i}\langle x_{i},w\rangle < 1 \) (the point is mis-classified or inside the margin), then \( w = w + \eta\,(y_{i}x_{i} - 2\lambda w) \)

else \( w = w + \eta\,(-2\lambda w) \)

Briefly, here is the pseudo-code for SGD:

  • Initialize the weight vector w = 0
  • Set the learning rate η = 1 and the regularizer λ = 0.01, and recalculate η = 1/(epoch*λ) for each iteration
  • Set the number of epochs = 10000
  • Loop over every point and check the mis-classification condition
    • if \(y_{i}⟨x_{i},w⟩<1\)
      • then update the weights \(w=w+η(y_{i}*x_{i}−2λw)\)
    • else \(w=w+η(−2λw)\)
  • Repeat the loop for the number of epochs
import numpy as np

def svm_sgd(X, y):
    #X: array (m records x n features), y: array of m labels, +1 or -1
    #initialize the weight vector, one weight per feature
    w = np.zeros(X.shape[1])
    #initialize learning rate and number of epochs
    eta = 1
    epochs = 100000
    #loop over epochs; the regularizer 1/e shrinks as training progresses
    for e in range(1, epochs):
        for i, x in enumerate(X):
            if (y[i] * np.dot(X[i], w)) < 1:
                #mis-classified (or inside the margin): pull w towards y_i * x_i
                w = w + eta * ((y[i] * X[i]) + (-2 * (1 / e) * w))
            else:
                #correctly classified: apply only the regularization update
                w = w + eta * (-2 * (1 / e) * w)
    return w

Now that we have the weights, how do we go about predicting and drawing the hyperplane? Here we go with some sample data:

X = np.array([[ 8,  7], [ 4, 10], [ 9,  7],
              [ 7, 10], [ 9,  6], [ 4,  8],
              [10, 10], [ 2,  7], [ 8,  3],
              [ 7,  5], [ 4,  4], [ 4,  6], [ 1,  3], [ 2,  5]])
y = np.array([ 1.,  1.,  1.,  1.,  1.,  1.,  1., -1., -1., -1., -1.,
              -1., -1., -1.])

And the scatter plot looks like this

Let’s run through SGD and get the weights (w):

w=svm_sgd(X,y)
#w is array([-4.24898618, -6.08613149])

Calculate the bias (b). It is the average of b over all the points in X, where b for a single point is \(y - w \cdot x\). That can be written as

b = sum(y - X.dot(w)) / len(X)   #average over all m = len(X) (i.e. X.shape[0]) points
#b is 63.53627670394709

Now that we have w and b, let’s go ahead and plot:

import matplotlib.pyplot as plt

#the separating line w0*x0 + w1*x1 + b = 0, rewritten as x1 = a*x0 - b/w1
a = -w[0] / w[1]
#x values from -1 to 10 (roughly min of X to max of X)
xx = np.linspace(-1, 10)
yy = a * xx - b / w[1]
#plot the separator
ax = plt.gca()
ax.plot(xx, yy)
#plot the points coloured by label
ax.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Here is the final plot with the hyperplane.

Kernel:

So, where is the kernel? So far we have seen data that is linearly separable; what if the data is not linearly separable and looks something like this?

In the 3D plot with y as the third axis below, we can see that a hyperplane at y = 0.4 can separate the data.

Courtesy: course by Prof. Andrew Ng

There is no straight line that can separate the data, but there does seem to be a curved boundary that can. So there is a trick we can use: transform the data into another space where it can be separated linearly. That transformation is done by a mapping, denoted by \(\phi\), and the kernel is the inner product of the mapped points, \(K(x_i,x_j)=\phi(x_i)\cdot\phi(x_j)\). With the mapping, the equation would look like this

\[ \min_{w} || w ||^2 + C \sum_{i}^{N} \max (0, 1 - y_{i} (\mathbf{w}\cdot\phi(x_{i})+b)) \]

Some of the kernels are

  • Gaussian Radial Basis Function (RBF) Kernel

\[ K(x_i,x_j)=\exp\left( -\frac {\|x_j-x_i\|^2} {2\sigma^2}\right)\]

  • Polynomial Kernel of degree up to \(d\)

\[ K(x,y)=(x^Ty+1)^d\]

  • Sigmoid Kernel

\[ K(x,y)=\tanh(\alpha x^Ty+c)\]
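Just to see these on a screen, here is a small numpy sketch of the three kernels (the parameter defaults sigma=1.0, alpha=0.01 and c=0.0 are arbitrary picks of mine), plus a quick check that the degree-2 polynomial kernel really is an inner product in a higher-dimensional feature space, computed without ever building \(\phi(x)\) explicitly:

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    #Gaussian RBF: exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def poly_kernel(x, z, d=2):
    #polynomial kernel of degree d: (x.z + 1)^d
    return (np.dot(x, z) + 1) ** d

def sigmoid_kernel(x, z, alpha=0.01, c=0.0):
    #sigmoid kernel: tanh(alpha * x.z + c)
    return np.tanh(alpha * np.dot(x, z) + c)

#the "trick": for d=2, (x.z + 1)^2 equals phi(x).phi(z) with the explicit map
#phi(v) = (1, sqrt(2)v1, sqrt(2)v2, v1^2, sqrt(2)v1v2, v2^2)
phi = lambda v: np.array([1, np.sqrt(2)*v[0], np.sqrt(2)*v[1],
                          v[0]**2, np.sqrt(2)*v[0]*v[1], v[1]**2])
x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(poly_kernel(x, z), phi(x).dot(phi(z)))  #both are ~144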

Finally, we don’t have to re-invent the wheel. This post is all about getting to know the mechanics of the algorithm; it helps to use popular libraries when solving real-world classification problems.

There are already SVM implementations: libSVM is one of the most popular, and of course we have the scikit-learn libraries. We shall see a real-life SVM in action sometime.
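In the meantime, here is a rough sketch with scikit-learn, assuming the toy X and y arrays from the example above (linear kernel, C left at 1.0):

from sklearn import svm

#fit a linear SVM on the toy data from above
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

#for the linear kernel, coef_ and intercept_ give the learned w and b
print(clf.coef_, clf.intercept_)
#classify a couple of new points
print(clf.predict([[3, 4], [9, 9]]))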