PCA with Python!

PCA is a more than century-old algorithm, invented in 1901 by Karl Pearson and still used today for feature extraction and data exploration. We have already seen the theory here; now let's see some code in action.

PCA is all about finding the vectors that maximize the spread (variance) of the points projected onto them. For example, to take the orthogonal projection of \(a\) onto a vector \(b\), we find \(a_{1}\), the scalar projection of the point (a point is a vector) onto \(b\), and multiply it by the unit vector of \(b\).

Projection of \(a\) on \(b\) (\(a_{1}\)), and rejection of \(a\) from \(b\) (\(a_{2}\)).

\(a_{1}=a\cdot\hat{b}\), where \(\hat{b}\) is the unit vector of \(b\).
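
The rejection mentioned in the figure caption is simply the leftover component: \(a_{2}=a-a_{1}\hat{b}\).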

Generalizing, when we have a data matrix \(X\) of points and a candidate direction \(w\), we want to maximize the lengths of the projections \(Xw\), that is, maximize \(\|Xw\|^2\):

\[\underset{w}{\operatorname{arg\,max}}\; L(w)=\|Xw\|^2, \qquad L(w)=(Xw)^T(Xw)=w^TX^TXw \quad \text{subject to}\quad w^Tw=1\]

Using Lagrange multipliers

\[\mathcal{L}(w,\lambda) = w^T(X^TX)w-\lambda (w^Tw-1)\]
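
Setting the gradient with respect to \(w\) to zero:

\[\frac{\partial \mathcal{L}}{\partial w}=2X^TXw-2\lambda w=0\]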

Rearranging, we finally get \[(X^TX)w = \lambda w\] where \(w\) is an eigenvector (a unit vector) and \(\lambda\) the corresponding eigenvalue (proportional to the variance along \(w\)).
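
As a quick numerical sanity check (a standalone sketch with its own toy data, separate from the X used below): the top eigenvector of \(X^TX\) beats any random unit vector.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))            # toy data: 50 samples in rows, 2 features

vals, vecs = np.linalg.eigh(A.T @ A)
w_top = vecs[:, -1]                     # eigenvector with the largest eigenvalue

# try many random unit vectors; none should beat the top eigenvector
W = rng.normal(size=(1000, 2))
W /= np.linalg.norm(W, axis=1, keepdims=True)

print(np.linalg.norm(A @ w_top)**2)                 # equals vals[-1]
print((np.linalg.norm(A @ W.T, axis=0)**2).max())   # always <= the value above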

Let’s get down to the code, starting with a single point pt = np.array([1, 1.5]) and a vector v = np.array([2, 1.5]).

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
# tells matplotlib to embed plots within the notebook
%matplotlib inline

We use NumPy's np.linalg.norm to get the unit vector \(\hat{b}\) and then the dot function to do the vector multiplication.

First we get \(a_{1}\) and then scale \(\hat{b}\) by it to get the projection:

pt = np.array([1, 1.5])        # the point a
v = np.array([2, 1.5])         # the vector b
v = v / np.linalg.norm(v)      # unit vector of b
vr = v.reshape((1, 2))         # as a row vector
d = vr.dot(pt)                 # scalar projection of pt onto v
p = vr.T.dot(d)                # scale the unit vector by d to get the projection
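
For these particular numbers we can check the result by hand:

\[\hat{b}=\frac{(2,\,1.5)}{2.5}=(0.8,\,0.6),\qquad a_{1}=(1)(0.8)+(1.5)(0.6)=1.7,\qquad p=1.7\,(0.8,\,0.6)=(1.36,\,1.02)\]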

Now let’s plot and see

fig, ax = plt.subplots(figsize=(7, 6))
ax.plot(p[0], p[1], 'o', mec='r', mew=2, mfc='none')    # projected point (red)
ax.plot(pt[0], pt[1], 'o', mec='b', mew=2, mfc='none')  # original point (blue)
ax.plot([0, v[0]*4], [0, v[1]*4], '--k', lw=1)          # direction of v
plt.axis([0, 4.75, 0, 4.75])

As we can see from the figure below, the original point is in blue and the projected one in red.

Same with more points

Let’s expand this and do it for a bigger data set.

X = np.random.multivariate_normal([0, 0], [[3, 1.5], [1.5, 1]], size=50).T
# Centering the data is a must: subtract the mean from every point.
# (Scaling each feature by its standard deviation σ is a separate, optional step
#  when the features are on very different scales.)
x_mean = X.mean(axis=1)
X[0] = X[0] - x_mean[0]
X[1] = X[1] - x_mean[1]
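
Dividing by σ (full standardization) would be that separate, optional step; a minimal sketch for this row-per-feature layout, not needed for this toy data:

X_std = X / X.std(axis=1, keepdims=True)   # divide each (already centered) row by its σ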

Plot the data

fig, ax = plt.subplots(figsize=(7, 6))
ax.scatter(X[0],X[1])

There are functions in NumPy that we can use to get the eigenvectors and eigenvalues:

cov_mat = np.cov(X)  
eig_vals, eig_vecs = np.linalg.eigh(cov_mat) 
eig_vals, eig_vecs
 
The eigenvalues are
array([0.12436575, 4.31093611])
and the eigenvectors are
array([[ 0.48025728, -0.87712767],
       [-0.87712767, -0.48025728]])
(your exact numbers will differ slightly, since X is generated randomly).

Now we have the vectors, but which one maximizes the variance? That can be read off the eigenvalues: the higher the eigenvalue, the higher the variance, hence we go with 4.31093611. In case of a large number of dimensions we can do a pair-wise sort and take the top k vectors. As we have seen earlier here, we can use the cumulative plot to arrive at k.

fig, ax = plt.subplots(figsize=(6,5))
ax.plot(X[0,:], X[1,:], 'bo', ms=10, mec='k', mew=0.25)
for i in range(2):
    # draw each eigenvector scaled by its eigenvalue
    ax.arrow(x_mean[0], x_mean[1],
             eig_vals[i]*eig_vecs[0, i], eig_vals[i]*eig_vecs[1, i],
             head_width=0.25, head_length=0.2, fc='r', ec='r', lw=2, zorder=1000)
ax.axis([-4.5, 4.75, -4, 3.75])
ax.set_aspect('equal')
ax.grid(False)

Here is the plot. We can ignore the direction (sign) of the vectors, since the sign NumPy returns is arbitrary. Now we need to select the vector that maximizes the variance and reduce the data.
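
A quick check of why the sign doesn't matter, reusing cov_mat, eig_vals and eig_vecs from above: a flipped eigenvector still satisfies the eigen-equation.

w = eig_vecs[:, 1]                                       # the top eigenvector
print(np.allclose(cov_mat.dot(w), eig_vals[1] * w))      # True
print(np.allclose(cov_mat.dot(-w), eig_vals[1] * (-w)))  # still True with the sign flipped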

Here is the sort and the plot of the cumulative sum. We can see the first vector alone is good enough.

eig_pairs = [(np.abs(eig_vals[i]),eig_vecs[:,i]) for i in range(len(eig_vals))]
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    plt.bar(range(2), var_exp, alpha=0.5, align='center',
            label='Individual Variance')
    plt.step(range(2), cum_var_exp, where='mid',
             label='Cumulative variance')
    plt.ylabel('Variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()

Now that we have established that the first vector is good enough, let's get the scores and plot.

U = np.hstack((eig_pairs[0][1].reshape(2, 1),
               eig_pairs[1][1].reshape(2, 1)))
# multiply the data by the first vector U[:, :1] to get the scores Z (50, 1)
Z = X.T.dot(U)[:, :1]
# now multiply Z (50, 1) by U[:, :1].T (1, 2) to get the flattened data back
X_rec = Z.dot(U[:, :1].T).T

Plot X_rec against the original data X and see:


fig, ax = plt.subplots(figsize=(7, 6))
ax.plot(X_rec[0,:], X_rec[1,:], 'bo', ms=5, mec='none', mew=1)
ax.plot(X[0,:], X[1,:], 'ro', ms=5, mec='r', mew=1, mfc='none')
#ax.axis([0.5, 6.5, 2, 8])
ax.axis([-4.5, 4.75, -4, 3.75])
ax.set_aspect('equal')
ax.grid(False)

Here is the final plot. Note the recovered data is not the same as the original because we used only one PC; if we use all the PCs we can recover the original data exactly:

Z = X.T.dot(U)
X_rec=Z.dot(U.T).T
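
With all the components kept, the reconstruction matches the original up to floating-point error; a quick check:

print(np.allclose(X, X_rec))   # True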

That’s the long way. There is a shorter and well-proven way: use the scikit-learn libraries.


from sklearn.preprocessing import StandardScaler
from scipy import linalg
# scikit-learn expects samples in rows, so we pass X.T (shape (50, 2))
scaler = StandardScaler()
scaler.fit(X.T)
# use SVD
U, S, V = linalg.svd(scaler.transform(X.T).T)
# or use PCA directly to reduce and recover
from sklearn.decomposition import PCA as sklearnPCA
sklearn_pca = sklearnPCA(n_components=None)
# get the scores matrix
Y_sklearn = sklearn_pca.fit_transform(X.T)
# recover the data back (transpose to get it in the same layout as X)
X_rec = sklearn_pca.inverse_transform(Y_sklearn).T
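
scikit-learn also exposes the variance ratios directly, the same quantities we plotted cumulatively above:

print(sklearn_pca.explained_variance_ratio_)             # fraction of variance per component
print(np.cumsum(sklearn_pca.explained_variance_ratio_))  # cumulative variance, as in the step plot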

That’s all folks, PCA done and dusted! 😉