PCA, Principal Components Analysis, as the name indicates, analyzes the data by reducing the number of components (dimensions) while still representing most of the data. Much like the 80/20 rule, where 80% of the effect is represented by 20% of the causes.

Fig above: the X-axis is the number of reduced components and the Y-axis is fidelity (0 to 1), 1 being highest. One can see that with 100 (about 10%) components, more than 93% of the data is represented, while the original data had 1024 variables.

For example, an image of 1024 pixels can be reduced to 100 pixels while retaining about 90% of the accuracy. *image courtesy: ML by Andrew Ng

Below: original image (1024) vs K = 100

Below: original image (1024) vs K = 250

What is K? We will see shortly; K is the number of principal components kept, which is what drives the dimensionality reduction.

So, PCA is about reducing the dimension. In the next example, let's explore it with more mathematics and code. The data below records the consumption of food across 4 different zones, described by 17 variables, i.e. a (4×17) matrix. Using PCA, let's see how many reduced components would be needed to represent the data.

Food | England | Wales | Scotland | N Ireland
---|---|---|---|---
Cheese | 105 | 103 | 103 | 66
Carcass Meat | 245 | 227 | 242 | 267
Other Meat | 685 | 803 | 750 | 568
Fish | 147 | 160 | 122 | 93
Fats and Oils | 193 | 235 | 184 | 209
Sugars | 156 | 175 | 147 | 139
Fresh potatoes | 720 | 874 | 566 | 1033
Fresh Veg | 253 | 265 | 171 | 143
Other Veg | 488 | 570 | 418 | 355
Processed potatoes | 198 | 203 | 220 | 187
Processed Veg | 360 | 365 | 337 | 334
Fresh Fruits | 1102 | 1137 | 957 | 674
Cereals | 1472 | 1582 | 1462 | 1494
Beverages | 57 | 73 | 53 | 47
Soft drinks | 1374 | 1256 | 1572 | 1506
Alcoholic drinks | 375 | 475 | 458 | 135
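For reference, the table above can be loaded as a plain NumPy array (a sketch; the variable names are illustrative, and note the table lists 16 of the 17 variables mentioned in the text):

```python
import numpy as np

# Weekly consumption per zone, transcribed from the table above.
# Rows = zones (England, Wales, Scotland, N Ireland), columns = foods.
X = np.array([
    [105, 245, 685, 147, 193, 156,  720, 253, 488, 198, 360, 1102, 1472, 57, 1374, 375],  # England
    [103, 227, 803, 160, 235, 175,  874, 265, 570, 203, 365, 1137, 1582, 73, 1256, 475],  # Wales
    [103, 242, 750, 122, 184, 147,  566, 171, 418, 220, 337,  957, 1462, 53, 1572, 458],  # Scotland
    [ 66, 267, 568,  93, 209, 139, 1033, 143, 355, 187, 334,  674, 1494, 47, 1506, 135],  # N Ireland
], dtype=float)

print(X.shape)  # (4, 16): 4 observations (zones), 16 variables (foods)
```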

Now, a bit of maths! PCA is all about projecting the data points onto a vector that maximizes the variance of the projections, i.e. the sum of squared distances between the origin and the projected data points. The obvious question is: why maximize? The idea is to lose as little of the original data as possible in the projection.

That is, if we had (m) data points, the projected points should preserve as much of the structure of those (m) points as possible. To help with this we already have eigenvectors!
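The maximization claim can be checked numerically: the top eigenvector of the covariance matrix gives a higher projected variance than any other direction. A small sketch on synthetic 2-D data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
# Correlated 2-D data: random normals pushed through a shear
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 1.0]])
Xc = X - X.mean(axis=0)                      # center the data

lam, Q = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = Q[:, np.argmax(lam)]                   # eigenvector with the largest eigenvalue

def proj_var(v):
    """Variance of the data projected onto unit vector v."""
    return np.var(Xc @ v)

# PC1 beats (or ties) every random unit direction
for _ in range(100):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)
    assert proj_var(pc1) >= proj_var(v) - 1e-9
```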

The classic approach to PCA is to perform the eigendecomposition of the covariance matrix (\Sigma), whose entries are

\[\sigma_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}\left( x_{ij}-\bar{x}_j \right) \left( x_{ik}-\bar{x}_k \right)\]

In matrix notation

\[\Sigma = \frac{1}{n-1} \left( (\mathbf{X} - \mathbf{\bar{x}})^T\,(\mathbf{X} - \mathbf{\bar{x}}) \right)\]

where (\mathbf{X}) is the (normalized) matrix of variables and (\mathbf{\bar{x}}) is the mean vector, ( \mathbf{\bar{x}} = \frac{1}{n} \sum\limits_{i=1}^n \mathbf{x}_{i} ).
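The covariance formula above can be checked directly against NumPy's built-in; a minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))             # 10 observations of 3 variables
x_bar = X.mean(axis=0)                   # mean vector
Xc = X - x_bar                           # centered data

Sigma = (Xc.T @ Xc) / (X.shape[0] - 1)   # matrix form of the covariance above

# Agrees with NumPy's covariance (rowvar=False: variables are columns)
assert np.allclose(Sigma, np.cov(X, rowvar=False))
```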

In linear algebra, one of the key theorems is the Spectral Theorem. It states that if (S) is any symmetric n by n matrix with real coefficients, then (S) has n orthonormal eigenvectors with all the eigenvalues being real. That means we can write (S = A D A^{-1}) with (D) a diagonal matrix of real entries, (D = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)), and the eigenvectors in (A) sorted by their eigenvalues, (\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n). (For a covariance matrix the eigenvalues are additionally non-negative.)
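A quick numeric check of the Spectral Theorem on a synthetic symmetric matrix (note `eigh` returns eigenvalues in ascending order, so we re-sort them descending as in the text):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
S = (A + A.T) / 2                        # any real symmetric matrix

lam, Q = np.linalg.eigh(S)               # real eigenvalues, orthonormal eigenvectors
idx = np.argsort(lam)[::-1]              # sort descending: lambda_1 >= ... >= lambda_n
lam, Q = lam[idx], Q[:, idx]
D = np.diag(lam)

# S = Q D Q^{-1}, and Q^{-1} = Q^T since the eigenvectors are orthonormal
assert np.allclose(S, Q @ D @ Q.T)
assert np.allclose(Q.T @ Q, np.eye(4))
```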

We can use Singular Value Decomposition (SVD), a generalization of the Spectral Theorem. It states that (X = U \Sigma V^{T}), with (U, V) orthonormal (square) matrices of size n and p respectively, and (\Sigma = (s_{ij})) an n×p diagonal matrix of singular values, where (X) is a matrix of data with n observations of p variables. The projected data (the PCA scores) are then given by (X V = U \Sigma).

Using SVD, we compute the eigenvectors (U) and eigenvalues (S) of the covariance matrix (\mathrm{Cov}(X)). The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues give the variance of the data along the new feature axes.
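On a covariance matrix (symmetric, positive semi-definite) the SVD and the eigendecomposition agree, which a small sketch on synthetic data can confirm:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 5))
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / (X.shape[0] - 1)       # covariance matrix (5 x 5)

U, S, Vt = np.linalg.svd(C)              # for symmetric PSD C: U's columns are
                                         # eigenvectors, S the eigenvalues (descending)

lam, Q = np.linalg.eigh(C)               # eigendecomposition (ascending order)
assert np.allclose(np.sort(S), np.sort(lam))
```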

(U) is also represented as (W), the loadings, whose columns are the principal components. (W) is of size (p \times p), i.e. the number of variables, with the first column carrying the maximum variance and the rest in decreasing order of variance.

Now the task is to determine the number of columns that represent most of the data, say 95%; we use (S) to determine that. We can plot the cumulative sum of (S) against the number of components and see how many we really need; a sample is given below.

The X-axis is the principal components, while the Y-axis is the cumulative sum of (S), starting with the highest eigenvalue. As we can see, the first column of (U) covers about 70%, and with 3 columns we cover almost 100%. That is how the dimension is reduced from 17 variables to 3 principal components!
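The cumulative-sum curve and the choice of the number of components can be sketched as follows (synthetic data; the 95% threshold is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 6))
Xc = X - X.mean(axis=0)
_, S, _ = np.linalg.svd(np.cov(Xc, rowvar=False))

ratio = S / S.sum()                      # fraction of variance per component
cum = np.cumsum(ratio)                   # what the cumulative-sum plot shows
k = int(np.searchsorted(cum, 0.95) + 1)  # smallest k covering 95% of the variance

print(cum.round(3), "-> keep", k, "components")
```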

Now let's get the projected data: using the first two columns and computing (X U) gives us the following reduced dimensions. (More in Part 2, where we will see how to compute PCA using code.)

Zones | PC1 | PC2
---|---|---
England | -0.844464 | 0.286952
Wales | -3.902203 | -1.514427
Scotland | 0.414807 | 2.804284
N Ireland | 4.331860 | -1.576809
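The projection itself can be sketched along these lines (synthetic stand-in data of the same shape; with the real table, the exact numbers depend on how the variables are scaled):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(4, 6))                   # stand-in for the zones-by-foods matrix
Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each variable

U, S, Vt = np.linalg.svd(np.cov(Xs, rowvar=False))
scores = Xs @ U[:, :2]                        # project onto PC1 and PC2

print(scores.shape)  # (4, 2): one (PC1, PC2) pair per zone
```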

Plotting only PC1, we can already see how the zones' food consumption compares: N Ireland is far from the others, England and Scotland consume in a similar way, and Wales sits opposite to N Ireland. No wonder they are different countries of the UK!

Here we plot PC1 vs PC2, which gives an idea of how far the zones sit from each other. With the original 17 variables, no one would have guessed how disparate the zones are!

**Use cases of PCA**

- PCA is a dimensionality reduction algorithm; it is typically used to compress data such as images, or to get better insight from multidimensional data
- It is used for feature extraction, so that the reduced features can be fed into other machine learning algorithms, like SVMs, neural networks, and image classification
- PCA is an unsupervised technique, often used as a preprocessing step for classification or regression
- PCA may not work well where the data is not linear, and the reduction is inherently lossy
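The last point, that the reduction is lossy, is easy to demonstrate: reconstructing the data from only k components leaves a nonzero error (a sketch on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 10))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
Xk = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]       # best rank-k reconstruction

err = np.linalg.norm(Xc - Xk) / np.linalg.norm(Xc)
print(err)  # > 0: information discarded by keeping only k components
```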

That's all folks, a long one, whew! In the next part we will see how to use Python code for these findings!