# Principal Component Analysis (PCA) – Better Explained

Principal Components Analysis (PCA) is an algorithm to transform the columns of a dataset into a new set of features.

A large chunk of the information in a large dataset can effectively be compressed into far fewer columns. This enables dimensionality reduction and the ability to visualize the separation of classes or clusters, if any.

In this tutorial, I discuss the core concept, the objective function, and the step-by-step implementation of PCA from scratch in an easy-to-understand manner.

If you are only interested in the PCA algorithm itself, skip Part 1 and go to Part 2 of this post below.


## Contents

1. Introduction: What is PCA?
2. Part 1: Implementing PCA using scikit-learn
3. Building PCA with Scikit-learn
4. Weights of Principal Components
5. Percentage of Variance Explained with each PC
6. Plot the clustering tendency
7. How to get the original features back
8. Part 2: Understanding Concepts behind PCA
9. Objective function of PCA
10. How to understand the rotation of coordinate axes
11. What is Eigen Value and Eigen Vector?
12. Part 3: Steps to Compute Principal Components from Scratch
13. Import Data
14. Step 1: Standardize each column
15. Step 2: Compute Covariance Matrix
16. Step 3: Compute Eigen Values and Eigen Vectors
17. Step 4: Derive Principal Component Features by taking the dot product of eigenvectors and standardized columns
18. Conclusion

## 1. Introduction: What is PCA?

PCA is fundamentally a simple dimensionality reduction technique that transforms the columns of a dataset into a new set of features. It does this by finding a new set of directions (like the X and Y axes) that explain the maximum variability in the data.

This new set of coordinate axes is called the Principal Components (PCs). The projections of the original data onto these new coordinate axes (the PCs) serve as the new, transformed dataset.

But why even compute the PCs?

Because the information contained in a column (variable) is essentially the amount of variance it contains. By re-expressing the data along the directions of maximum variance, PCA packs most of the information present in the original data into a smaller number of columns.
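
As a quick illustration of this idea, here is a minimal sketch with made-up data: two strongly correlated columns get compressed into a first PC that carries nearly all of the variance.

``````
# A minimal sketch with made-up data: two correlated columns
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x1 = rng.normal(size=500)
x2 = 2 * x1 + rng.normal(scale=0.3, size=500)   # strongly correlated with x1
X_toy = np.column_stack([x1, x2])

pca_toy = PCA().fit(X_toy)
print(pca_toy.explained_variance_ratio_)  # PC1 should carry nearly all of the variance
``````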

In this post, I first show how to implement PCA with ready-made functions from `scikit-learn`. Then I explain the concepts behind it more clearly, and finally walk through the step-by-step algorithmic implementation of PCA from scratch.

Structure of the Post:
Part 1: Implementing PCA using the `scikit-learn` package
Part 2: Understanding Concepts behind PCA
Part 3: PCA from Scratch without the `scikit-learn` package.

Let’s first understand the data at hand.

## Part 1: Implementing PCA using scikit learn

##### Dataset Description and Practical Uses of PCA

I’ll use the MNIST dataset, where each row represents a square image of a handwritten digit (0-9). The values in each cell range between 0 and 255 corresponding to the gray-scale color. Each image is of 28×28=784 pixels, so the flattened version of this will have 784 column entries.
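
As a quick sketch of how the data is laid out (this assumes the dataset has already been loaded into a DataFrame `df` and the packages imported, as shown in the next section), any flattened row can be reshaped back into its 28×28 image:

``````
# Reshape one flattened row (784 pixels) back into a 28x28 image
pixels = df.drop(['0'], axis=1).iloc[0].values.reshape(28, 28)
plt.imshow(pixels, cmap='gray')
plt.title("Digit label: " + str(df['0'].iloc[0]))
plt.show()
``````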

Why use PCA?

Practically, PCA is used for two main reasons:

1. Dimensionality Reduction: The information distributed across a large number of columns is transformed into Principal Components (PCs) such that the first few PCs can explain a sizeable chunk of the total information (variance). These PCs can then be used as explanatory variables in Machine Learning models (see the short sketch after this list).

2. Visualize Classes: Visualising the separation of classes (or clusters) is hard for data with more than 3 dimensions (features). With just the first two PCs, a clear separation is often visible.
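
Here is a minimal sketch of the dimensionality-reduction use case. scikit-learn's `PCA` accepts a fractional `n_components`: it then keeps just enough components to explain that share of the variance. The 0.95 below is an arbitrary choice for illustration, and `X` refers to the 784-column dataset prepared later in this post.

``````
# Keep only as many PCs as needed to explain ~95% of the total variance
from sklearn.decomposition import PCA

pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)            # X: the 784-column dataset prepared below
print(X_reduced.shape)                         # far fewer columns than the original 784
print(pca_95.explained_variance_ratio_.sum())  # roughly 0.95
``````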

Is PCA a feature selection technique?

It is not a feature selection technique. Rather, it is a feature combination technique, because each PC is a weighted additive combination of all the columns in the original dataset. More on this when you implement it in the next section.

Note that the PCs are formed in such a way that the first Principal Component (PC1) explains more variance in the original data than PC2. Likewise, PC2 explains more than PC3, and so on.

## 2. Building PCA with Scikit-learn

Using the `scikit-learn` package, the implementation of PCA is quite straightforward. The `sklearn.decomposition` module provides the `PCA` object, which can simply fit and transform the data into its Principal Components.

``````
# Load Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
np.set_printoptions(suppress=True)
%matplotlib inline
``````
##### Import Data

Let’s import the `mnist` dataset. For ease of learning, I am importing a smaller version containing records for digits 0, 1 and 2 only.

``````
# Import Data
url = 'https://raw.githubusercontent.com/selva86/datasets/master/mnist_012.csv'
df = pd.read_csv(url)

# Prepare X and Y
Y = df.loc[:, '0']
X = df.drop(['0'], axis=1)

print(df.shape)  #> (3147, 785)
``````

This dataset has 784 columns as explanatory variables and one Y variable named `'0'`, which tells which digit the row represents.

We won’t use the Y when creating the Principal Components, because I don’t want the PCA algorithm to know which class (digit) a particular row belongs to.

Rather, I create the PCs using only the X. Later, we draw a scatter plot using the first two PCs and color the points based on the actual Y. If the X’s are informative enough, you should see clear clusters of points belonging to the same category.

#### Building the Principal Components

Let’s first create the Principal components of this dataset.

First, I initialize the `PCA()` class and call `fit_transform()` on `X` to simultaneously compute the weights of the Principal Components and transform `X` into the new set of Principal Component features. The result is stored in the `df_pca` object, which is then converted to a pandas DataFrame.

``````
# PCA
pca = PCA()
df_pca = pca.fit_transform(X=X)

# Store as dataframe and print
df_pca = pd.DataFrame(df_pca)
print(df_pca.shape)  #> (3147, 784)
``````

The first column is the first PC, the second is the second PC, and so on. This dataframe (`df_pca`) has the same dimensions as the original data `X`.

## 3. Weights of Principal Components

The `pca` has been built.

The `pca.components_` object contains the weights (also called ‘loadings’) of each Principal Component. It is using these weights that the final Principal Components are formed.

But what exactly are these weights? How are they related to the Principal Components we just formed, and how are they calculated?

Well, in Part 2 of this post you will learn that these weights are nothing but the Eigenvectors of the covariance matrix of X. More details on this when I show how to implement PCA from scratch without using sklearn’s built-in `PCA` module.

The key thing to understand is that each Principal Component is the dot product of its weights (in `pca.components_`) and the mean-centered data (X).

What I mean by ‘mean-centered’ is that the mean of each column of X is subtracted from that column, so that the mean of each column becomes zero. Let’s actually compute this, so it is very clear.

``````
# Principal Components Weights (Eigenvectors)
pd.DataFrame(pca.components_).head()
``````

Each row contains the weights of one Principal Component; for example, the first row contains the 784 weights of PC1. No need to pay attention to the exact values at this point.

Step 2: Compute the mean-centered data.

``````
X_mean = X - X.mean()
``````

In the above dataframe (`X_mean`), the mean of each column has been subtracted from every cell of that column, so the mean of each column is now zero.

Step 3: Take the dot product.

Going by the formula, take the dot product of the weights in the first row of `pca.components_` and the first row of the mean-centered `X` to get the value -134.27. This equals the value in position (0, 0) of `df_pca`, which you can confirm by printing it.

``````
# Compute PC1 for row 1
np.dot(pca.components_[0, :], X_mean.iloc[0, :])
#> -134.272
``````

Likewise, all the cells of the Principal Components matrix (`df_pca`) are computed this way internally.
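
To convince yourself, here is a minimal sketch that recomputes the entire Principal Components matrix in one shot from the objects defined above and compares it with scikit-learn's output:

``````
# Recompute all PCs as (mean-centered X) dot (transpose of the weights)
pcs_manual = np.dot(X_mean.values, pca.components_.T)
print(np.allclose(pcs_manual, df_pca.values))  #> should print True (up to floating-point tolerance)
``````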

## 4. Percentage of Variance Explained with each PC

The PCs are arranged in descending order of the variance (information) explained. To see how much of the total information is contributed by each PC, look at the `explained_variance_ratio_` attribute.

``````
print(pca.explained_variance_ratio_.round(2)[:10])
#> [0.22 0.1  0.06 0.06 0.04 0.04 0.03 0.03 0.02 0.02]
``````

PC1 contributed 22%, PC2 contributed 10% and so on.

The further you go, the smaller the contribution to the total variance. Plotting the cumulative sum gives the bigger picture.

``````
variance_exp_cumsum = pca.explained_variance_ratio_.cumsum().round(2)
fig, axes = plt.subplots(1,1,figsize=(16,7), dpi=100)
plt.plot(variance_exp_cumsum, color='firebrick')
plt.title('Screeplot of Variance Explained %', fontsize=22)
plt.xlabel('# of PCs', fontsize=16)
plt.show()
``````

## 5. Plot the clustering tendency

If you draw a scatterplot against the first two PCs, the clustering of the data points for digits 0, 1 and 2 is clearly visible. Such graphs are good to show your team/client.

Let me define an `encircle` function to draw a boundary around the points in each cluster.

``````
from scipy.spatial import ConvexHull

def encircle(x, y, ax=None, **kw):
    # Draw a convex-hull polygon around the given points
    if not ax:
        ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)
``````

Let’s plot the first two principal components along the X and Y-axis.

``````
# Scatterplot against PC1 and PC2
fig, ax = plt.subplots(1, 1, figsize=(16, 12))

# Row masks for each category
rows_0 = Y == 0
rows_1 = Y == 1
rows_2 = Y == 2

# Plot: the first two PCs are in columns 0 and 1 of df_pca
ax.scatter(df_pca.loc[rows_0.tolist(), 0], df_pca.loc[rows_0.tolist(), 1], c='blue', edgecolor='k', s=120, label='Zero')
ax.scatter(df_pca.loc[rows_1.tolist(), 0], df_pca.loc[rows_1.tolist(), 1], c='red', edgecolor='k', s=120, label='One')
ax.scatter(df_pca.loc[rows_2.tolist(), 0], df_pca.loc[rows_2.tolist(), 1], c='green', edgecolor='k', s=120, label='Two')

# Encircle the boundaries
encircle(df_pca.loc[rows_0.tolist(), 0], df_pca.loc[rows_0.tolist(), 1], ec="blue", fc="none", linewidth=2.5)
encircle(df_pca.loc[rows_1.tolist(), 0], df_pca.loc[rows_1.tolist(), 1], ec="firebrick", fc="none", linewidth=2.5)
encircle(df_pca.loc[rows_2.tolist(), 0], df_pca.loc[rows_2.tolist(), 1], ec="green", fc="none", linewidth=2.5)

# Shade the regions
encircle(df_pca.loc[rows_0.tolist(), 0], df_pca.loc[rows_0.tolist(), 1], ec="k", fc="blue", alpha=0.05)
encircle(df_pca.loc[rows_1.tolist(), 0], df_pca.loc[rows_1.tolist(), 1], ec="k", fc="firebrick", alpha=0.05)
encircle(df_pca.loc[rows_2.tolist(), 0], df_pca.loc[rows_2.tolist(), 1], ec="k", fc="green", alpha=0.05)

# Labels
ax.set_title("MNIST Data for 0, 1 & 2: Scatterplot of First Two PCA Directions", fontsize=22)
ax.set_xlabel("1st Principal Component", fontsize=22)
ax.set_ylabel("2nd Principal Component", fontsize=22)
ax.legend(loc='best', title='Digit', fontsize=16)
plt.show()
``````

In the resulting plot, though there is a certain degree of overlap, the points belonging to the same category are distinctly clustered and region bound. This shows that the data captured in the first two PCs is informative enough to discriminate the categories from each other.

In other words, we now have evidence that the data is not completely random, but can actually be used to discriminate or explain the Y (the digit a given row represents).

Refer to the 50 Masterplots with Python for more visualization ideas.

## 6. How to get the original features back

The fitted `pca` object has an `inverse_transform()` method that gives back the original data when you feed it the Principal Component features.

``````
df_orig = pca.inverse_transform(df_pca)
``````

Since we kept all the components, the above code recovers the original input data (up to floating-point precision), returned as a NumPy array.
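
The same pair of `transform` / `inverse_transform` calls is what makes PCA useful for compression. Here is a minimal sketch that keeps only the first 50 PCs (an arbitrary choice for illustration), reconstructs the data from them, and measures how far the reconstruction is from the original:

``````
# Compress to 50 PCs, then reconstruct and measure the error
pca_50 = PCA(n_components=50)
X_reduced_50 = pca_50.fit_transform(X)
X_recovered_50 = pca_50.inverse_transform(X_reduced_50)

print(pca_50.explained_variance_ratio_.sum())     # fraction of variance retained by 50 PCs
print(np.abs(X.values - X_recovered_50).mean())   # average absolute reconstruction error per pixel
``````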

## Part 2: Understanding Concepts behind PCA

To simplify things, let’s imagine a dataset with only two columns. Using these two columns, I want to find a new column that better represents the combined information contributed by them. This new column can be thought of as a line that passes through the data points.

Such a line can be represented as a linear combination of the two columns, and it should explain the maximum variation present in them. In what direction do you think such a line should point so that it captures the maximum variation of the data points? It should be in the direction that minimizes the perpendicular distances of the points from the line.

But how do we determine this line?

I am only interested in determining the direction (`u1`) of this line, because, by knowing the direction `u1`, I can compute the projection of any point onto this line. The vector `u1` has a length of 1 unit and is called a unit vector. This unit vector eventually becomes the weights of the Principal Components, also called loadings, which we accessed using `pca.components_` earlier.
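
Here is a minimal 2-D sketch (with made-up points and a made-up direction `u1`) of what "projecting a point onto a direction" means numerically: the projection of each point is simply its dot product with the unit vector.

``````
# Project 2-D points onto a unit vector u1 (toy example)
import numpy as np

points = np.array([[2.0, 1.5], [1.0, 0.8], [-1.5, -1.0]])   # made-up, roughly mean-centered points
u1 = np.array([0.8, 0.6])                                    # a unit vector: 0.8^2 + 0.6^2 = 1

projections = points.dot(u1)   # one number per point: its coordinate along u1
print(projections)
``````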

Remember the PCA weights you calculated in Part 1 under ‘Weights of Principal Components’? They are the same as the `u1` I am talking about here.

The PCA weights (`Ui`) are unit vectors of length 1, because they are meant to represent only a direction.

``````
# PCA weights have length (norm) = 1
np.linalg.norm(pca.components_[0])
#> 0.9999999999999993
``````

## 7. Objective function of PCA

So, how do we find this unit vector (`u1`)?

Remember, we wanted to minimize the distances of the points from PC1’s direction?

`u1` is the unit vector of that direction.

To determine `u1`, we use the Pythagorean theorem to arrive at the objective function.

Here, `u1` is the unit vector along the direction of PC1 and `xi` is the coordinate vector of a data point in the 2-D space. Remember, `xi` is nothing but the row corresponding to that datapoint in the original dataset.

For each point, the squared distance from the origin splits (by the Pythagorean theorem) into the squared length of its projection onto `u1` plus the squared perpendicular distance from the line.

### Objective function:

The objective is to determine `u1` so that the mean squared perpendicular distance of the points from the line is minimized:

$$\min_{\|u_1\| = 1} \;\; \frac{1}{n}\sum_{i=1}^{n} \big\| x_i - (x_i \cdot u_1)\, u_1 \big\|^2$$

Since the total squared distance of the points from the origin is fixed, minimizing the perpendicular distances is the same as maximizing the variance of the projections onto `u1`. It can be proved that this objective reaches its optimum when `u1` equals the Eigenvector of the covariance matrix of `X` that has the largest Eigenvalue. This Eigenvector is the same as the PCA weights we got earlier inside the `pca.components_` object. We’ll see what Eigenvectors are shortly.

Alright. But there can be a second PC for this data. The next best direction to explain the remaining variance is perpendicular to the first PC.

In fact, there can be as many Eigenvectors as there are columns in the dataset, and they are all orthogonal to each other.

## How to understand the rotation of coordinate axes

Now you know the direction of the unit vector. To compute the Principal Components, we rotate the original X-Y axes to align with the directions of the unit vectors.

The Principal Components are nothing but the new coordinates of the points with respect to these rotated axes.

## 8. What is Eigen Value and Eigen Vector?

If you are like me, Eigenvalues and Eigenvectors are concepts you would have encountered in a matrix algebra class but paid little attention to. Well, Eigenvalues and Eigenvectors are at the core of PCA.

What is an Eigenvector?

`v` is an Eigenvector of a matrix `A` if `A·v` is a scalar multiple of `v`, that is, `A·v = λ·v` for some scalar `λ` (the corresponding Eigenvalue). The actual computation of Eigenvectors and Eigenvalues is quite straightforward using the `eig()` function in the `numpy.linalg` module.
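
As a minimal sketch of that definition (using a small made-up 2×2 matrix), `np.linalg.eig` returns the Eigenvalues and the Eigenvectors (as columns), and you can verify that `A·v = λ·v` for each pair:

``````
# Verify A.v = lambda.v for a small made-up matrix
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

eigvalues, eigvectors = np.linalg.eig(A)
v = eigvectors[:, 0]          # first eigenvector (a column of the output)
lam = eigvalues[0]            # its eigenvalue

print(A.dot(v))               # both prints should show the same vector
print(lam * v)
``````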

## Part 3: Steps to Compute Principal Components from Scratch

## 9. Import Data

``````
# Import Data
url = 'https://raw.githubusercontent.com/selva86/datasets/master/mnist_012.csv'
df = pd.read_csv(url)

# Prepare X and Y
Y = df.loc[:, '0']
X = df.drop(['0'], axis=1)
print(df.shape)  #> (3147, 785)
``````

## Step 1: Standardize each column

Subtract the mean of each column from every value in that column. As a result, the mean of each column becomes zero. (Note that I am only mean-centering here, not dividing by the standard deviation, since all the pixel columns are already on the same 0–255 scale.)

``````
X_standard = X - X.mean()
X_standard
``````

## Step 2: Compute Covariance Matrix

But what are covariance and the covariance matrix?

Covariance measures how two variables are related to each other, that is, whether they move in the same direction or in opposite directions with respect to each other. When the covariance is positive, it means that when one variable increases, the other tends to increase as well. The opposite is true when the covariance is negative. The covariance matrix contains the covariances of all possible pairs of columns. As a result, it is a square matrix with as many rows and columns as there are columns in the data.
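
As a minimal sketch of what the covariance matrix is computationally (using the mean-centered `X_standard` from Step 1), it can also be obtained directly from the matrix product of the centered data with itself, divided by n − 1:

``````
# Covariance matrix from first principles: (X_centered transposed) . (X_centered) / (n - 1)
n = X_standard.shape[0]
cov_manual = np.dot(X_standard.values.T, X_standard.values) / (n - 1)

print(np.allclose(cov_manual, X_standard.cov().values))  #> should print True
``````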

But how do you actually compute the covariance matrix in Python?

With a pandas DataFrame, the covariance matrix is computed by calling its `cov()` method.

``````
df_cov = X_standard.cov()
print(df_cov.shape)  #> (784, 784)
``````

## Step 3: Compute Eigen values and Eigen Vectors

The Eigenvalues represent the amount of variance explained along each new direction, and the corresponding Eigenvectors describe how the original columns combine to form that direction. The Eigenvectors have a length of one (they are unit vectors).

``````
from numpy.linalg import eig

eigvalues, eigvectors = eig(df_cov)
print(eigvalues[:10])
print(eigvectors.shape)
#> [764074.70509694+0.j 333409.88084323+0.j 206248.30545139+0.j
#>  190073.11430869+0.j 149373.67054257+0.j 140893.79592355+0.j
#>  103478.87643359+0.j  87549.48702637+0.j  82347.51266608+0.j
#>  66776.33611636+0.j]

#> (784, 784)
``````

The `j` in the above output means the Eigenvalues are returned as complex numbers. Since the covariance matrix is symmetric, the imaginary parts are all zero, so you can safely keep only the real part.
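
A small aside, as a sketch: because the covariance matrix is symmetric, `numpy.linalg.eigh` can be used instead; it returns real Eigenvalues (sorted in ascending order). Sorting them in descending order and normalising by their sum should reproduce the `explained_variance_ratio_` we saw from scikit-learn in Part 1.

``````
# Real eigenvalues via eigh (for symmetric matrices), sorted largest first
from numpy.linalg import eigh

eigvalues_real, _ = eigh(df_cov)
eigvalues_desc = eigvalues_real[::-1]

# Share of total variance per eigenvalue ~ explained_variance_ratio_ from Part 1
print((eigvalues_desc / eigvalues_desc.sum()).round(2)[:10])
#> [0.22 0.1  0.06 0.06 0.04 0.04 0.03 0.03 0.02 0.02]
``````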

## Step 4: Derive Principal Component Features by taking dot product of eigen vector and standardized columns

``````
X_pca = np.dot(X_standard, eigvectors)
df_pca_calc = pd.DataFrame(X_pca)
``````

The result is the Principal Components matrix, which matches the PCs computed using the `scikit-learn` package (up to the sign of each column, since the sign of an Eigenvector is arbitrary).
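
As a quick sanity check (a sketch assuming the Part 1 objects, in particular `df_pca`, are still in memory), you can compare the two results. Comparing absolute values sidesteps the arbitrary signs:

``````
# Compare the from-scratch PCs with scikit-learn's (absolute values sidestep sign flips)
# Note: if eig had returned the eigenvalues unsorted, the columns would also need reordering.
print(np.allclose(np.abs(df_pca_calc.values), np.abs(df_pca.values)))  #> should print True
``````
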
## Conclusion

Congratulations if you’ve made it this far, because we have covered pretty much all the core ideas you need in order to crack any question related to PCA. PCA is a powerful tool for visualizing clusters in multi-dimensional data. Plus, it is also useful while building machine learning models, since the PCs can be used as explanatory variables. You saw the implementation in `scikit-learn`, the concepts behind it, and how to code it out algorithmically as well.