Principal Component Analysis (PCA): Part 1

Lerekoqholosha
Sep 17, 2021

Motivation

Principal component analysis (PCA) is a technique to reduce the number of features of a machine learning problem, also known as the problem dimension, while trying to maintain most of the information of the original dataset. The two main applications of dimensionality reduction by PCA are:

  1. Visualization of high-dimensional data.
  2. Pre-processing of data to accelerate model training and reduce memory requirements.

This blog introduces PCA, explains how it works, and applies it to the wine dataset.

Intuitive explanation

PCA generates new features, called principal components, that are linear combinations of the original features. The principal components are designed with two objectives:

  1. Differentiate between instances. The value of the principal component should vary as much as possible between instances. Mathematically, this objective is equivalent to maximizing the variance.
  2. Summarize the data by attempting to (only) eliminate redundant information. It should be possible to predict, or rebuild, the original features from the main principal components. When we transform or project the features into principal components, the mathematical objective is to minimize the average squared projection error.

Surprisingly, these two objectives are equivalent. The reason is best understood through an example. Figure 1 shows a dataset with two features, x1 and x2. As there are two features, we can get up to two principal components. The first principal component is depicted as a green arrow, and it maximizes the variance in the following sense: if the instances are projected onto a straight line, their projections are spread out as much as possible when that line is the first principal component. The projection error is the average squared distance between the instances and the green arrow, and it, too, is minimized by the first principal component.

Figure 1: Principal component of a dataset with two features

If we consider the first principal component a sufficiently accurate approximation of the two features, we could replace the two features by only the first principal component, effectively reducing the problem dimension.
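The equivalence can also be checked numerically. Here is a minimal sketch (not from the original post) on a small synthetic two-feature dataset: scanning over candidate directions, the direction with the largest projection variance is also the one with the smallest mean squared projection error.

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=200)   # two correlated features
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                            # center the data

best = None
for angle in np.linspace(0, np.pi, 180):          # candidate directions
    d = np.array([np.cos(angle), np.sin(angle)])  # unit vector
    proj = X @ d                                  # scalar projections
    recon = np.outer(proj, d)                     # points rebuilt from the projections
    var = proj.var()                              # variance of the projections
    err = ((X - recon) ** 2).sum(axis=1).mean()   # mean squared projection error
    if best is None or var > best[0]:
        best = (var, err, d)

print("direction with maximum variance:", best[2])
print("its mean squared projection error:", best[1])
# The same direction also gives the smallest projection error, because
# variance of the projections + projection error = total variance (a constant).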

Wine Data Using PCA

Figure 2: Wine

We will apply principal component analysis to the wine dataset. If you are not familiar with this dataset, it has 178 rows and 14 columns, and we will load it from scikit-learn's built-in datasets. Let's import our libraries first:

import numpy as np
import pandas as pd

# Import dataset
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Using our wine dataset:

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data['target']
df

Removing the target variable (wine type):

wine = df.drop('target', axis=1)

Scaling the predictors:

wine_scaled = preprocessing.scale(wine)
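Why scale first? PCA is driven by variance, so a feature measured on a much larger numeric scale than the others would dominate the first principal component. Below is a minimal check of the effect, assuming the wine and wine_scaled objects defined above (the variable names pca_raw and pca_std are just for illustration):

# Compare PCA on the raw predictors vs. the standardized predictors.
pca_raw = PCA().fit(wine)          # unscaled: one large-scale feature dominates
pca_std = PCA().fit(wine_scaled)   # standardized: variance is shared more evenly
print(pca_raw.explained_variance_ratio_[:2])
print(pca_std.explained_variance_ratio_[:2])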

Build the model:

# define PCA object
pca = PCA()

# fit the PCA model to our data and apply the dimensionality reduction
prin_comp = pca.fit_transform(wine_scaled)

# create a dataframe containing the principal components
pca_df = pd.DataFrame(data=prin_comp)
pca_df["target"] = df["target"]

Our matrix of principal components:

pca_df

How well does each PC capture the variance in the data set?

# plot line graph of cumulative variance explained
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')

pca_85 = PCA(.85)
pca_85.fit_transform(wine_scaled)
print(round(pca_85.explained_variance_ratio_.sum()*100, 1),
      "% of variance explained by", pca_85.n_components_, "components.")

85.1 % of variance explained by 6 components.

pca.explained_variance_ratio_[:3]

array([0.36198848, 0.1920749 , 0.11123631])

Visualising our 13D data in 2D:

ax = sns.scatterplot(x=pca_df[0], y=pca_df[1], hue='target', data=pca_df, legend=True)
plt.show()

This is cool, but how do we know what is contributing to the spatial separation? Enter the biplot:

A biplot is a multivariate visualization that compresses the information in the data and displays it in Cartesian coordinates using principal component analysis (PCA). To identify the variance captured by each component, we need the eigenvalues; these are displayed in Figure 3.

Figure 3
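As a quick sanity check (not part of the original walkthrough), those eigenvalues can be computed directly from the covariance matrix of the scaled data; they agree with the variances reported by the sklearn PCA model fitted earlier.

# The eigenvalues behind the explained variance are the eigenvalues of the
# covariance matrix of the scaled data.
cov = np.cov(wine_scaled, rowvar=False)      # 13 x 13 covariance matrix
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # sorted largest to smallest
print(eigenvalues[:3])
print(pca.explained_variance_[:3])           # should match up to floating point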
#!pip install pca
from pca import pca

X_new = pd.DataFrame(data=load_wine().data, columns=load_wine().feature_names, index=load_wine().target)
wine_scaled_2 = preprocessing.scale(X_new)
wine_scaled_2
# Initialize to reduce the data up to the number of components that explains 85% of the variance.
model = pca(n_components=0.85)

# Or reduce the data towards 2 PCs
# model = pca(n_components=2)

# Fit transform
results = model.fit_transform(wine_scaled_2)

# Plot explained variance
fig, ax = model.plot()
# Scatter first 2 PCs
fig, ax = model.scatter()
model.biplot()
# Make a biplot showing only the top 3 features
fig, ax = model.biplot(n_feat=3, legend=False)

Limitations

Like any other machine learning technique, PCA has some known limitations:

  1. PCA only looks for linear correlation between the features. It will not work effectively if the relationship between the features is not linear (see the sketch after this list).
  2. An underlying assumption of PCA is that the principal component with the highest variance will be the most useful for solving our machine learning problem (for example, predicting the class of an instance). This assumption, although logical, is not always correct.
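To make the first limitation concrete, here is a small sketch (not from the original post) with two features that are perfectly, but nonlinearly, related: PCA cannot compress them into a single component.

import numpy as np
from sklearn.decomposition import PCA

# Points on a circle: x2 is fully determined by x1 (up to sign), but nonlinearly.
theta = np.linspace(0, 2 * np.pi, 500)
circle = np.column_stack([np.cos(theta), np.sin(theta)])

print(PCA().fit(circle).explained_variance_ratio_)
# Roughly [0.5, 0.5]: neither linear component can be dropped, even though the
# data really has only one underlying degree of freedom (the angle theta).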

Conclusion

This blog post has explained principal component analysis and how to apply it. In Part 2 we will look at principal component analysis again, but with a different dataset, and build machine learning models.

GitHub

Wikipedia

scikit-learn

Thank you for reading.

Please let me know if you have any feedback.

