Virtually every project we work on at Firefinch involves data analysis in one way or another, and a great technique we like to use is Principal Component Analysis (PCA).
After recently using the technique to classify some cytometry data, I wanted to write about why we find it such a workhorse tool, and to demonstrate how a comparatively intuitive and straightforward method can help you get good insights from your data set.
Amongst other things, PCA can be used in exploratory data analysis to understand what information is hiding in your data, to clean up data by reducing noise, and to extract features as part of a data processing pipeline or classification model. An acquaintance of mine once impressively managed to use PCA to turn 3000 variables of FTIR data into a model which classified cheese as either mature or extra mature.
PCA also has the advantage of making very few assumptions about your data, namely only that it is linear, that the signal of interest makes up the bulk of the variance in your dataset, and that there is a degree of redundancy in the dataset. We’ll revisit the reasons for these assumptions later, so for now we’ll list them as:

1. Linearity
2. The signal of interest accounts for most of the variance
3. Redundancy
What does PCA actually do?
It rotates your data. More specifically, it calculates an orthogonal transform (a rotation) which when applied to your data expresses the data in a new way.
Instead of your original variables, you get the same number of new ‘composite’ variables back. The difference is that before you performed the PCA, your variables might have been correlated with one another; PCA attempts to produce completely uncorrelated variables, which are called Principal Components. Furthermore, the Principal Components themselves are ordered such that the first component captures as much of the variance in the data set as possible, the second the next highest amount of variance, and so on.
In a two dimensional dataset, this is equivalent to the first principal component (i.e. a new x-axis) being the line of best fit through the data, and the second principal component (i.e. a new y-axis) being a line perpendicular to that.
The orientation of each component relative to the original variables is described by its eigenvector (a set of weights on the original variables), and the amount of variance captured by the principal component is its eigenvalue.
For example, if I had a dataset which included measurements of both height and shoe size of various individuals we can assume there is a degree of correlation between these variables – i.e. a good chance that taller people have larger feet and vice versa. After PCA we would expect to see that both these variables had been bundled together into a single component expressing the more fundamental concept of “Size”. And if “Size” explained the majority of the variation in the data, then it would be the first Principal Component.
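We can sketch that intuition with a small simulated example (the numbers here are made up purely for illustration): two correlated measurements collapse onto a single dominant “Size” component.

```r
set.seed(1)

# Hypothetical data: shoe size loosely tracks height
height <- rnorm(100, mean = 170, sd = 10)
shoe   <- 0.2 * height + rnorm(100, sd = 1)

pca <- prcomp(cbind(height, shoe), center = TRUE, scale. = TRUE)

# Both variables load on PC1 with the same sign: a combined "Size" axis
pca$rotation[, "PC1"]

# PC1 captures the large majority of the variance
summary(pca)$importance["Proportion of Variance", "PC1"]
```

Because the two variables are strongly correlated, PC1 weights them roughly equally and soaks up most of the variance, leaving PC2 with little more than the noise.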
Our first assumption in the list above (linearity) is imposed here, as the orthogonal rotation itself is a linear transformation. In fairness, various non-linear variants of PCA do exist, but without the constraint of linearity the problem is vastly more expensive to compute, and traditional PCA is always linear.
Why would you want to do that?
It’s actually very useful! Well, it is if your data conforms to our second assumption: that the largest source of variance in the data is what you are actually trying to measure. In general, when capturing measurements from the real world, your data consists of both signal (the information you are trying to capture) and noise (unwanted variance in the measurements). If your method for capturing information (be it a piece of lab equipment, a survey, images, or whatever) is any good at all, then it is reasonable to assume that there will be more signal than noise in the measurements.
So by re-expressing the data as uncorrelated variables ordered by how much variance they capture, you are in effect partitioning the “Signal” in the data into the first few Principal Components and relegating the “Noise” portion of the data to later components, reducing the number of variables you require for analysis.
In our example above we posit the idea of “Size” being something which would exist in physical measurements about people, which isn’t a great surprise. But what about datasets where you don’t have an intuitive understanding of what properties exist? How can you find out what the real hidden parameters in the dataset are? In other words, what is the signal in the data?
As an example, the Oxford Internet Survey 2013 asked 2000 people in Britain about how they used the internet. It is a monster 27-page survey filled with a variety of questions about people’s attitudes towards the internet.
From the dataset the researchers uncovered four key parameters which could be used to describe people’s attitudes towards the internet, which they termed:
1. Enjoyable escape: providing an enjoyable activity that is a good way to pass time and to escape from day-to-day activities, meet people, and not feel alone;
2. Instrumental efficiency: by making life easier, such as providing ways to save time, for example by finding information quickly;
3. Social facilitation: helping you keep in touch with friends, such as helping people to find information about you, and making it easier to meet people;
4. Problems: such as being frustrating to work with, wasting time with email, creating difficulties in controlling personal information, and exposing people to too much immoral material.
What these in fact are (if you skip to page 61 in the report) are the names they gave to the first four principal components in the data. PCA allowed them to bundle together the answers that correlated, and effectively compress 27 pages’ worth of answers down to just four numbers.
This “compression”, called dimensionality reduction, is possible due to the last of our assumptions, that there is redundancy in the data.
How does PCA actually work?
If you are interested in understanding how the rotation is calculated from the original data in more detail, then I heartily recommend this excellent paper A Tutorial on Principal Component Analysis by Google Research’s Jonathon Shlens. There are lots of other articles available as well, but Jonathon does an excellent job of explaining the entire process and the rationale behind it in an intuitive 10 pages or so.
How do you do it?
Let’s do a worked example in R!
We will be using the inbuilt R function prcomp. (There are alternative functions, including the second in-built function princomp, which uses slightly different maths internally: an eigendecomposition of the covariance matrix rather than the singular value decomposition used by prcomp. For now we’ll stick with the more popular option.)
Ok, so data first. We’ll use the inbuilt dataset mtcars for our example. This is a dataset from “Motor Trend US magazine” back in 1974, and shows data about performance (like fuel consumption) and engine design parameters (like number of cylinders) for a number of different cars.
The data has 32 observations of 11 different variables, which are:
- mpg: Miles per gallon
- cyl: Number of cylinders
- disp: Engine displacement (cubic inches)
- hp: Horsepower
- drat: Rear axle ratio
- wt: Weight (1000 lbs)
- qsec: 1/4 mile time (seconds)
- vs: V-engine or Straight (inline) engine (1 = Straight, 0 = V-Shaped)
- am: Transmission (1 = Manual, 0 = Automatic)
- gear: Number of forward gears
- carb: Number of carburettors
And here it is:
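If you’re following along in R, you can take a quick look yourself:

```r
# mtcars ships with base R, so no extra packages are needed
head(mtcars)

# 32 cars, 11 variables
dim(mtcars)
```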
We will ignore the two categorical variables am and vs for now as PCA works best with numerical data. That leaves us with 32 observations of 9 different variables. We can use the categorical variables later to see what our analysis tells us about engine types and transmissions.
To run the analysis we pass the dataset to prcomp and put the output into example_pca. We are also setting two parameters here to centre and scale the data.
mtcars_no_categorical <- mtcars[, c(1:7, 10, 11)]
example_pca <- prcomp(mtcars_no_categorical, center = TRUE, scale. = TRUE)
Centring the data subtracts the mean of the variable from each individual measurement, making the mean of the variable zero. This doesn’t affect the variance in the data itself, but in the classical calculation for PCA the data is naturally centred as a side-effect of calculating the covariance matrix. prcomp uses a more convenient method of calculation called Singular Value Decomposition, in which centring the data is an additional step. In general you should always leave this on (unless you have centred the data yourself).
Scaling the data rescales the variables such that one ‘unit’ is equivalent to one standard deviation. This is important as not all variables are measured in the same base units, for example in our dataset above the cylinder displacement and horsepower values (disp and hp) are orders of magnitude larger than the values for say, number of cylinders (cyl). In PCA we aren’t interested in the values, but the variance of the values, and if we don’t rescale them, then these variables will be over-represented in the analysis simply because the numbers are larger. Put another way, if we had a dataset with measurements in kilograms and changed the units to milligrams, then if we didn’t scale the variance the variable would suddenly become a million times more important. In general it is a good idea to leave this on.
Both operations together are equivalent to ordinary standardisation.
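You can check that equivalence yourself: running prcomp on data you have already standardised with scale() gives the same scores (up to floating-point error, and possibly a sign flip per component) as letting prcomp do the centring and scaling. A quick sketch:

```r
X <- mtcars[, c(1:7, 10, 11)]   # the nine numeric variables

pca_args   <- prcomp(X, center = TRUE, scale. = TRUE)
pca_manual <- prcomp(scale(X))  # scale() centres and scales up front

# Same scores, up to numerical tolerance (abs() guards against sign flips)
all.equal(abs(pca_args$x), abs(pca_manual$x))
```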
There’s quite a bit of interesting output in the object returned from the PCA.
First we can see the output data itself in the form of the Principal Components. It is found at example_pca$x:
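For example (re-creating the PCA so this snippet stands alone):

```r
# Same analysis as above: nine numeric variables, centred and scaled
example_pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)

# The transformed data: one row per car, one column per Principal Component
head(example_pca$x)
dim(example_pca$x)
```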
Note that we still have 32 observations of 9 different variables, as we did at the start.
We can also see the rotation that was applied to generate the new Principal Components as example_pca$rotation.
And we can look at how the variance in the dataset has been apportioned across the Principal Components, both as a table and visually.
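The table comes from summary(), and base R’s screeplot() gives the visual version (again re-creating example_pca so the snippet is self-contained):

```r
example_pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)

# Table: standard deviation, proportion of variance, cumulative proportion
summary(example_pca)

# Visual: variance captured by each component
screeplot(example_pca, type = "lines", main = "Variance per Principal Component")
```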
So we can see from the Cumulative Proportion row in the table that ~63% of all the variation in the data can be explained by one variable (PC1), ~86% by PC1 and PC2, and so on.
Let’s have a closer look at the first 3 Principal Components, which capture ~92% of the information in the original 9 variables.
We can start by copying back in the categorical variables we excluded earlier. For ease of visualisation we’re also going to collapse them into a single “engine_classification”.
PC_data <- as.data.frame(example_pca$x)
PC_data$transmission <- ifelse(mtcars$am, "Manual", "Automatic")
PC_data$engine <- ifelse(mtcars$vs, "Straight Engine", "V-Shaped Engine")
PC_data$engine_class <- paste(PC_data$transmission, PC_data$engine)
And then we can plot the first three Principal Components, indicating the transmission and engine class.
library(plotly)

engine_class_chart <- plot_ly(PC_data, x = ~PC1, y = ~PC2, z = ~PC3,
                              color = ~engine_class, type = "scatter3d",
                              mode = "markers", text = row.names(PC_data)) %>%
  layout(title = "First three Principal Components indicating engine classification")
embed_notebook(engine_class_chart)
So at this point we have successfully reduced our nine variable dataset to just three variables which do a fairly good job of clustering the data points by engine class. It’s certainly a lot more convenient than trying to plot nine dimensions!
But how can we find out which variables in our original dataset are contributing to which Principal Components?
We can use a biplot to look at a visual representation of the loadings, which can be generated using the biplot function. The factoextra package for R also provides additional methods for looking at visualisations of the PCA output.
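A minimal sketch with base R’s biplot function (factoextra’s equivalents are prettier, but this needs no extra packages):

```r
example_pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)

# Scores for PC1/PC2 plotted as points, variable loadings overlaid as arrows
biplot(example_pca, main = "Biplot of the first two Principal Components")
```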
This biplot is showing two things at the same time. Firstly it is effectively a scatter plot of our data points for PC1 and PC2 (shown in black). Overlaid on top of this are the loadings for the different variables. The loading for each variable is its eigenvector weight scaled by the square root of the corresponding eigenvalue, so the arrows represent where the original variables lie in relation to the principal components when projected onto a two-dimensional surface.
What this plot shows is that hp, cyl, and disp for example, all contribute strongly to a positive value in PC1.
Another approach is to look at the rotation matrix directly and pick out the largest contributing variables; again we can see below that hp, cyl, and disp have large (absolute) eigenvector values for PC1.
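One way to do that programmatically is to sort the first column of the rotation matrix by absolute value (a quick sketch, again re-creating example_pca so it runs on its own):

```r
example_pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)

# Rank the original variables by the size of their weight in PC1;
# the sign only tells you which direction along the component they point
pc1 <- example_pca$rotation[, "PC1"]
sort(abs(pc1), decreasing = TRUE)
```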
As I’ve hopefully illustrated, PCA is a great technique for understanding your data. I hope you have fun applying it to your own datasets.