Principal Component Analysis (PCA)
Dear all,
I am faced with the small task of identifying, from a set of 20 parameters extracted from a series of events, which parameters are dominant in describing the variation between events and which tend to be irrelevant.
For that I looked through the Scatter Plot Matrix Demo, which gives a neat overview of correlations in the data, but what I need is an analytical measure that I can apply to several datasets in which we've perturbed our system, so that I can compare across those datasets. From what I've seen, PCA is what I should be looking at.
In the meantime I've read through this excellent primer on PCA by Jonathon Shlens, and I got one of the books referenced in Igor's help section: "Factor Analysis in Chemistry" by E. R. Malinowski (3rd edition), which I have begun reading.
Since I'm new to PCA, I have the impression this might take some time. If anyone has experience using PCA in Igor to identify the two main components in this kind of analysis, please leave me your thoughts.
Many thanks,
R.
Hello R.,
You may want to take a quick look at the PCA demo (File Menu->Example Experiments->Analysis->PCADemo). Note that Igor also has an ICA operation, but the PCA operation is probably what you want.
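In case it is useful, here is a minimal sketch of the kind of input the demo expects: a 2D matrix with one row per event and one column per parameter. The wave names (param0 ... param19, dataMatrix) are placeholders for your own waves.
Concatenate/O {param0, param1, param2}, dataMatrix	// add the remaining parameter waves up to param19; each becomes one column
MatrixOP/O centered = subtractMean(dataMatrix, 1)	// remove each column's mean before looking at covariances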
A.G.
WaveMetrics, Inc.
July 19, 2018 at 01:17 pm
Hey A.G.,
Thanks for the note. Yes, I did have a look at that example, and I also ran my data through it.
Please have a look at the attached image; it shows what I have done so far.
To me it seems I might have a lot of redundancy in my data. Is that a fair conclusion? However, I am still not sure how to go from the observation that some parameters are redundant to identifying exactly which parameters they are. Could you give me some insight?
Many thanks!!
R.
July 20, 2018 at 07:55 am
Hello R.,
If I only look at the eigenvalues, it seems that you have only two significant factors that together account for 99.8% of the variance.
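For reference, that percentage comes straight from the eigenvalues. Assuming they are stored largest-first in a wave, say eigVals (substitute whatever wave your PCA run produced), the fraction carried by the first two factors is:
Print (eigVals[0] + eigVals[1]) / sum(eigVals)	// fraction of total variance in the first two factors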
I think it is important to note that even if you determine that there are only two important factors, these may not map directly onto individual input parameters. Since I am not familiar with the details of your application, let me try to use an example from physics: suppose you have a bunch of waves containing an xyz distribution of points in space, you compute the PCA, and you find that 99.85% of the variation is explained by two factors. That would imply that your distribution is pretty much planar. To determine the orientation of this plane in 3D space you need to look at the first two eigenvectors of the solution.
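To make the analogy concrete, here is a rough sketch of that example (the names are made up, and I am quoting the MatrixEigenV flags and output wave names from memory): it builds a nearly planar cloud of points and checks the eigenvalues of its covariance matrix.
Function PlanarCloudDemo()
	Variable N = 1000
	Make/O/D/N=(N,3) xyz
	xyz[][0] = gnoise(1)			// in-plane spread along x
	xyz[][1] = gnoise(1)			// in-plane spread along y
	xyz[][2] = 0.01*gnoise(1)		// tiny out-of-plane noise
	MatrixOP/O centered = subtractMean(xyz, 1)			// remove the column means
	MatrixOP/O covMat = (centered^t x centered)/(N-1)	// 3x3 covariance matrix
	MatrixEigenV/SYM/EVEC covMat	// eigenvalues in W_eigenValues, eigenvectors in M_eigenVectors
	WAVE W_eigenValues
	Print W_eigenValues				// expect two comparable eigenvalues and one close to zero
End
The eigenvector belonging to the near-zero eigenvalue is the normal of the plane; the other two span it. Looking at your first two eigenvectors is the same idea in your 20-dimensional parameter space.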
If you carry this analogy to your application, you might want to look at the eigenvectors and your inputs. In complicated cases it may be helpful to compute the projections (dot product) of your input vectors on the first two eigenvectors so you get a sense of what's meaningful and what is noise.
A.G.
July 20, 2018 at 04:24 pm
Hi A.G.,
Many thanks for the input. Indeed, the events I included in this analysis are from a single "run" of measurements (i.e. they refer to a single subject being observed). Essentially, I am looking at the same subject at different instances, so the fact that two significant factors account for almost 99.85% of the variance likely suggests that whenever the subject is stimulated, the reaction tends to be fairly homogeneous with respect to most of the parameters analyzed. This is something I anticipated from the normalized distributions of each parameter. The situation will be different once I feed PCA with averages from different runs of the experiment that include different subjects. So, to me, these preliminary results seem to be going in the right direction.
The part I am mostly struggling with right now is converting this variance information into PCA scores for each of the parameters. Would you be able to help me with this part?
Many thanks,
Ricardo
July 23, 2018 at 12:15 am
Hello Ricardo,
My approach is to think of the eigenvectors as a set of orthonormal vectors that span your data space. In your example, we determined from the data that we only care about two eigenvectors (say e0 and e1). To represent your data using these two new axes, you compute the dot product (projection) of your parameter columns onto them. For this to make sense, you first need to "standardize" each parameter column by subtracting its mean and dividing by its standard deviation.
If you denote the standardized columns by P_i, you are effectively creating a new representation of your data as P_i = c0i*e0 + c1i*e1, where the coefficients c0i and c1i are the projections of P_i onto the respective eigenvectors.
You can use MatrixOP to perform these calculations. For example, to standardize a 1D parameter wave w you can execute
MatrixOP/O P_0=normalize(subtractMean(w,1))	// subtract the mean of w, then scale the result to unit norm
To calculate the projections onto the two eigenvectors:
MatrixOP/O c00=e0.P_0	// projection onto the first eigenvector
MatrixOP/O c10=e1.P_0	// projection onto the second eigenvector
which gives you {c00,c10} as the representation of the first parameter column in the new space.
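If it helps, here is a rough sketch that simply repeats those two steps for every parameter column of a 2D data matrix (events in rows, parameters in columns) and collects the {c0,c1} pairs. The wave and function names are placeholders, and e0 and e1 must have the same number of points as one parameter column.
Function ProjectAllParameters(data, e0, e1)
	Wave data, e0, e1					// data: events x parameters; e0, e1: first two eigenvectors
	Variable nParams = DimSize(data, 1)
	Make/O/D/N=(nParams, 2) paramScores	// one row of {c0, c1} per parameter
	Variable i
	for(i = 0; i < nParams; i += 1)
		MatrixOP/O stdCol = normalize(subtractMean(col(data, i), 1))	// standardize parameter column i
		MatrixOP/O proj0 = e0 . stdCol
		MatrixOP/O proj1 = e1 . stdCol
		WAVE proj0, proj1
		paramScores[i][0] = proj0[0]
		paramScores[i][1] = proj1[0]
	endfor
End
Plotting column 1 of paramScores against column 0 gives one point per parameter, which makes it easier to see which parameters follow the two significant factors and which mostly contribute noise.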
I hope this helps,
A.G.
July 24, 2018 at 02:12 pm