K means clustering on a data matrix
Vinni
I have a data matrix of 551 rows and 35 columns. Rows are time points and columns are elements. I would like to cluster the elements that show similar concentrations, e.g. group together the elements that have high concentrations on specific days. Please help!
Can you be more specific about what your actual goal is? Are you interested in days where the elemental concentrations are generally high, or in days where they are similar (these need not be the same thing)? I'm not sure you really want clustering.
If you just want to know the days with the highest concentrations, you could simply look at the sum of all elements at each time point (here W_Sumrows, plotted against the time wave ts):
display W_Sumrows vs ts
ShowInfo
Then drag the cursor onto the spikes to identify date/time. One problem is that you have 4-5 orders of magnitude difference in maximum concentrations:
Edit M_WaveStats.ld
Is that of concern?
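As a language-neutral illustration of the two points above (the row-sum overview and the spread in per-element maxima), here is a minimal Python sketch on a tiny made-up matrix. The data and the function names `row_sums`, `column_maxima` and `orders_of_magnitude` are hypothetical stand-ins for the PM10 matrix; this is not Igor code.

```python
import math

def row_sums(matrix):
    """Sum all elements (columns) for each time point (row)."""
    return [sum(row) for row in matrix]

def column_maxima(matrix):
    """Maximum concentration of each element (column)."""
    return [max(col) for col in zip(*matrix)]

def orders_of_magnitude(values):
    """log10 of the ratio between the largest and smallest positive value."""
    positive = [v for v in values if v > 0]
    return math.log10(max(positive) / min(positive))

# Hypothetical data: 4 time points (rows) x 3 elements (columns).
data = [
    [1200.0, 3.0, 0.05],
    [5000.0, 1.0, 0.02],
    [800.0, 9.0, 0.01],
    [900.0, 2.0, 0.50],
]

sums = row_sums(data)
peak_row = sums.index(max(sums))      # time point with the highest total
maxima = column_maxima(data)          # per-element maximum
spread = orders_of_magnitude(maxima)  # how many decades the maxima span
```

In this toy example the total is dominated by the first element, so the row sum mostly tracks that one column. That is the reason the spread in maxima matters.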
Another way of simply visualising concentrations is to display your data as an image, e.g.:
ModifyImage PM10 ctab= {*,*,Spectrum,0}
ShowInfo
April 8, 2019 at 01:16 am - Permalink
The variation in magnitude can be handled by using a log scale.
If you decide on some threshold value (representing high concentration), you could simply threshold the image (using ImageThreshold or otherwise) to obtain the membership of the "high" cluster as a function of time.
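A rough sketch of both suggestions in plain Python, assuming a small made-up matrix and an arbitrary threshold of 1000. The helper names `log10_matrix` and `threshold_matrix` and the `floor` parameter are illustrative only and do not correspond to any Igor operation.

```python
import math

def log10_matrix(matrix, floor=1e-12):
    """Log-scale every value; `floor` guards against log10(0)."""
    return [[math.log10(max(v, floor)) for v in row] for row in matrix]

def threshold_matrix(matrix, threshold):
    """1 where a value exceeds the single global threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in matrix]

# Hypothetical data: 3 time points (rows) x 3 elements (columns).
data = [
    [1200.0, 3.0, 0.05],
    [5000.0, 1.0, 0.02],
    [800.0, 9.0, 0.01],
]

log_data = log10_matrix(data)          # values now span a few units, not decades
mask = threshold_matrix(data, 1000.0)  # membership of the "high" class
high_per_time = [sum(row) for row in mask]  # number of high elements per time point
```

With one global threshold applied to the raw values, only the element with the largest concentrations can ever be flagged in this example.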
If you mean something else by clustering please explain.
A.G.
April 8, 2019 at 10:59 am - Permalink
In reply to Can you be more specific on… by ChrLie
Hi ChrLie,
First I want to cluster the whole dataset. I assume that clustering would show me how the elements fall into different classes. Then I want to apply some conditions. My criterion is to cluster elements based on their concentration on an hourly basis, e.g. if Pb, Se and Cu have their highest peak at 02:00, they should go into one group, while the low values of these elements, along with other elements, could go into another class, and so on.
I do not want to find the days with higher concentrations, but to cluster the elements based on high spikes and low concentrations at particular hours.
Thanks a lot!
April 17, 2019 at 02:45 am - Permalink
In reply to The range of magnitude… by Igor
An image plot would not help me much here. I want to see the elemental concentrations on an hourly basis for each day. For that, I would like to put the high- and low-concentration elements into separate classes as a function of the hour, for each day. For example, on one day some elements are very high at 02:00, while on another day they are very low at 02:00 but high at 06:00, and so on.
April 17, 2019 at 02:58 am - Permalink
FWIW, I think the issue here is not clustering (K-Means or otherwise). The issue appears to be one of representation, not so much how the various elements are distributed between clusters. For a data set with 551 time points and 35 elements you could create a wave containing 35 threshold values (one for each element) and then apply these thresholds at all time points. A simple image of the thresholded data will show you the "clustering" of elements above threshold.
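The per-element thresholding described above can be sketched as follows. This is a minimal Python illustration, not Igor code: the element names are taken from the question, while the threshold values, the data and the helper names `above_threshold` and `high_elements` are made up for the example.

```python
def above_threshold(matrix, thresholds):
    """Compare each column against its own threshold; returns a 0/1 matrix."""
    if any(len(row) != len(thresholds) for row in matrix):
        raise ValueError("need exactly one threshold per column")
    return [[1 if v > t else 0 for v, t in zip(row, thresholds)]
            for row in matrix]

def high_elements(mask, names):
    """For each time point, list the elements that are above threshold."""
    return [[n for n, flag in zip(names, row) if flag] for row in mask]

names = ["Pb", "Se", "Cu"]       # one name per column
thresholds = [100.0, 5.0, 0.1]   # hypothetical: one threshold per element

# Hypothetical data: 3 time points (rows) x 3 elements (columns).
data = [
    [250.0, 8.0, 0.30],  # all three elements spike together
    [20.0, 1.0, 0.02],   # everything low
    [30.0, 9.0, 0.01],   # only Se is high
]

mask = above_threshold(data, thresholds)
groups = high_elements(mask, names)
```

Each row of `groups` is then the "high" group for that time point, and everything not listed belongs to the "low" group. Displaying `mask` as an image gives the same information at a glance.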
April 18, 2019 at 04:44 pm - Permalink
In reply to FWIW, I think the issue here… by Igor
I tried that. The command ImageThreshold/T=2000 PM10 worked.
But I am not able to use a wave that contains 35 threshold values. I get an "incompatible dimensions" error when I execute the command:
ImageThreshold/W=Pmth PM10
where Pmth is the threshold wave and PM10 is the data matrix. The Pmth wave has dimensions 1 x 35.
Could you please help me with the code, step by step?
April 23, 2019 at 03:14 am - Permalink
Please note the documentation for the flag /W; you need to have pairs of values. Your wave apparently has an odd number of values. Also, it needs to be a 1D wave.
April 23, 2019 at 07:50 am - Permalink