Rearranging vertical data based on string labels
I have a source of data (as CSV) where all the numerical values are in one column, and another column contains labels describing the data. I need to rearrange the data into separate waves for each label. For example, if this were repeated measurements of devices A and B, the current format would be something like this text wave and numeric wave:
A,1.2
A,1.8
B,1.5
A,2.4
B,1.9
B,2.6
and I would want to have two numeric waves called wave A={1.2,1.8,2.4} and B={1.5,1.9,2.6}. In my actual data set, there may be a couple million lines and many more labels than just two, but this is a small example. The data set is set up for analysis in JMP where this data table manipulation is done simply with a "subset" menu command, but I'm having trouble doing it efficiently in Igor. I'm trying to avoid a big loop over elements.
So far I am using FindDuplicates/RT to determine all the unique labels, then looping through those labels and using Extract to create the new waves. But that was rather slow when using a StringMatch test for the Extract operation. I sped it up a bit by first converting the labels to numbers using StringCRC, then doing the Extract loop, but it still seems more cumbersome than necessary.
Is there an efficient/easy approach to this? Or is there a more clever way to do it using Duplicate/R or some MatrixOP trickery? My working example is below:
Make/T labels={"a","a","b","a","b","b"}
Make values={1.2,1.8,1.5,2.4,1.9,2.6}
FindDuplicates/RT=UniqueLabels labels
Make/O/N=(numpnts(labels))/FREE LabelsCRC=StringCRC(0,labels)
Make/O/N=(numpnts(UniqueLabels))/FREE UniqueLabelsCRC=StringCRC(0,UniqueLabels)
//Loop through the devices (labels) and put the matching data in waves with the label name
variable i=0
variable imax=numpnts(UniqueLabels)
string Wname=""
for(i=0;i<imax;i+=1)
Wname=UniqueLabels[i]
Extract values, $wname, LabelsCRC==UniqueLabelsCRC[i]
endfor
KillWaves UniqueLabels
end
I cannot think of an implicit loop that would replace the explicit for-loop. The only starting point that I can call up that is different is to sort the letter wave with a dependency on the value wave. This puts the data in "blocks" rather than in random order.
January 20, 2020 at 02:04 pm - Permalink
I'm open to sorting the data before processing, and in fact I do have a Sort in place beforehand for my more complex data set could put everything in blocks. I don't think Extract cares if they are ordered in blocks for my described approach, but if the sort would enable a simpler approach (like FindLevels to determine where the CRC-coded labels change value?) then let me know!
January 20, 2020 at 03:43 pm - Permalink
My approach would be to convert the strings into numbers. CRC or any other hash would make it simple. You follow that with a Sort/R for both the converted text wave and your numerical wave. Then use Something like EdgeStats to figure out the boundaries of the groups.
If EdgeStats is not efficient and if you know the number of groups you could sift the data using MatrixOP setNaNs() and then WaveTransform zapNaNs.
I hope this helps,
A.G.
January 20, 2020 at 05:02 pm - Permalink
I think AG's approach is the best you can get except for one part. After you've sorted the CRC values, use Extract to get the unique groups. Something like this (untested)
Extract sortedSourceWave, outWave, (p==0 || sortedSourceWave[p-1] != sortedSourceWave[p])
Depending on what you're going to do with the output, you might want to use the /INDX flag with Extract to get the row indexes with unique values instead of the values themselves.
January 20, 2020 at 05:43 pm - Permalink
... and obviously, when you get IP9 you could simply run the new TextHistogram operation.
January 20, 2020 at 05:45 pm - Permalink