Working with large datasets
tooprock
I have been working on a data analysis routine which reads data from *.csv files, performs several statistical calculations on it, and generates some useful, easy-to-understand scientific outputs. One part of the routine handles 3D data containing many zeros and/or NaNs. At the end I replace the zeros with NaNs and exclude the NaNs using the WaveTransform operation. However, the data is so large that my computer doesn't have enough memory to keep hundreds of 3D waves with thousands of entries. Could someone have a look at the code and give some programming clues which might solve this issue? I know it seems complicated and confusing. I would appreciate any helpful comment.
Cheers,
Emre
// This part chooses csv files one by one and calls another function to analyse them.
Function MyFunctionforFIA()

	if( DataFolderExists("root:RDDataFolder") == 1 )
		SetDataFolder root:RDDataFolder
	else
		print "Error! Please first load some data or press reset to delete previous data"
		abort
	endif

	// Create SizeBinList and use it in AnalyseWIBSFluorescence()
	Make size_bins1 = {0.5,0.5314,0.5647,0.6001,0.6377,0.6776,0.7201,0.7652,0.8132,0.8642,0.9183,0.9759,1.0371,1.102}
	Make size_bins2 = {1.1711,1.2445,1.3225,1.4054,1.4935,1.5871,1.6866,1.7923,1.9046,2.024,2.1508,2.2856,2.4288,2.5811}
	Make size_bins3 = {2.7824,2.9147,3.0974,3.2915,3.4978,3.7171,3.95,4.1976,4.4607,4.7402,5.0373,5.353,5.6885,6.0451}
	Make size_bins4 = {6.4239,6.8265,7.2544,7.709,8.1922,8.7056,9.2512,9.8311,10.4472,11.102,11.7978,12.5372,13.3229,14.1579,15.0453,15.9882,16.9903,18.0551,20}

	Variable ssb1, ssb2, ssb3, ssb4, ssizebinlist
	ssb1 = numpnts(size_bins1)
	ssb2 = numpnts(size_bins2)
	ssb3 = numpnts(size_bins3)
	ssb4 = numpnts(size_bins4)
	ssizebinlist = ssb1+ssb2+ssb3+ssb4
	Make/O/N=(ssizebinlist) SizeBinList
	Concatenate/NP/O {size_bins1,size_bins2,size_bins3,size_bins4}, SizeBinList
	KillWaves size_bins1, size_bins2, size_bins3, size_bins4

	SetDataFolder root:RDDataFolder

	// Build the list of data folders, one per loaded csv file
	Variable k
	String DataFolderList = ""
	for( k = 0; k < CountObjects("", 4); k += 1 )
		DataFolderList += GetIndexedObjName("", 4, k) + ":;"
	endfor

	// Calculate the time required to analyse one file
	Variable refnumTimer, timeElapsed, timeasprogress

	// Read strings from the list and set the data folder accordingly
	Variable numDataFolders = ItemsInList(DataFolderList)
	Variable i
	NVAR perc = root:perc
	Variable progress = 0
	ControlUpdate/W=toolKIT/A
	for( i = 0; i < numDataFolders; i += 1 )
		refnumTimer = startMSTimer
		String nameCurDF = "root:RDDataFolder:" + StringFromList(i, DataFolderList)
		SetDataFolder nameCurDF
		AnalyseWIBSFluorescence(i)						// main function which performs the analysis
		SetDataFolder nameCurDF
		timeElapsed = stopMSTimer(refnumTimer)
		timeasprogress = timeElapsed / 1000000			// elapsed time in seconds
		//print "Last file took " + num2str(timeasprogress) + " seconds"	// shows how long one file takes to analyse
		progress += 1
		perc = (progress / numDataFolders) * 100
		ControlUpdate/W=toolKIT progress
		// change progress color to green
		ValDisplay progress, win=toolKIT, highColor=(26112,0,10240)
		KillDataFolder/Z nameCurDF
	endfor

	// make the destination wave and append the per-file results
	SetDataFolder root:RDDataFolder:
	Make/O/N=(0, ssizebinlist-1, 3) SizeResFluW
	AppendFluoreWaves("", "", 0)		// append results from the child data folders (defined below)
	//MoveResultingData()
	NewDataFolder/O/S root:FLData
	if( !waveexists(root:FLData:SizeResFluW) )
		MoveWave root:RDDataFolder:SizeResFluW, root:FLData:
		if( !waveexists(root:FLData:SizeBinList) )
			MoveWave root:RDDataFolder:SizeBinList, root:FLData:
		endif
	endif
End

// This part deals with time binning and decides on the selection criteria.
// Its output is used by another function to perform the size binning.
Function AnalyseWIBSFluorescence(whichData)
	Variable whichData

	Variable i, j
	DFREF homeDFR = GetDataFolderDFR()

	// wave references to global waves
	Wave dateTimeWave = homeDFR:DateTimeWave
	Wave sizeBinList = root:RDDataFolder:sizeBinList
	Wave particleSizes = homeDFR:ParSize
	Wave paramSet0 = homeDFR:Pwr_280
	Wave paramSet1 = homeDFR:Pwr_370
	Wave paramSet2 = homeDFR:FL1_280
	Wave paramSet3 = homeDFR:FL2_280
	Wave paramSet4 = homeDFR:FL2_370
	NVAR Threshold1 = root:FTDataFolder:Threshold1
	NVAR Threshold2 = root:FTDataFolder:Threshold2
	NVAR Threshold3 = root:FTDataFolder:Threshold3
	String suff
	//print Threshold1, Threshold2, Threshold3		// for debugging only

	// Each type is potentially composed of criteria from multiple parameters. List the waves
	// containing those parameters in ";"-separated strings. Here five parameter waves define three types.
	Make /FREE /T /N=3 typeParamNames
	typeParamNames[0] = NameOfWave(paramSet0) + ";" + NameOfWave(paramSet2)	// FL1_280 type (Intensity[FL1_280] > Threshold && Power_280 > 200)
	typeParamNames[1] = NameOfWave(paramSet0) + ";" + NameOfWave(paramSet3)	// FL2_280 type (Intensity[FL2_280] > Threshold && Power_280 > 200)
	typeParamNames[2] = NameOfWave(paramSet1) + ";" + NameOfWave(paramSet4)	// FL2_370 type (Intensity[FL2_370] > Threshold && Power_370 > 200)

	// Set the minimum and maximum values. If there are different numbers of criteria for different
	// types, set the column size to the biggest (in terms of # of criteria) type.
	Make /FREE /D /N=(3, 3) minCriteria
	Make /FREE /D /N=(3, 3) maxCriteria

	// Set the minimum criteria according to the parameter combinations specified in typeParamNames.
	// (MinFailPower and minParSize are defined elsewhere; not shown in this post.)
	// FL1_280 criteria
	minCriteria[0][0] = MinFailPower
	maxCriteria[0][0] = inf
	minCriteria[0][1] = Threshold1
	maxCriteria[0][1] = inf
	minCriteria[0][2] = minParSize
	maxCriteria[0][2] = inf
	// FL2_280 criteria
	minCriteria[1][0] = MinFailPower
	maxCriteria[1][0] = inf
	minCriteria[1][1] = Threshold2
	maxCriteria[1][1] = inf
	minCriteria[1][2] = minParSize
	maxCriteria[1][2] = inf
	// FL2_370 criteria
	minCriteria[2][0] = MinFailPower
	maxCriteria[2][0] = inf
	minCriteria[2][1] = Threshold3
	maxCriteria[2][1] = inf
	minCriteria[2][2] = minParSize
	maxCriteria[2][2] = inf

	// make the final output wave
	Variable nRanges = DimSize(dateTimeWave, 0)
	Make /O/D/N=(nRanges, numpnts(sizeBinList)-1, 3) homeDFR:SizeResFluW
	Wave outWave = homeDFR:SizeResFluW

	Make /FREE /N=(3) currMins		// sized to the # of criteria in the largest type
	Make /FREE /N=(3) currMaxs		// sized to the # of criteria in the largest type

	// Build the output wave. fluoreSizeBin() takes several inputs and fills one element of the
	// output, which is later analysed to find particle counts, distributions, ratios, etc.
	for( i = 0; i < nRanges; i += 1 )
		for( j = 0; j < 3; j += 1 )
			currMins = minCriteria[j][p]
			currMaxs = maxCriteria[j][p]
			fluoreSizeBin(sizeBinList, particleSizes, currMins, currMaxs, typeParamNames[j], outWave, i, j)
		endfor
	endfor

	// Move the resulting wave into its own child data folder
	SetDataFolder root:RDDataFolder
	suff = "DF_" + num2str(whichData)
	suff = CleanupName(suff, 0)
	NewDataFolder /O /S $suff
	MoveWave homeDFR:SizeResFluW, :		// ":" = the new, now current, data folder
End

// This part deals with size binning according to the information generated by the function above.
Function fluoreSizeBin(binBoundsWave, parSizeWave, criteriaMin, criteriaMax, dataWaveNames, destinationWave, rowIn, layIn)
	Wave binBoundsWave, parSizeWave
	Wave criteriaMin, criteriaMax		// waves of min and max values; must be the same length as each other and as dataWaveNames
	String dataWaveNames
	Wave destinationWave
	Variable rowIn, layIn

	Variable nWaves = ItemsInList(dataWaveNames)
	Variable j
	Variable satisfied = 1
	for( j = 0; j < nWaves; j += 1 )
		Wave currDataWave = $(StringFromList(j, dataWaveNames))
		// index the criteria by criterion (j), not by layer
		if( currDataWave[rowIn] < criteriaMin[j] || currDataWave[rowIn] > criteriaMax[j] )
			satisfied = 0
			//print currDataWave[rowIn]
		endif
	endfor
	if( satisfied )
		FindLevel /Q/P binBoundsWave, parSizeWave[rowIn]
		if( !V_flag )
			destinationWave[rowIn][floor(V_LevelX)][layIn] = currDataWave[rowIn]
		endif
	endif
End

// This part appends all resulting data into one single wave. The output is a 3D wave which
// may contain millions of data points. Apparently this is where I got stuck.
Function AppendFluoreWaves(parentDF, DFpattern, killChildDFs, [sortedAppend, sortOptions, modelDF])
	String parentDF, DFpattern
	Variable killChildDFs
	Variable sortedAppend
	Variable sortOptions
	String modelDF

	// get the list of child DFs matching DFpattern
	if( strlen(parentDF) == 0 )
		parentDF = GetDataFolder(1)
	endif
	String childDFs = ListMatch(ChildDFList(parentDF), DFpattern + "*")		// ChildDFList is a helper function not shown in this post
	if( sortedAppend )		// default is 0 = false
		childDFs = SortList(childDFs, ";", sortOptions)		// default is 0 = ascending case-sensitive alphabetic ASCII sort
	endif
	Variable numChildren = ItemsInList(childDFs)	// count the "good" children
	if( numChildren == 0 )							// ...none -- abort
		Abort "Please first load some data to analyse"
	endif

	// build WList from the waves in modelDF or in parentDF
	String WList, saveDF = GetDataFolder(1)
	if( !ParamIsDefault(modelDF) )		// use modelDF if specified
		SetDataFolder modelDF
		WList = WaveList("*", ";", "")
	else
		SetDataFolder parentDF			// default to parentDF
		WList = WaveList("*", ";", "")
	endif
	Variable numWaves = ItemsInList(WList)	// count the waves to append to
	if( numWaves == 0 )						// ...none -- abort
		Abort "AppendFluoreWaves found no waves to append to in data folder \"" + GetDataFolder(1) + "\""
	endif

	Variable k, m, childDataAppended
	String childDF, currentWave, sourceWave, destWave

	// cycle through the child DFs
	for( k = 0; k < numChildren; k += 1 )
		childDF = StringFromList(k, childDFs)
		SetDataFolder childDF
		childDataAppended = 0		// don't kill child DFs that contained no data to append
		// cycle through WList
		for( m = 0; m < numWaves; m += 1 )
			currentWave = StringFromList(m, WList)
			sourceWave = childDF + currentWave
			if( exists(sourceWave) == 1 )		// not all waves in a child data folder may be relevant
				destWave = parentDF + currentWave
				//if( dimSize($sourceWave, 0) > 1 )
				//if( dimSize($sourceWave, 2) > 0 )
				Concatenate/NP=0 sourceWave + ";", $destWave
				//endif
				//endif
				childDataAppended = 1
			endif
		endfor
		SetDataFolder parentDF		// can't kill the current data folder, so step out of it first
		if( killChildDFs && childDataAppended )
			KillDataFolder childDF
		endif
	endfor
	SetDataFolder saveDF
End

// There is also one other function, not shown here, which performs the statistical analysis
// on the main output wave (the 3D wave).
Does that mean you get an "out of memory" error message, or what are the symptoms?
How large are those 3D waves? Are they double precision? Could they be in single precision?
May 6, 2013 at 08:31 am - Permalink
The 3-D matrices are created in fluoreSizeBin(...). Why again do you want to turn them back into 1-D waves with AppendFluoreWaves(...)?
Where are you replacing 0's by NaN's? I don't see that in your code. Have you thought about first transforming the 0's to NaN's in the CSV files using a text editor?
--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAHuntsville
May 6, 2013 at 09:29 am - Permalink
First of all, thanks for the comments. I will try to explain why and how I do those calculations. Firstly, the routine works great for small datasets.
1) I get an "out of memory" error when I run the routine with large datasets.
2) I can't change the zeros to NaNs in the csv files, because those zeros and/or NaNs are the output of several calculations. That is, the original data may or may not contain zeros and/or NaNs. However, this is not directly related to the output data, which is a product of several csv files.
3) I don't convert them from 3-D matrices into 1-D waves. I just append the separate outputs (one resulting wave from each csv file) into a single wave. However, I have to decompose the 3-D wave into 1-D waves to be able to use the WaveStats, StatsQuantiles or mean functions; they work only on 1-D waves.
4) Only the date/time waves have to be in double precision, but the others can be single precision.
5) I replace the zeros with NaNs in a separate function which I didn't include here. That function handles only the statistical analysis.
6) By "large" I mean a 3-D matrix consisting of almost 10 million rows, 60 columns and 3 layers. There are also single waves with almost the same number of data points. When I analyse only 60 csv files (around 500,000 rows, 60 columns, 3 layers) the total memory used reaches 3.4 GB.
So, you suggest using single precision waves instead of double precision to save some memory.
Some more information concerning how the entire toolkit runs:
i) The user chooses one or more csv files. These are read and saved into separate data folders. In each data folder, the data points read from the csv file are saved in separate waves (each column, or let's say each parameter, as a new wave).
ii) The routine works through the data folders one by one, and the results are written into the current data folder to prevent overwriting.
iii) After that, the resulting waves from the separate data folders are appended into single waves and the remainder (the data folders containing the raw file data) is deleted.
Cheers,
Emre
May 7, 2013 at 05:32 am - Permalink
Well, a single-precision wave of the dimensions you quote (500,000 rows x 60 columns x 3 layers) is 360 MB,
and a double-precision one is 720 MB.
Depending on what the data are and the precision you need, that could be a possibility. Maybe even integer waves... it really depends on what you are working on. Otherwise have a look at
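The two Make commands being compared would be something like the following; the wave name is a placeholder, and the sizes are the ones you quoted for 60 files:

Make/O/S/N=(500000, 60, 3) bigData	// single precision: 500000*60*3*4 bytes = 360 MB
Make/O/D/N=(500000, 60, 3) bigData	// double precision: 500000*60*3*8 bytes = 720 MB

An existing double-precision wave can also be converted in place with Redimension/S bigData, which halves its memory footprint.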
May 7, 2013 at 06:56 am - Permalink
May 7, 2013 at 08:02 am - Permalink
That's 1.8e9 data points. If it is single precision data, that's 7.2e9 bytes, which is more than twice the address space available to a 32-bit program. And your sequence of steps produces several waves and copies of waves. Within the 3-GByte address space available to a 32-bit program on 32-bit Windows, you also have to load Igor itself, have memory for graphs, etc.
If you have a 64-bit Windows, you have 4 GBytes available, which still isn't enough. But if you have 64-bit Windows you could use Igor64. Go to http://www.wavemetrics.net/ and look for the link to download the Igor64 installer. Be sure to read the notes about installing Igor64.
John Weeks
WaveMetrics, Inc.
support@wavemetrics.com
May 7, 2013 at 09:55 am - Permalink
Cheers,
Emre
May 14, 2013 at 04:55 am - Permalink
For example, say you are interested in the mean and standard deviation of the data:
1) Open the file using the Igor Open operation.
2) Read in, e.g., 10000 lines of data and parse them into values.
3) For the variables you are interested in, calculate the number of values (should be 10000, but if you are discounting NaN/Inf it may be smaller), the sum of the values and the sum of the squared values. From the 2nd iteration onwards you need to accumulate those sums.
4) Go to 2) until you've read all the data.
5) Calculate the statistics you want from the counts, the sum of the values and the sum of the squared values.
The point is that you never need to have the whole dataset in memory, just a subset; a minimal sketch of the idea follows.
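Something like this, assuming a plain-text csv read one line at a time (the function name, the path argument and the single-column parsing are illustrative only, not part of the original routine):

Function ChunkedColumnStats(pathStr, colIndex)
	String pathStr		// full path to the csv file
	Variable colIndex	// which csv column to accumulate

	Variable refNum, v, n = 0, sumX = 0, sumX2 = 0
	String lineStr

	Open/R refNum as pathStr
	do
		FReadLine refNum, lineStr
		if( strlen(lineStr) == 0 )
			break						// end of file
		endif
		v = str2num(StringFromList(colIndex, lineStr, ","))
		if( numtype(v) == 0 )			// discount NaN/Inf, as in step 3
			n += 1
			sumX += v
			sumX2 += v*v
		endif
	while( 1 )
	Close refNum

	Variable avg = sumX / n
	Variable sdev = sqrt((sumX2 - n*avg*avg) / (n - 1))		// sdev from the running sums
	printf "N = %g, mean = %g, sdev = %g\r", n, avg, sdev
End

The same pattern extends to any statistic that can be built from running sums.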
May 14, 2013 at 11:16 am - Permalink
May 15, 2013 at 12:24 am - Permalink
May 15, 2013 at 06:32 am - Permalink
ImageStats is not ideal for per-column calculations. Depending on the stats you need it may be far more efficient to use WaveStats with /R=[...] and possibly with /M=1.
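For example, per-column WaveStats might look like this (the wave name and sizes are placeholders). As I understand it, Igor stores multidimensional data with the row index varying fastest, so a single column is a contiguous point range:

Make/O/N=(1000, 60) data2D = gnoise(1)			// stand-in for one layer of the real data
Variable c = 5, nRows = DimSize(data2D, 0)		// column of interest
WaveStats/Q/M=1/R=[c*nRows, (c + 1)*nRows - 1] data2D	// /M=1 restricts the computed stats
print V_avg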
A.G.
WaveMetrics, Inc.
May 15, 2013 at 10:11 am - Permalink
ImageStats with the /G flag lets you pick a plane and then restrict your region of interest to a subset of that plane. All the typical values are returned for this ROI, as with WaveStats, and you can give it /M=1 if you are only interested in V_avg and the local max and min. I just mentioned it because you can use it straightforwardly, rather than copying large sections of your data out, which seemed to be one of your problems regarding memory usage.
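A sketch of that usage (the wave name and sizes are placeholders, and I am assuming the /G rectangle is given as {startRow, endRow, startCol, endCol}; check the ImageStats help for the exact order):

Make/O/N=(1000, 60, 3) data3D = gnoise(1)		// stand-in for the real 3D result wave
Variable c = 5									// column of interest
ImageStats/M=1/P=0/G={0, DimSize(data3D,0)-1, c, c} data3D	// stats of column c in layer 0, no ROI wave, no copying
print V_avg, V_max, V_min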
May 16, 2013 at 06:05 am - Permalink
I am aware of this possibility (I wrote it). I'm actually impressed that you found this relatively obscure feature. I still think that WaveStats/M=1 would execute faster. FWIW, this ImageStats feature was added to support ImageStats over rectangular areas without using ROI. In general, using ROI is less efficient because each pixel in the ROI wave is scanned at least once.
A.G.
WaveMetrics, Inc.
May 16, 2013 at 12:37 pm - Permalink
So using the /G flag won't scan the pixels more than once, right? Seems good. I haven't used it on enormous datasets, but on large TIFF files it seems to execute as quickly as WaveStats, and at least now he wouldn't have to copy the data into a suitable 1D format for WaveStats.
At any rate, I didn't realize this was obscure. WaveStats is for 1D data and the documentation sends you straight to ImageStats for higher dimensionalities, and this command sits right in the middle and seems to emulate WaveStats, which is probably why a lot of users end up there in the first place, after trying to use WaveStats on the wrong kind of wave. Glad you put it in there!
May 20, 2013 at 03:18 am - Permalink
You can only find this feature if you read the documentation -- it is not available as a menu or dialog selection. Even if you come across this flag's documentation, it is not immediately obvious why it is more efficient than using an ROI.
May 20, 2013 at 10:29 am - Permalink