Working with large datasets
tooprock
I have been working on a data analysis routine which reads data from *.csv files, performs several statistical calculations on it, and generates some useful, easy-to-understand scientific outputs. One part of the routine handles 3D data containing many zeros and/or NaNs. At the end I replace zeros with NaNs and exclude the NaNs using the WaveTransform operation. However, the data are so large that my computer doesn't have enough memory to keep hundreds of 3D waves with thousands of entries. Could someone have a look at the code and give some programming clues which might solve this issue? I know it seems complicated and confusing. I would appreciate any helpful comment.
Cheers,
Emre
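(For context, the zero-to-NaN replacement and NaN removal described above would typically look something like this in Igor; a minimal sketch with a made-up wave. Note that WaveTransform's zapNaNs keyword works on 1D waves, which is why the 3D results end up being decomposed later in the thread.)

```igor
// Replace zeros with NaN, then strip the NaN points out in place.
// dataW is a hypothetical 1D wave used only for illustration.
Make/O/N=10 dataW = {1, 0, 3, 0, 5, 6, 0, 8, 9, 0}
dataW = (dataW == 0) ? NaN : dataW     // zeros -> NaN
WaveTransform/O zapNaNs, dataW         // remove NaN points (1D waves only)
```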
// This part chooses csv files one by one and calls another function to analyse them.
Function MyFunctionforFIA()
if(DataFolderExists("root:RDDataFolder")==1)
SetDataFolder root:RDDataFolder
else
print "Error! Please first load some data or press reset to delete previous data"
abort
endif
// Create sizeBinList and use in AnalyseWIBSFiles()
// make sizeBinList
Make size_bins1 = {0.5,0.5314,0.5647,0.6001,0.6377,0.6776,0.7201,0.7652,0.8132,0.8642,0.9183,0.9759,1.0371,1.102}
Make size_bins2 = {1.1711,1.2445,1.3225,1.4054,1.4935,1.5871,1.6866,1.7923,1.9046,2.024,2.1508,2.2856,2.4288,2.5811}
Make size_bins3 = {2.7824,2.9147,3.0974,3.2915,3.4978,3.7171,3.95,4.1976,4.4607,4.7402,5.0373,5.353,5.6885,6.0451}
Make size_bins4 = {6.4239,6.8265,7.2544,7.709,8.1922,8.7056,9.2512,9.8311,10.4472,11.102,11.7978,12.5372,13.3229,14.1579,15.0453,15.9882,16.9903,18.0551,20}
Variable ssb1, ssb2, ssb3, ssb4, ssizebinlist
ssb1 = numpnts(size_bins1)
ssb2 = numpnts(size_bins2)
ssb3 = numpnts(size_bins3)
ssb4 = numpnts(size_bins4)
ssizebinlist = ssb1+ssb2+ssb3+ssb4
Concatenate/NP/O {size_bins1,size_bins2,size_bins3,size_bins4}, SizeBinList // /O creates SizeBinList; no separate Make needed
KillWaves size_bins1, size_bins2, size_bins3, size_bins4
SetDataFolder root:RDDataFolder
// Choose one file to analyse
Variable k
String DataFolderList = ""
for( k = 0; k < CountObjects("",4); k += 1)
DataFolderList += GetIndexedObjName("", 4, k)+":;"
endfor
// Calculate time required to analyse one file
Variable refnumTimer, timeElapsed, timeasprogress
// Read strings from list and set data folder accordingly
Variable numDataFolders = ItemsInList(DataFolderList)
Variable i
NVAR perc = root:perc
Variable progress = 0
ControlUpdate/W=toolKIT/A
for(i=0; i<numDataFolders; i +=1)
refnumTimer = startMSTimer
String nameCurDF = "root:RDDataFolder:" + StringFromList(i, DataFolderList)
SetDataFolder nameCurDF
AnalyseWIBSFluorescence(i) // Main function which performs analysis
SetDataFolder nameCurDF
timeElapsed = stopMSTimer(refnumTimer)
timeasprogress = timeElapsed/1000000 // elapsed time as seconds
//print "Last file took " + num2str(timeasprogress) + " seconds" // We can see how long it takes to analyse a file
progress += 1
perc = (progress / numDataFolders) * 100
ControlUpdate/W=toolKIT progress
// change progress color to green.
ValDisplay progress, win=toolKIT, highColor=(26112,0,10240)
KillDataFolder/Z $nameCurDF // $ is needed: nameCurDF is a string holding the folder path
endfor
// make destination waves
SetDataFolder root:RDDataFolder:
Make/O/N=(0,ssizebinlist-1,3) SizeResFluW
AppendFluoreWaves("", "DF_", 0) // call the function defined below; child folders are named DF_<n>
//MoveResultingData()
NewDataFolder/O/S root:FLData
if(!WaveExists(root:FLData:SizeResFluW))
MoveWave root:RDDataFolder:SizeResFluW, root:FLData:
endif
if(!WaveExists(root:FLData:SizeBinList)) // separate test: if nested, SizeBinList is never moved once SizeResFluW exists
MoveWave root:RDDataFolder:SizeBinList, root:FLData:
endif
End
// This part deals with time binning and decides on selection criteria. Output is used by another function to perform size binning
Function AnalyseWIBSFluorescence(whichData)
Variable whichData
Variable i, j
DFREF homeDFR = GetDataFolderDfr()
// wave references to global waves
Wave dateTimeWave = homeDFR:DateTimeWave
Wave sizeBinList = root:RDDataFolder:sizeBinList
Wave particleSizes = homeDFR:ParSize
Wave paramSet0 = homeDFR:Pwr_280
Wave paramSet1 = homeDFR:Pwr_370
Wave paramSet2 = homeDFR:FL1_280
Wave paramSet3 = homeDFR:FL2_280
Wave paramSet4 = homeDFR:FL2_370
NVAR Threshold1 = root:FTDataFolder:Threshold1
NVAR Threshold2 = root:FTDataFolder:Threshold2
NVAR Threshold3 = root:FTDataFolder:Threshold3
// MinFailPower and minParSize are used below but never declared; they must be
// globals, e.g. (assumed location -- adjust to wherever they actually live):
NVAR MinFailPower = root:FTDataFolder:MinFailPower
NVAR minParSize = root:FTDataFolder:minParSize
String suff
// print Threshold1, Threshold2, Threshold3 for debugging only
// each type is potentially composed of criteria from multiple parameters. List the waves containing those parameters in ";" strings
// each type is defined by criteria drawn from these five parameter waves
Make /FREE /T /N=3 typeParamNames
typeParamNames[0] = NameOfWave(paramSet0)+";"+NameOfWave(paramSet2) // FL1_280 Type (Intensity[FL1_280] > Threshold && Intensity[Power_280 > 200])
typeParamNames[1] = NameOfWave(paramSet0)+";"+NameOfWave(paramSet3) // FL2_280 Type (Intensity[FL2_280] > Threshold && Intensity[Power_280 > 200])
typeParamNames[2] = NameOfWave(paramSet1)+";"+NameOfWave(paramSet4) // FL2_370 Type (Intensity[FL2_370] > Threshold && Intensity[Power_370 > 200])
// set the minimum and maximum values - if there are different numbers of criteria for different types set the column size to the biggest (in terms of # of criteria) type
Make /FREE /D /N=(3, 3) minCriteria
Make /FREE /D /N=(3, 3) maxCriteria
// set a bunch of arbitrary minimum criteria according to the parameter combinations specified in typeParamNames
// FL1_280 criteria
minCriteria[0][0] = MinFailPower
maxCriteria[0][0] = inf
minCriteria[0][1] = Threshold1
maxCriteria[0][1] = inf
minCriteria[0][2] = minParSize
maxCriteria[0][2] = inf
// FL2_280 criteria
minCriteria[1][0] = MinFailPower
maxCriteria[1][0] = inf
minCriteria[1][1] = Threshold2
maxCriteria[1][1] = inf
minCriteria[1][2] = minParSize
maxCriteria[1][2] = inf
// FL2_370 criteria
minCriteria[2][0] = MinFailPower
maxCriteria[2][0] = inf
minCriteria[2][1] = Threshold3
maxCriteria[2][1] = inf
minCriteria[2][2] = minParSize
maxCriteria[2][2] = inf
// make the final output wave
Variable nRanges = dimSize(DateTimeWave,0)
Make /O/D/N=(nRanges, numpnts(sizeBinList)-1,3) homeDFR:SizeResFluW
Wave outWave = homeDFR:SizeResFluW
Make /FREE /N=(3) currMins // set size to the # of criteria in the largest type
Make /FREE /N=(3) currMaxs // set size to the # of criteria in the largest type
// build the output wave
// sortARange is the function which takes several inputs and gives one output which will be analysed to find
// particle counts, distributions, ratios, etc.
for(i=0; i<nRanges; i +=1)
for (j=0; j<3; j+=1)
currMins = minCriteria[j][p]
currMaxs = maxCriteria[j][p]
fluoreSizeBin(sizeBinList, particleSizes, currMins, currMaxs, typeParamNames[j], outWave, i, j)
endfor
endfor
// Move resulting wave and kill data folder
SetDataFolder root:RDDataFolder
suff = "DF_" + num2str(whichData)
suff = CleanupName(suff, 0)
NewDataFolder /O /S $suff
MoveWave homeDFR:SizeResFluW, : // note the comma; ":" moves the wave into the newly created (now current) folder
End
// This part deals with size binning according to the information generated by the Function above
Function fluoreSizeBin(binBoundsWave, parSizeWave, criteriaMin, criteriaMax, dataWaveNames, destinationWave, rowIn, layIn)
Wave binBoundsWave, parSizeWave
Wave criteriaMin, criteriaMax // wave of min and max values. must be same length as each other and of dataWaveNames
String dataWaveNames
Wave destinationWave
Variable rowIn, layIn
Variable nWaves = ItemsInList(dataWaveNames)
Variable j
Variable satisfied = 1
for(j=0; j<nWaves; j +=1)
Wave currDataWave = $(StringFromList(j, dataWaveNames))
if (currDataWave[rowIn] < criteriaMin[j] || currDataWave[rowIn] > criteriaMax[j]) // index by criterion j, not by layer layIn
satisfied = 0
break // this particle already fails; no need to test the remaining criteria
endif
endfor
if (satisfied)
FindLevel /Q/P binBoundsWave, parSizeWave[rowIn]
if (!V_flag)
destinationWave[rowIn][floor(V_LevelX)][layIn] = currDataWave[rowIn]
endif
endif
End
// This part appends all resulting data into one single wave. Output is a 3D wave which may contain millions of data points.
// Apparently this is where I got stuck.
Function AppendFluoreWaves(parentDF, DFpattern, killChildDFs, [sortedAppend, sortOptions, modelDF])
String parentDF, DFpattern
Variable killChildDFs
Variable sortedAppend
Variable sortOptions
String modelDF
// get list of childDFs matching DFpattern
if( strlen(parentDF) == 0 )
parentDF = GetDataFolder(1)
endif
String childDFs = ListMatch( ChildDFList(parentDF), DFpattern+"*" )
if( sortedAppend ) // default is 0 = false
childDFs = SortList(childDFs, ";", sortOptions) // default is 0 = ascending case-sensitive
endif // alphabetic ASCII sort
Variable numChildren = ItemsInList(childDFs) // count the "good" children
if( numChildren == 0 ) // ...none -- abort
Abort "Please first load some data to analyse"
endif
// build WList from waves in modelDF or in parentDF
string WList, saveDF = GetDataFolder(1)
if( !ParamIsDefault(modelDF) ) // use modelDF if specified
SetDataFolder modelDF
WList = WaveList("*", ";", "")
else
SetDataFolder parentDF // default to parentDF
WList = WaveList("*", ";", "")
endif
variable numWaves = ItemsInList(WList) // count waves to append to
if( numWaves == 0 ) // ...none -- abort
Abort "AppendWave found no waves to append to in data folder \""+GetDataFolder(1)+"\""
endif
variable k, m, childDataAppended
string childDF, currentWave, sourceWave, destWave
// cycle through childDFs
for( k = 0; k < numChildren; k += 1 )
childDF = StringFromList(k, childDFs)
SetDataFolder childDF
childDataAppended = 0 // don't want to kill childDFs that contained no data to append
// ... cycle through WList
for( m = 0; m < numWaves; m += 1 )
currentWave = StringFromList(m, WList)
sourceWave = childDF+currentWave
if( exists(sourceWave) == 1 ) // not all waves in child data folder may be relevant
destWave = parentDF+currentWave
//if(dimSize($sourceWave,0) > 1)
//if(dimSize($sourceWave,2) > 0)
Concatenate/NP=0 sourceWave+";", $destWave
//endif
//endif
childDataAppended = 1
endif
endfor
if( killChildDFs && childDataAppended ) // honor the flag and skip folders that contributed no data
KillDataFolder/Z $childDF // $ is needed: childDF is a string holding the folder path
endif
endfor
SetDataFolder saveDF
End
// There is also one other function which performs the statistical analysis using the main output wave (3D wave)
Does that mean you get an "out of memory" error message or what are the symptoms?
How large are those 3D waves? Are they double precision? Could they be in single precision?
May 6, 2013 at 08:31 am - Permalink
The 3-D matrices are created in fluoreSizeBin(...). Why again do you want to turn them back into 1-D waves with AppendFluoreWaves(...)?
Where are you replacing 0's by NaN's? I don't see that in your code. Have you thought about first transforming the 0's to NaN's in the CSV files using a text editor?
--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAHuntsville
May 6, 2013 at 09:29 am - Permalink
First of all, thanks for the comments. I will try to explain why and how I do those calculations. Firstly, the routine works great for small datasets.
1) I get an "out of memory" error when I run the routine with large datasets.
2) I can't replace zeros with NaNs in the csv files because those zeros and/or NaNs are the output of several calculations. That is, the original data may or may not contain zeros and/or NaNs. However, this is not directly related to the output data, which is a product of several csv files.
3) I don't convert them from 3-D matrices into 1D waves. I just append separate outputs (one resulting wave from each csv file) into a single wave. However, I have to decompose the 3-D wave into 1D waves to be able to use WaveStats, StatsQuantiles, or mean; those functions work only on 1D waves.
4) Only the date/time waves have to be in double precision; the others can be single precision.
5) I replace zeros with NaNs in a separate function which I didn't include here. That function handles only the statistical analysis.
6) By "large" I mean a 3-D matrix consisting of almost 10 million rows, 60 columns, and 3 layers. There are also single waves with almost the same number of data points. When I analyse only 60 csv files (around 500,000 rows, 60 columns, 3 layers) the total memory used reaches 3.4 GB.
So, you suggest using single precision waves instead of double precision to save some memory.
Some more information concerning how the entire toolkit runs:
i) The user chooses one or more csv data files. These are read and saved into separate data folders. In each data folder, the data points read from the csv file are saved in separate waves (each column, or parameter, as a new wave).
ii) The routine works on each data folder one by one, and the results are written into the current data folder to prevent overwriting.
iii) After that, the resulting waves from the separate data folders are appended into single waves and the remainder (the data folders containing the raw files) is deleted.
Cheers,
Emre
May 7, 2013 at 05:32 am - Permalink
Well, a single-precision wave of 500,000 rows x 60 columns x 3 layers occupies 360 MB,
and the same wave in double precision occupies 720 MB.
Depending on what the data are and the precision you need, that could be a possibility. Maybe even integer waves... it really depends on what you are working on. Otherwise have a look at
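The wave sizes being compared can be reconstructed from the 500,000-row, 60-column, 3-layer figures quoted earlier in the thread; the commands were presumably along these lines (an assumption, not necessarily the reply's exact code):

```igor
// 500,000 rows x 60 columns x 3 layers = 9e7 points per wave
Make/S/O/N=(500000, 60, 3) bigSingle   // single precision: 9e7 x 4 bytes = 360 MB
Make/D/O/N=(500000, 60, 3) bigDouble   // double precision: 9e7 x 8 bytes = 720 MB
```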
May 7, 2013 at 06:56 am - Permalink
That's 1.8e9 data points. If it is single precision data, that's 7.2e9 bytes, which is more than twice the address space available to a 32-bit program. And your sequence of steps produces several waves and copies of waves. Within the 3-GByte address space available to a 32-bit program on 32-bit Windows, you also have to load Igor itself, have memory for graphs, etc.
If you have a 64-bit Windows, you have 4 GBytes available, which still isn't enough. But if you have 64-bit Windows you could use Igor64. Go to http://www.wavemetrics.net/ and look for the link to download the Igor64 installer. Be sure to read the notes about installing Igor64.
John Weeks
WaveMetrics, Inc.
support@wavemetrics.com
May 7, 2013 at 09:55 am - Permalink
For example, say you are interested in the mean and standard deviation of the data:
1) open the file using the IGOR open operation.
2) read in e.g. 10000 lines of data and parse into values
3) For the variables you are interested in calculate the number of values (should be 10000, but if you are discounting NaN/Inf it may be smaller), the sum of values and the sum of the squared values. From the 2nd iteration onwards you need to accumulate those sums.
4) go to 2) until you've read all the data
5) calculate the statistics you want from the sum of the values and the sum of the squared values.
The point is that you never need to have the whole dataset in memory, just a subset.
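A rough Igor sketch of the chunk-by-chunk accumulation described above (the file path is a parameter, and the one-value-per-line parsing is a simplifying assumption that would need adapting to the real CSV layout):

```igor
// Streaming mean/std-dev: accumulate n, sum(x) and sum(x^2) while reading,
// so the full dataset never has to be in memory at once.
Function StreamingStats(pathStr)
	String pathStr                  // full path to the data file
	Variable refNum, x, n = 0, s = 0, ss = 0
	String lineStr
	Open/R refNum as pathStr
	do
		FReadLine refNum, lineStr
		if (strlen(lineStr) == 0)   // empty string signals end of file
			break
		endif
		x = str2num(lineStr)        // assumes one value per line; adapt for real columns
		if (numtype(x) == 0)        // skip NaN and Inf values
			n += 1
			s += x
			ss += x*x
		endif
	while (1)
	Close refNum
	printf "n = %d, mean = %g, sdev = %g\r", n, s/n, sqrt((ss - s*s/n)/(n-1))
End
```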
May 14, 2013 at 11:16 am - Permalink
ImageStats is not ideal for per-column calculations. Depending on the stats you need, it may be far more efficient to use WaveStats with /R=[...] and possibly with /M=1.
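Because Igor stores multidimensional waves column by column, each column occupies a contiguous run of points, so WaveStats/R can address one column of a 3D wave directly without copying anything out (a sketch; the function and wave names are made up):

```igor
// Stats for column c of layer l in a 3D wave w, with no data copied.
// Column-major storage means that column occupies points
// [l*nRows*nCols + c*nRows, l*nRows*nCols + (c+1)*nRows - 1].
Function ColumnStats(w, c, l)
	Wave w
	Variable c, l
	Variable nRows = DimSize(w, 0), nCols = DimSize(w, 1)
	Variable p0 = l*nRows*nCols + c*nRows
	WaveStats/Q/M=1/R=[p0, p0 + nRows - 1] w   // /M=1: only basic moments
	printf "col %d, layer %d: avg = %g\r", c, l, V_avg
End
```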
A.G.
WaveMetrics, Inc.
May 15, 2013 at 10:11 am - Permalink
lets you pick a plane and then restrict your region of interest to a subset of that plane. All the typical values are returned for this region, as with WaveStats, and you can give it /M=1 if you are only interested in V_avg and the local max and min. I just mentioned it because you can use it straightforwardly, rather than copying large sections of your data out, which seemed to be one of your problems regarding memory usage.
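The flag combination being described is presumably along these lines (a sketch; the wave name and the row/column ranges are made up):

```igor
// ImageStats restricted to a rectangular region of one plane of a 3D wave:
// /P picks the layer, /G={startRow, endRow, startCol, endCol} sets the region,
// /M=1 computes only the basic statistics (V_avg, V_min, V_max, ...).
ImageStats/M=1/P=2/G={0, 99, 10, 19} bigWave
print V_avg, V_min, V_max
```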
May 16, 2013 at 06:05 am - Permalink
I am aware of this possibility (I wrote it). I'm actually impressed that you found this relatively obscure feature. I still think that WaveStats/M=1 would execute faster. FWIW, this ImageStats feature was added to support ImageStats over rectangular areas without using ROI. In general, using ROI is less efficient because each pixel in the ROI wave is scanned at least once.
A.G.
WaveMetrics, Inc.
May 16, 2013 at 12:37 pm - Permalink
So using the /G flag won't scan the pixels more than once, right? Seems good. I haven't used it on enormous data sets, but on large TIFF files it seems to execute as quickly as WaveStats, and at least now he wouldn't have to copy the data into a suitable 1D format for WaveStats.
At any rate, I didn't realize this was obscure. WaveStats is for 1D waves, and the documentation sends you right to ImageStats for higher dimensionalities; this command sits right in the middle and seems to emulate WaveStats, which is probably why a lot of users got sent there in the first place, trying to use WaveStats on the wrong kind of wave. Glad you put it in there!
May 20, 2013 at 03:18 am - Permalink
You can only find this feature if you read the documentation -- it is not available as a menu or dialog selection. Even if you come across this flag's documentation it is not immediately obvious why it is more efficient than using an ROI.
May 20, 2013 at 10:29 am - Permalink