object name length limits, metaData, hashing, and searching
jamie
- names that exceed the 32 character limit on object names in Igor
- "liberal" names, which must be dealt with specially.
Each Igor wave has a waveNote, which is a logical place to store metaData. Finding a wave based on its wavenote, however, means looking at all the waves in the datafolder, which is slow if there are many waves, as it is a linear search. I wrote some code to use hash techniques to limit wavenames to 32 characters while still providing quick retrieval of a wave identified by associated metadata, which is stored in the waveNote.
In general, Hash maps offer data storage with quick insertion of key:value pairs and quick retrieval of the value for a given key. A hash map has a set of key:value pairs in an array. The position of each pair in the array is based on a hash of the key.
The size of the array can be much less than the size of the number of POSSIBLE keys, but must, of course, be at least as large as the number of ACTUAL key:value pairs. Similarly, the number of POSSIBLE meta-data based wavenames exceeds Igor's 32 character limits, but the number of ACTUAL waves in a data folder is not likely to exceed the number of 32 character long wavenames available. Thus, long strings of metaData can be hashed to shorter strings which are used as wavenames, while still allowing waves containing a particular combination of metaData to be found more quickly than with a linear search.
The analogy between a standard hash map and the hash of Igor waves:
Standard Hash Map--->Igor WaveName Hash
Key--------------------->metaData (stored in waveNote)
value ------------------->the wave itself, or a reference to it
array position---------->the name of the wave, where the name is in base37 using the legal Igor name characters (0 -9, a-z, _)
Some of the "user-facing" functions are described below:
//******************************************************************************************************
// Makes a wave named by a hash of the metaData string, puts the metadata in the waveNote, and returns a reference
// to that wave. If overWrite parameter is 0 and a wave already exists, returns a null reference and does not overwrite wave
function/Wave hashMakeWave (hashFolder, metaDataString, overWrite, wType, pSize, qSize, rSize, sSize, hashSize)
string hashFolder // path to the datafolder, the datafolder will be made if it does not exist
string metaDataString // string contaning unique metadata for this wave
variable overWrite // 1 to overwrite wave if it exists, 0 to not overwrite an existing wave
variable wType // wavemetrics code for wavetype. 0 means a text Wave, 2 means 33 bit float, etc
variable pSize, qSize, rSize, sSize // size of dimensions for wave. use 0 for unused dimensions
variable hashSize // only used if making folder for the first time
//******************************************************************************************************
// renames/moves an existing wave to become part of the hash system
function hashAddWave (theWave, hashFolder, metaDataString, overWrite, hashSize)
wave theWave
string hashFolder // path to the datafolder, the datafolder will be made if it does not exist
string metaDataString // string contaning unique metadata for this wave
variable overWrite // 1 to overwrite wave if it exists, 0 to not overwrite an existing wave
variable hashSize // only used if making folder for the first time
//******************************************************************************************************
// Gets a wave reference to a hashed wave based on metadata, or returns a null reference if the wave can not be found
function/Wave hashGetWave (hashFolder, metaDataStr)
string hashFolder
string metaDataStr
// Makes a wave named by a hash of the metaData string, puts the metadata in the waveNote, and returns a reference
// to that wave. If overWrite parameter is 0 and a wave already exists, returns a null reference and does not overwrite wave
function/Wave hashMakeWave (hashFolder, metaDataString, overWrite, wType, pSize, qSize, rSize, sSize, hashSize)
string hashFolder // path to the datafolder, the datafolder will be made if it does not exist
string metaDataString // string contaning unique metadata for this wave
variable overWrite // 1 to overwrite wave if it exists, 0 to not overwrite an existing wave
variable wType // wavemetrics code for wavetype. 0 means a text Wave, 2 means 33 bit float, etc
variable pSize, qSize, rSize, sSize // size of dimensions for wave. use 0 for unused dimensions
variable hashSize // only used if making folder for the first time
//******************************************************************************************************
// renames/moves an existing wave to become part of the hash system
function hashAddWave (theWave, hashFolder, metaDataString, overWrite, hashSize)
wave theWave
string hashFolder // path to the datafolder, the datafolder will be made if it does not exist
string metaDataString // string contaning unique metadata for this wave
variable overWrite // 1 to overwrite wave if it exists, 0 to not overwrite an existing wave
variable hashSize // only used if making folder for the first time
//******************************************************************************************************
// Gets a wave reference to a hashed wave based on metadata, or returns a null reference if the wave can not be found
function/Wave hashGetWave (hashFolder, metaDataStr)
string hashFolder
string metaDataStr
Now this is fine for finding a wave based on all the meta-data used to make the hash. It kind of falls down when trying to get all of the waves that match for a certain element of the metadata. Say we want to retrieve all waves with idCode 75 where we have metadata for date, idCode, and treatment. In the case where the waveNames are made directly from the metadata, we could use something like WaveList("*75*", ";", "" ). But with waveNames hashed on all of date, idCode, and treatment, there is no easy way to get such list. We must fall back to a linear search again, examining the metadata of all the waves in the folder in turn.
A quick and dirty workaround would be to hash each type of metadata separately and have the wavename be a concatenation of those hashes, which would allow use of the WaveList function after hashing just the metadata we were interested in listing for WaveList("*" + hash("75") + "*", ";", ""). But this might lead to wavenames longer than 32 characters pretty quickly. It looks like a simple hash is not flexible enough for many common uses.
Does anybody have any ideas about applying more database-like capabilities to the problem? Sets and Intersections and that kind of stuff?
there's nothing preventing you from creating a relational database in Igor. But I think this is stretching Igor's limits a bit too much, and that you'll end up with lots of work in exchange for poor to mediocre performance.
If I were you I'd start by looking into leaving the relational database functionality to a ... relational database. Have you considered using SQL.xop in combination with something like SQLite?
March 1, 2012 at 04:12 am - Permalink
Several years ago I wrote a package that I used for collecting patch clamp data from cells. Originally, for each trace I recorded, I saved stimulus and recording parameters as KEY:value; pairs in the wave note. The problem with doing this is that a lot of the bytes in the wave note (the KEY strings) are redundant, and my files were getting too large. So I later rewrote the program to create a 2D wave in which each column was a parameter I wanted to save and each row corresponded to a single trace. I then wrote a query engine that used the Extract operation with the data in this 2D wave (and actually several other similar 2D waves that contained other information). This worked well and was plenty fast for my needs but my queries couldn't get too complicated or they would exceed the maximum length of a command line or line in a procedure file (400 characters).
An alternative would be to scrap your hash table completely and go back to just searching through each wave's wave note--but with a twist.
Here is an example function:
Variable numWavesToGenerate
// Generate a bunch of waves in a
// new data folder.
NewDataFolder/O/S test
WAVE/WAVE generatedWaves = generateWaves(numWavesToGenerate)
// Start timing.
Variable start = StopMsTimer(-2)
// Instead of searching through a wave note,
// we'll just calculate the total number of points
// in all waves in the current data folder.
Variable totalNumPoints = 0
//////////////////////////////
// Slow way
//////////////////////////////
// Iterate through all the waves in the current
// data folder and get the number of points in the
// wave.
Variable numWaves = CountObjects("", 1);
Variable n
String thisWaveName
// Start timing the first loop.
Variable firstLoopStart = StopMsTimer(-2)
For (n=0; n < numWaves; n+=1)
thisWaveName = GetIndexedObjName("", 1, n)
if (strlen(thisWaveName) == 0)
break
endif
WAVE thisWave = $(thisWaveName)
totalNumPoints += numpnts(thisWave)
EndFor
// Stop timing the first loop.
Variable firstLoopStop = StopMsTimer(-2)
print "Total Num Points of first loop:", totalNumPoints
//////////////////////////////////
// Fast way
//////////////////////////////////
// Iterate through all waves in generatedWaves.
// Since this wave is a wave of wave references,
// we don't have to look up each wave in the current
// data folder by name, which gets slower the more
// waves a data folder contains.
totalNumPoints = 0
// Start timing the second loop.
Variable secondLoopStart = StopMsTimer(-2)
For (n=0; n < numWaves; n+=1)
WAVE thisWave = generatedWaves[n]
totalNumPoints += numpnts(thisWave)
EndFor
// Stop timing the second loop.
Variable secondLoopStop = StopMsTimer(-2)
// Stop timing.
Variable stop = StopMsTimer(-2)
print "Total Num points of second loop:", totalNumPoints
// Go back to the root data folder
SetDataFolder root:
// Kill the data folder containing the waves.
KillDataFolder root:test
// Print timing results.
Variable total = (stop - start)/1000
Variable first = (firstLoopStop - firstLoopStart)/1000
Variable second = (secondLoopStop - secondLoopStart)/1000
printf "%d waves (ms):: Overall: %g\t1st loop: %g\t2nd loop: %g\r", numWavesToGenerate, total, first, second
End
// Generates numToGenerate waves in the
// current data folder.
Function/WAVE generateWaves(numToGenerate)
Variable numToGenerate
Variable n
String thisWaveName
// Create a free wave to hold wave references to the
// generated waves.
Make/O/N=(numToGenerate)/WAVE/FREE waveOfWaves
For (n=0; n < numToGenerate; n+=1)
sprintf thisWaveName, "wave%.8d", n
Make/O $thisWaveName /WAVE=thisWave
waveOfWaves[n] = thisWave
EndFor
return waveOfWaves
End
The first loop iterates through all of the waves in the current data folder using GetIndexedObjName(). This is relatively slow because waves in a data folder are stored in a linked list. This is not a problem when there are not very many waves in a data folder, but when you have thousands of waves in the data folder, the time to iterate through all waves rapidly increases.
The second method uses the wave of wave references returned by the generateWaves() function and iterates through that wave. In this case the time to iterate through the waves is roughly linear, and it is much faster even for fewer waves because there is no need to lookup the waves from the linked list.
On a development build of Igor 6, here are the timings I get:
•test(5000) Total Num Points of first loop: 640000 Total Num points of second loop: 640000 5000 waves (ms):: Overall: 681.901 1st loop: 679.458 2nd loop: 1.19863 •test(10000) Total Num Points of first loop: 1.28e+06 Total Num points of second loop: 1.28e+06 10000 waves (ms):: Overall: 2719 1st loop: 2715.07 2nd loop: 2.54419 •test(20000) Total Num Points of first loop: 2.56e+06 Total Num points of second loop: 2.56e+06 20000 waves (ms):: Overall: 20047 1st loop: 20037.8 2nd loop: 5.97675
Note that in the real world, you would need to generate a wave of WAVE references and store that wave somewhere. Generating that wave the first time could be slow if there are lots of waves, but you only need to generate it once (unless you are adding/deleting waves in the data folder of interest).
Depending on how you are doing the search, it might even be possible to do the search using multiple threads, which could speed things up even further.
March 1, 2012 at 09:14 am - Permalink
UBC In Vivo Imaging Core
Thanks for the ideas.
Adam, it was a bit of a surprise to me to see that GetIndexedObjName(sourceFolderStr, objectType, ii) runs in linear time, traversing a linked list. I wrote functions that trawl through all the waves in a folder with getIndexedObjectName, doing matching based on wave names, or wave notes, and indeed, they are big O of n^2.
There must be some way to maintain a sorted structure of pointers to waves so you could get the nth wave without having to traverse a whole list. That's where you idea of maintaining your own database of wavereferences is interesting. You could have several lists of references to the same wave, sorted by different metadata. That is, pair each waveReference wave with a wave of corresponding metadata, and sort both waves by the metadata wave. You could find all the data waves corresponding to a range of values for some metadata quite quickly then, by binary search on the metadata wave. But as you point out, you would have to make and maintain the list of wave references.
By contrast, the WaveList(matchStr, separatorStr, optionsStr ) function runs in linear time; it can do the check for the wavename matching on each wave as it traverses the list. I use a function (below) that duplicates (and extends to other Igor objects) the WaveList function. Another advantage is that you don't have to change datafolders to use GetIndexedObjName as with WaveList. I didn't realize what a performance hit that was (n vs n^2).
variable nT
variable ii, iii
variable timer, eTIme, nELs
newdatafolder/o/s root:wlt
killwaves/a/z
make/o/n= (nT/10) rTWave, rElWave, rTwave1, rElwave1
rTWave=0
relwave =0
display rTWave vs rElWave
appendtograph/r rTwave1 vs rElwave1
modifygraph mode=2, lsize=2
modifygraph rgb (rTwave1) = (0,0,65535)
doupdate
for (ii=0; ii < nT; ii+=1)
for (iii=0; iii < 3; iii += 1)
make/o/n = (100) $"W_xxx" + num2str (3*ii+iii)
make/o/n = (100) $"W_xxx" + num2str (3*ii+iii)
make/o/n = (100) $"W_yy" + num2str (3*ii+iii)
endfor
if (mod (ii, 10) ==0)
timer = startmstimer
nELs= itemsinlist (wavelist ("*yy*", ";", ""), ";")
eTIme = stopMSTimer(timer )
rTWave [ii/10] = eTIme
rElWave [ii/10] = nEls
timer = startmstimer
nELs= itemsinlist (ListObjects("root:wlt", 1, "*yy*", 0, ""), ";") // uses getIndexedObject in a loop
eTIme = stopMSTimer(timer )
rTWave1 [ii/10] = eTIme
rElWave1 [ii/10] = nEls
endif
if (mod (ii, 100) ==0)
doupdate
endif
endfor
end
//******************************************************************************************************
Function/s ListObjects(sourceFolderStr, objectType, matchStr, givepath, erstring)
string sourceFolderStr // can be either ":" or "" to specify the current data folder. You can also use a full or partial data folder path.
variable objectType / 1 = waves, 2= numeric variables, 3= string variables, 4 = data folders
string matchstr // limit list of objects with this wildcard-enhanced string. use "*" to get all objects
variable givepath // use 1 to get full or relative filepaths depending on sourceFolderStr. use 0 to get just the objects with no path.
string erstring // a string to return if no objects are found
string objName ="" // Will contain all the objects in the folder, one by one as we iterate through the index
string liststr = "" // will contain the list of objects found and matched
variable numObjs // the number of objects in the folder under consideration
variable ii // used to iterate through the list of objects
if (!(datafolderexists (sourceFolderStr)))
return "\\M1(" + sourceFolderStr + " does not exist." // why the "\\M1(" ? It is a formatting code for disabling choices in pop-up menus
else
numObjs = CountObjects(sourceFolderStr, objectType)
for (ii = 0; ii < numObjs; ii += 1)
objName = GetIndexedObjName(sourceFolderStr, objectType, ii)
if (!(stringmatch(objname, matchStr )))
continue
endif
if (givepath & 1)
if ((cmpstr (sourceFolderStr [strlen(sourceFolderStr)-1], ":")) == 0)
objName = sourceFolderstr + objname
else
objName = sourceFolderstr + ":" + objname
endif
endif
liststr += objname + SelectString(((objectType == 4) && (!(givePath &2))), ";", ":;")
endfor
if ((strlen (liststr)) < 2)
liststr = erstring
endif
return liststr
endif
end
Thinking about the idea of making a wave of wavereferences for indexing: Another simpler to set up and maintain solution might be to have a dictionary containing all the metadata keys mapped to individual letters, and, for each key, another dictionary of all the values for that key mapped to an integer. This mapping would provide short letter/number codes that could be used in wavenames. So the keys would be A,B,C, etc. and the values would be 1,2,3 etc, Wave named would look like A12B31C6. With 32 character length, you could have 7 keys, each of which could contain a maximum of 2^4 values, and still have 2 characters left over. Selecting all the waves that contained a particular value of a particular key would be a matter of finding the corresponding letter code for the key, the integer code for the value, and using the wavelist function, which at least gives us linear time, (and does it in compiled code, not Igor script).
741, the problem with using a database is it requires getting the data into a database, serving the database, and connecting to it. Also, a lot of our data are long waveform traces, or multiframe images. Not the stuff you want to be passing back and forth over a network. That being said, a lot of the analysis is done after some processing, extracting relative fluorescence changes from ROIS, eg. It may make sense to put those results into a database, and then use queries to easily look at results across multiple experiments.
While the topic of databases and metadata-rich scientific data, especially images is at hand, as anyone looked at aOMERO? http://www.openmicroscopy.org/site
OMERO is client-server software for visualisation, management and analysis of biological microscope images.
Looks kind of cool, and it has open source libraries that it would be possible to call from an Igor XOP. There is already Matlab access for it.
jamie
March 2, 2012 at 02:39 pm - Permalink