Compare Multiple Strings
proland
I have a user input for multiple wave names and I need to ensure that each is a unique entry. I know I can compare them to eachother in pairs using either stringMatch or cmpstr but this seems tedious. This is what I'm doing now and it works but I'd like to learn of a better way.
Could someone recommend a way to compare 6 or more strings quickly? I was thinking perhaps placing them in a list and using ListMatch and ItemsInList to determine how many times each name appear but this also seems a bit lengthy.
Sort
orSortList
. That way you only need to look at consecutive entries to remove duplicates.July 16, 2013 at 12:39 pm - Permalink
Function find_duplicated_entries(list)
string list
//ok, lets see if any are the same
//start off by putting them in a text wave
//make/n=(itemsinlist(list))/t/free individual
//individual[] = stringfromlist(p, list)
sockitstringtowave/DEST=individual/Free/TOK=";" 0, list
Wave/t pop = individual
make/n=(numpnts(pop) indx
indx=p
sort/A/C individual, individual, indx
make/n=(numpnts(individual))/o comparer
comparer = (p < numpnts(individual) - 1 && stringmatch(pop[p + 1], pop[p])) ? 1 : 0
if(sum(comparer))
extract/o individual, output, comparer > 0
extract/o indx, output_indx, comparer > 0
print "you had duplicates", output, "at indices", output_indx
endif
End
EDIT: changed rtglobals from 1 to 3
July 17, 2013 at 03:04 pm - Permalink
Not only does your code check for duplicates but also reports the duplicate entries. This will be helpful in my user interface.
Edit: there seems to be an error on "sockitstringtowave". Is this a function you have written elsewhere or is this a typo. I can't find a reference to this in the IGOR manual.
July 17, 2013 at 07:37 am - Permalink
I like it! (Except for the #pragma line; misleading comments are worse than no comments.) However, I'd rather not assume I have sockitstringtowave, or that I am using the default list separator, or that case sensitivity does not matter, so when I started using this to deal with some of the "//TODO: deal with duplicates" lines in some of my code, I ended up with this:
String listStr, listSepStr
Variable matchCase, verbose
WAVE/T T_dest, T_duplicates
verbose = ParamIsDefault(verbose) ? 0 : verbose
Variable i, n= ItemsInList(listStr, listSepStr)
Make/N=(n)/O/T T_list = StringFromList(p, listStr, listSepStr) // convert separated list to text wave
Make/N=(n)/O W_idx, W_flag, W_orig=p
// second sort key ensures that the duplicates stay in their original order
if (matchCase)
MakeIndex/C {T_list,W_orig} W_idx
else
MakeIndex {T_list,W_orig} W_idx
endif
W_flag = p==0 ? 0 : !cmpstr(T_list[W_idx[p]], T_list[W_idx[p-1]], matchCase)
// this one-liner does the same as the for loop
// W_orig = W_flag ? W_orig[p-1] : W_idx[p]
for ( i=0 ; i<n ; i+=1 )
if ( W_flag[i] )
W_orig[i] = W_orig[i-1]
else
W_orig[i] = W_idx[i]
endif
endfor
Sort W_idx W_orig, W_flag // O(N.logN) solution to the O(N) problem of inverting the permutation that sorted the list
if ( verbose )
for ( i=0 ; i<n ; i+=1 )
if ( W_flag[i] )
printf "\"%s\" at index %d is a duplicate of index i %d", T_list[i], i, W_orig[i]
endif
endfor
endif
if ( ParamIsDefault(T_dest))
Make/T/FREE/N=0 T_dest
endif
Extract/T T_list, T_dest, ! W_flag
if ( ! ParamIsDefault(T_duplicates))
Extract/T T_list, T_duplicates, W_flag
endif
Grep/E=".*"/Q/LIST=listSepStr T_dest // convert text wave to separated list
Return S_Value
End
If anybody knows of less hack-ish patterns for getting back to the original order from a sorted list or making a text wave into a separated list, please let me know!
July 17, 2013 at 10:55 am - Permalink
SOCKITstringtowave and SOCKITwavetostring are operations I wrote for the SOCKIT xop which transfer the contents from strings to waves and waves to strings. strings are just a single container for binary data. If you are dealing with text waves there is some functionality for tokenising, either putting them in (SOCKITwavetostring) or splitting on them (SOCKITwavetostring).
It does the transfer in a single line and is very fast.
July 17, 2013 at 03:08 pm - Permalink
I edited this. I made rtglobals=1 when debugging, but change it to 3 at the end.
Yes, you need to install the SOCKIT xop to use sockitstringtowave to put them in a textwave. However, those two operations come in really useful. For example, you go on to ask how do you go about "making a text wave into a separated list"? It's quite simple if you use SOCKITwavetostring, but a whole lot more difficult if you don't, loops galore. String concatenation can be very slow. It's not a problem for 6 entries, what about hundreds of thousands?
Resorting the list is quite simple in my snippet, you just use the indx wave to re-sort the textwave, then make back into a list. This is not a hack at all.
July 17, 2013 at 03:20 pm - Permalink
I'm guilty myself of having misleading comments in my code after debugging, often enough. Sorry if I offended with what I said about misleading comments.
I know your sockit xop comes in handy often, but on general principle I prefer to stick with things that work on a fresh install of Igor, for code I will be sharing or will be maintaining for a long time.
I wouldn't say it is a whole lot more difficult. There are other operations than sockitwavetostring that can take care of the loop. I used a one-liner using Igor's built-in Grep operation:
I haven't tried it on hundreds of thousands, but it's fast enough in my experience. Definitely faster than a loop. My only problem with it is that it's not very readable code. Without the comment, I'd likely have forgotten what that line does by the next time I read the code.
That's what I do in my snippet too. It's simple, but inefficient. Sorting should be unnecessary because the index wave has much more information than just the order in which the items go: it already has their exact places. I just can't think of a way to use that information in a way that doesn't end up being slower than using the built-in Sort operation. Maybe for waves with hundreds of thousands of elements, the straightforward O(N) solution to the problem including the overhead of an explicit loop and point-wise wave assignment
W_orig[W_idx[i]] = W_orig_Sorted[i]
W_flag[W_idx[i]] = W_flag_Sorted[i]
endfor
would be faster (Note that in my case these aren't variable-length strings to move around, but numbers representing indices and booleans.) than
If Igor's IndexSort operation had an "inverse" flag, I'd use that. The fact that it doesn't makes me wonder whether I'm missing something obvious.
According to the Jargon File, a hack is
To my mind, code does two jobs: it runs, and it communicates to the programmer. If it runs perfectly but it is hard to see why, I consider it a hack. That covers the Grep one-liner to turn a text wave into a delimited list. Using sort to get back to the original order after making an index is the exact opposite case. It is perfectly readable, but produces the desired result in a roundabout way. To my mind, at least, using Sort to get the advantages of doing the permutation in-place even though the sorting part is totally unnecessary is not doing the job well.
July 18, 2013 at 03:11 am - Permalink
I agree, I was a bit surprised that I couldn't do this in IGOR easily (that is SOCKITwavetostring and SOCKITstringtowave).
That may be the case for textwaves, but is not the case for numeric waves. Does the grep command work when the data in there is binary?
I may be missing something, but if the aim is to just remove duplicates from the original list, then can't one just use
removelistitem
? Simply use indx to figure out where the duplicates are and remove them from the list (from the highest index downwards).July 18, 2013 at 06:54 pm - Permalink
There may be many instances where removing duplicates is desires but my current aim was to simply use this as a check to ensure that the user selected appropriate waves for a calculation. For instance, if they select the same material property wave for two different materials, the calculation will be incorrect (end goal is to calculate polarizability of small semiconducting particles embedded in another material). Therefore, I plan to notify the user of their mistake and prompt them to correct it before proceeding as a check to prevent them from reporting incorrect results.
While reporting which entries are duplicates is very useful for large lists, my list is small and readily visible to the user on a panel interface so a simple notification that duplicates exist is currently enough for me.
July 19, 2013 at 06:03 am - Permalink
I don't think the grep command will accept a numeric wave.
Depending on how many duplicates there are and how long the list is, that could get slow, because every character after the newly-removed item needs to be moved to a new location in the string. Rebuilding the list from scratch only requires copying each string to its new location once. And I'd be surprised if getting the n-th element from a text wave isn't quicker than getting the n-th item from a delimited list. So, since I already have the text waves, I figured I might as well build the list from that. The Igor manual warns about that issue with text waves, in fact, that it is often quicker to set the length of the text wave to zero and rebuild completely than to change the lengths of a few elements.
Then again, I did use the O(n^2)
Make/N=(n)/O/T T_list = StringFromList(p, listStr, listSepStr)
instead of writing something that just traverses the list once, so it's a bit hypocritical of me to complain about this and about the O(N.log(N)) sort. Feel free to write a better version.July 19, 2013 at 03:23 pm - Permalink