Compare Multiple Strings

My strategy would probably be to put the strings into a text wave or string list, then use Sort or SortList. That way you only need to look at consecutive entries to remove duplicates.

Log in or register to post comments

July 16, 2013 at 12:39 pm - Permalink

andyfaff

How about something along these lines:

#pragma rtGlobals=3     // Use modern global access method and strict wave access.
Function find_duplicated_entries(list)
    string list
 
    //ok, lets see if any are the same
    //start off by putting them in a text wave 
    //make/n=(itemsinlist(list))/t/free individual
    //individual[] = stringfromlist(p, list)
    sockitstringtowave/DEST=individual/Free/TOK=";" 0, list
    Wave/t pop = individual
        make/n=(numpnts(pop) indx
        indx=p
 
    sort/A/C individual, individual, indx
    make/n=(numpnts(individual))/o comparer
    comparer = (p < numpnts(individual) - 1 && stringmatch(pop[p + 1], pop[p])) ? 1 : 0
    if(sum(comparer))
        extract/o individual, output, comparer > 0
                extract/o indx, output_indx, comparer > 0
        print "you had duplicates", output, "at indices", output_indx
    endif
End

EDIT: changed rtglobals from 1 to 3

Log in or register to post comments

July 17, 2013 at 03:04 pm - Permalink

proland

Thanks Andy,

Not only does your code check for duplicates but also reports the duplicate entries. This will be helpful in my user interface.

Edit: there seems to be an error on "sockitstringtowave". Is this a function you have written elsewhere or is this a typo. I can't find a reference to this in the IGOR manual.

Log in or register to post comments

July 17, 2013 at 07:37 am - Permalink

sbossuyt

andyfaff wrote:

How about something along these lines:

I like it! (Except for the #pragma line; misleading comments are worse than no comments.) However, I'd rather not assume I have sockitstringtowave, or that I am using the default list separator, or that case sensitivity does not matter, so when I started using this to deal with some of the "//TODO: deal with duplicates" lines in some of my code, I ended up with this:

Function/T CollapseDuplicatesInList(listStr, listSepStr, matchCase, [verbose, T_dest, T_duplicates])
    String listStr, listSepStr
    Variable matchCase, verbose
    WAVE/T T_dest, T_duplicates
 
    verbose = ParamIsDefault(verbose) ? 0 : verbose
    
    Variable i, n= ItemsInList(listStr, listSepStr)
    Make/N=(n)/O/T T_list = StringFromList(p, listStr, listSepStr) // convert separated list to text wave
    Make/N=(n)/O W_idx, W_flag, W_orig=p
    // second sort key ensures that the duplicates stay in their original order
    if (matchCase)
        MakeIndex/C {T_list,W_orig} W_idx
    else
        MakeIndex {T_list,W_orig} W_idx
    endif
    
    W_flag = p==0 ? 0 : !cmpstr(T_list[W_idx[p]], T_list[W_idx[p-1]], matchCase)
    // this one-liner does the same as the for loop
    // W_orig = W_flag ? W_orig[p-1] : W_idx[p]
    for ( i=0 ; i<n ; i+=1 )
        if ( W_flag[i] )
            W_orig[i] = W_orig[i-1]
        else
            W_orig[i] = W_idx[i]
        endif
    endfor
    Sort W_idx W_orig, W_flag   // O(N.logN) solution to the O(N) problem of inverting the permutation that sorted the list
 
    if ( verbose )
        for ( i=0 ; i<n ; i+=1 )
            if ( W_flag[i] )
                printf "\"%s\" at index %d is a duplicate of index i %d", T_list[i], i, W_orig[i]
            endif
        endfor
    endif
    
    if ( ParamIsDefault(T_dest))
        Make/T/FREE/N=0 T_dest
    endif
    Extract/T T_list, T_dest, ! W_flag
    
    if ( ! ParamIsDefault(T_duplicates))
        Extract/T T_list, T_duplicates, W_flag
    endif
    
    Grep/E=".*"/Q/LIST=listSepStr T_dest // convert text wave to separated list
    Return S_Value
End

If anybody knows of less hack-ish patterns for getting back to the original order from a sorted list or making a text wave into a separated list, please let me know!

Log in or register to post comments

July 17, 2013 at 10:55 am - Permalink

andyfaff

proland wrote:
Thanks Andy,

Not only does your code check for duplicates but also reports the duplicate entries. This will be helpful in my user interface.

Edit: there seems to be an error on "sockitstringtowave". Is this a function you have written elsewhere or is this a typo. I can't find a reference to this in the IGOR manual.

SOCKITstringtowave and SOCKITwavetostring are operations I wrote for the SOCKIT xop which transfer the contents from strings to waves and waves to strings. strings are just a single container for binary data. If you are dealing with text waves there is some functionality for tokenising, either putting them in (SOCKITwavetostring) or splitting on them (SOCKITwavetostring).
It does the transfer in a single line and is very fast.

Log in or register to post comments

July 17, 2013 at 03:08 pm - Permalink

andyfaff

sbossuyt wrote:

I like it! (Except for the #pragma line; misleading comments are worse than no comments.)

I edited this. I made rtglobals=1 when debugging, but change it to 3 at the end.
Yes, you need to install the SOCKIT xop to use sockitstringtowave to put them in a textwave. However, those two operations come in really useful. For example, you go on to ask how do you go about "making a text wave into a separated list"? It's quite simple if you use SOCKITwavetostring, but a whole lot more difficult if you don't, loops galore. String concatenation can be very slow. It's not a problem for 6 entries, what about hundreds of thousands?

Resorting the list is quite simple in my snippet, you just use the indx wave to re-sort the textwave, then make back into a list. This is not a hack at all.

Log in or register to post comments

July 17, 2013 at 03:20 pm - Permalink

sbossuyt

andyfaff wrote:
I made rtglobals=1 when debugging, but change it to 3 at the end.

I'm guilty myself of having misleading comments in my code after debugging, often enough. Sorry if I offended with what I said about misleading comments.

andyfaff wrote:

Yes, you need to install the SOCKIT xop to use sockitstringtowave to put them in a textwave. However, those two operations come in really useful.

I know your sockit xop comes in handy often, but on general principle I prefer to stick with things that work on a fresh install of Igor, for code I will be sharing or will be maintaining for a long time.

andyfaff wrote:

For example, you go on to ask how do you go about "making a text wave into a separated list"? It's quite simple if you use SOCKITwavetostring, but a whole lot more difficult if you don't, loops galore. String concatenation can be very slow. It's not a problem for 6 entries, what about hundreds of thousands?

I wouldn't say it is a whole lot more difficult. There are other operations than sockitwavetostring that can take care of the loop. I used a one-liner using Igor's built-in Grep operation:

Grep/E=".*"/Q/LIST=listSepStr T_dest // convert text wave to separated list

I haven't tried it on hundreds of thousands, but it's fast enough in my experience. Definitely faster than a loop. My only problem with it is that it's not very readable code. Without the comment, I'd likely have forgotten what that line does by the next time I read the code.

andyfaff wrote:
Resorting the list is quite simple in my snippet, you just use the indx wave to re-sort the textwave, then make back into a list.

That's what I do in my snippet too. It's simple, but inefficient. Sorting should be unnecessary because the index wave has much more information than just the order in which the items go: it already has their exact places. I just can't think of a way to use that information in a way that doesn't end up being slower than using the built-in Sort operation. Maybe for waves with hundreds of thousands of elements, the straightforward O(N) solution to the problem including the overhead of an explicit loop and point-wise wave assignment

    for ( i=0 ; i<n ; i+=1 )
        W_orig[W_idx[i]] = W_orig_Sorted[i]
        W_flag[W_idx[i]] = W_flag_Sorted[i]
    endfor

would be faster (Note that in my case these aren't variable-length strings to move around, but numbers representing indices and booleans.) than

    Sort W_idx W_orig, W_flag   // O(N.logN) solution to the O(N) problem of inverting the permutation that sorted the list

If Igor's IndexSort operation had an "inverse" flag, I'd use that. The fact that it doesn't makes me wonder whether I'm missing something obvious.

andyfaff wrote:
This is not a hack at all.

According to the Jargon File, a hack is

1. /n./ Originally, a quick job that produces what is needed, but not well.

To my mind, code does two jobs: it runs, and it communicates to the programmer. If it runs perfectly but it is hard to see why, I consider it a hack. That covers the Grep one-liner to turn a text wave into a delimited list. Using sort to get back to the original order after making an index is the exact opposite case. It is perfectly readable, but produces the desired result in a roundabout way. To my mind, at least, using Sort to get the advantages of doing the permutation in-place even though the sorting part is totally unnecessary is not doing the job well.

Log in or register to post comments

July 18, 2013 at 03:11 am - Permalink

andyfaff

sbossuyt wrote:

I know your sockit xop comes in handy often, but on general principle I prefer to stick with things that work on a fresh install of Igor, for code I will be sharing or will be maintaining for a long time.

I agree, I was a bit surprised that I couldn't do this in IGOR easily (that is SOCKITwavetostring and SOCKITstringtowave).

sbossuyt wrote:

I wouldn't say it is a whole lot more difficult. There are other operations than sockitwavetostring that can take care of the loop. I used a one-liner using Igor's built-in Grep operation:

That may be the case for textwaves, but is not the case for numeric waves. Does the grep command work when the data in there is binary?

I may be missing something, but if the aim is to just remove duplicates from the original list, then can't one just use removelistitem? Simply use indx to figure out where the duplicates are and remove them from the list (from the highest index downwards).

Log in or register to post comments

July 18, 2013 at 06:54 pm - Permalink

proland

andyfaff wrote:

I may be missing something, but if the aim is to just remove duplicates from the original list, then can't one just use removelistitem? Simply use indx to figure out where the duplicates are and remove them from the list (from the highest index downwards).

There may be many instances where removing duplicates is desires but my current aim was to simply use this as a check to ensure that the user selected appropriate waves for a calculation. For instance, if they select the same material property wave for two different materials, the calculation will be incorrect (end goal is to calculate polarizability of small semiconducting particles embedded in another material). Therefore, I plan to notify the user of their mistake and prompt them to correct it before proceeding as a check to prevent them from reporting incorrect results.

While reporting which entries are duplicates is very useful for large lists, my list is small and readily visible to the user on a panel interface so a simple notification that duplicates exist is currently enough for me.

Log in or register to post comments

July 19, 2013 at 06:03 am - Permalink

sbossuyt

andyfaff wrote:
Does the grep command work when the data in there is binary?

I don't think the grep command will accept a numeric wave.

andyfaff wrote:
I may be missing something, but if the aim is to just remove duplicates from the original list, then can't one just use removelistitem? Simply use indx to figure out where the duplicates are and remove them from the list (from the highest index downwards).

Depending on how many duplicates there are and how long the list is, that could get slow, because every character after the newly-removed item needs to be moved to a new location in the string. Rebuilding the list from scratch only requires copying each string to its new location once. And I'd be surprised if getting the n-th element from a text wave isn't quicker than getting the n-th item from a delimited list. So, since I already have the text waves, I figured I might as well build the list from that. The Igor manual warns about that issue with text waves, in fact, that it is often quicker to set the length of the text wave to zero and rebuild completely than to change the lengths of a few elements.

Then again, I did use the O(n^2) Make/N=(n)/O/T T_list = StringFromList(p, listStr, listSepStr) instead of writing something that just traverses the list once, so it's a bit hypocritical of me to complain about this and about the O(N.log(N)) sort. Feel free to write a better version.

Log in or register to post comments

July 19, 2013 at 03:23 pm - Permalink