Find longest common substring in a list?

Edit: I had posted an idea but then I re-read your post and realized I was not understanding the problem. If you could post a couple of example lists, that would help me understand the problem better.

Log in or register to post comments

July 19, 2019 at 04:02 pm - Permalink

topchem

Here is an example list

P20_C1_F1_R3__1903_036;P20_C1_F1_R3__1903_037;P20_C1_F1_R3__1903_038;P20_C1_F1_R3__1903_039;P20_C1_F1_R3__1903_040;

In this case, the SampleID = P20_C1_F1_R3__1903

The longest common substring would actually be P20_C1_F1_R3__1903_0 (due to the leading zero on the ReplicateID), but I can probably handle that downstream.

Another example is

P7_PRE_B;P7_1_A;P7_1_B;P7_2_A;P7_2_B;

Here SampleID = P7, which is also the longest substring after removal of the terminal "_".

Log in or register to post comments

July 19, 2019 at 05:12 pm - Permalink

gsb

If sampleID is always the start of each list item, you could take a simple brute-force approach and figure out how many characters it is until the strings don't match. E.g., the following seems to work if you run testit()

function/S longestCommonStartStr(list,[listWithoutStartStr])
    String list
    String &listWithoutStartStr //optional -- will hold list without startStr
    
    WAVE/T txt = listtotextwave(list,";")   
    String testChar=(txt[0])[0],commonStartStr=""
    Variable i,items=dimsize(txt,0),count,doBreak=0
    
    for (count=0;strlen(testChar)>0;)
        
        for (i=1;i<items;i+=1)
            if (!stringmatch( (txt[i])[count],testChar))
                doBreak=1
                break
            endif
        endfor  
        
        if (doBreak)
            break
        endif
        
        commonStartStr+=testChar
        count+=1
        testChar = (txt[0])[count]
    endfor
    
    if ( !paramIsDefault(listWithoutStartStr) )
        txt = ReplaceString(commonStartStr,txt[p],"",0,1)
        wfprintf listWithoutStartStr,"%s;",txt
    endif
    
    return commonStartStr
    
end
 
function testit()
    String blah
    print longestCommonStartStr("aaadaab;aaaaa;aaaaac;",listWithoutStartStr=blah)
    print blah
end

Log in or register to post comments

July 19, 2019 at 06:21 pm - Permalink

topchem

That indeed seems to do what I want.

Thanks.

Log in or register to post comments

July 19, 2019 at 09:45 pm - Permalink

olelytken

If you have the name as a string wave you could sort the wave alphabetically with Sort and then find the longest common string of the first and last strings in the wave. That way you only have to compare two strings and not all of them.

Log in or register to post comments

July 21, 2019 at 04:02 am - Permalink

aclight

olelytken wrote:
If you have the name as a string wave you could sort the wave alphabetically with Sort and then find the longest common string of the first and last strings in the wave. That way you only have to compare two strings and not all of them.

With Igor 7 or later you can use the ListToTextWave function to convert a string list into a text wave. However for the purposes of this question you could also just use SortList (probably on a copy of the string list) without first converting the string list into a text wave.

Log in or register to post comments

July 21, 2019 at 06:55 am - Permalink

jjweimer

This incorporates the suggestion from Olelytken and Adam. It avoids an explicit for loop.

Function/S get_longestcommonprefix(strlist,[oend])
    string strlist,oend
    
    if (ParamIsDefault(oend))
        oend = "_"
    endif
    
    string rtstr = "",fstr,lstr
    variable npts
    
    npts = ItemsInList(strlist)
    strlist = SortList(strlist)
    fstr = StringFromList(0,strlist)
    lstr = StringFromList(npts-1,strlist)
    
    npts = min(strlen(fstr),strlen(lstr))
    make/N=(npts)/FREE rwstr    
    rwstr = (cmpstr(fstr[p],lstr[p])==0) ? 1 : 0
 
    FindValue/V=0 rwstr
    npts = V_value - 1
    
    rtstr = RemoveEnding(fstr[0,npts],oend)
    
    return rtstr
    
end

I had initially thought to build a version that built a string (wave) as rwstr = ... ? faster[p] : "". As noted in a separate thread, the logical comparison would use SelectString(cmpstr(...),fstr[p],""). I don't know of any advantages unless the string can be built directly rather than using a text wave that must then be converted back to a string.

Log in or register to post comments

July 21, 2019 at 11:35 am - Permalink

topchem

This avoids an explicit loop, but implicitly iterates p over wave rwstr and strings fstr & lstr. Other than having more compact code is this still more efficient than a loop?

Log in or register to post comments

July 21, 2019 at 10:02 pm - Permalink

olelytken

Using p instead of a for loop is usually faster. You could probably also multithread the p line, although I suspect the slowest bit in the code is the SortList line.

MultiThread rwstr = (cmpstr(fstr[p],lstr[p])==0) ? 1 : 0

Log in or register to post comments

July 21, 2019 at 11:05 pm - Permalink

thomas_braun

@topchem: How many items do you expect in the string list?

Log in or register to post comments

July 22, 2019 at 05:58 am - Permalink

topchem

Not many. Probably a maximum of 20.

With a few tens of sample folders.

I don't expect execution time to be an issue for my use case. I'm just always curious how different coding approaches scale.

Log in or register to post comments

July 22, 2019 at 09:26 am - Permalink

jjweimer

By habit, I try to avoid explicit for loops when I can construct implicit ones. They should generally be faster and they are easier for me to debug visually.

When the strings in the list all have at least one character past a common prefix, you can eliminate the list sorting and choose any two of the items in the list (e.g. the first and last).

Log in or register to post comments

July 22, 2019 at 09:45 am - Permalink

thomas_braun

With only 20 list items I think the solution from jjweimer is fine. With more than a couple of thousands switchting to using textwaves instead could be faster.

Log in or register to post comments

July 22, 2019 at 01:13 pm - Permalink