Find longest common substring in a list?
Greetings all,
I'm wondering if there is a known method (built-in or code snippets) to solve the following problem.
I have several lists of strings whose entries are generally of the form "SampleID_ReplicateID;".
SampleID and ReplicateID are manually entered by the operators when collecting data and the combined entry is automatically applied to the output. SampleID is entered once for a sample, and ReplicateID is entered for every replicate run. Each list represents one SampleID.
Unfortunately I cannot enforce a format for the ReplicateID. It is freeform text. It could be 1,2,3,... or A,B,C,... or even Jane,Paul,Mary,...
I want to extract the SampleID by itself from each List.
SampleID and ReplicateID may contain "_" so it is not as simple as searching for that character.
But since SampleID will be the same in each list, I basically want to search for the longest substring that is common amongst each list item. (If there are common leading characters in the ReplicateID I may try to deal with them, but I don't think that will actually affect my subsequent processing.)
Any suggestions?
--Andreas
Edit: I had posted an idea but then I re-read your post and realized I was not understanding the problem. If you could post a couple of example lists, that would help me understand the problem better.
July 19, 2019 at 04:02 pm - Permalink
In reply to If I understand you… by gsb
Here is an example list
P20_C1_F1_R3__1903_036;P20_C1_F1_R3__1903_037;P20_C1_F1_R3__1903_038;P20_C1_F1_R3__1903_039;P20_C1_F1_R3__1903_040;
In this case, the SampleID = P20_C1_F1_R3__1903
The longest common substring would actually be P20_C1_F1_R3__1903_0 (due to the leading zero on the ReplicateID), but I can probably handle that downstream.
Another example is
P7_PRE_B;P7_1_A;P7_1_B;P7_2_A;P7_2_B;
Here SampleID = P7, which is also the longest substring after removal of the terminal "_".
July 19, 2019 at 05:12 pm - Permalink
In reply to Here is an example list P20… by topchem
If sampleID is always the start of each list item, you could take a simple brute-force approach and figure out how many characters it is until the strings don't match. E.g., the following seems to work if you run testit()
String list
String &listWithoutStartStr //optional -- will hold list without startStr
WAVE/T txt = listtotextwave(list,";")
String testChar=(txt[0])[0],commonStartStr=""
Variable i,items=dimsize(txt,0),count,doBreak=0
for (count=0;strlen(testChar)>0;)
for (i=1;i<items;i+=1)
if (!stringmatch( (txt[i])[count],testChar))
doBreak=1
break
endif
endfor
if (doBreak)
break
endif
commonStartStr+=testChar
count+=1
testChar = (txt[0])[count]
endfor
if ( !paramIsDefault(listWithoutStartStr) )
txt = ReplaceString(commonStartStr,txt[p],"",0,1)
wfprintf listWithoutStartStr,"%s;",txt
endif
return commonStartStr
end
function testit()
String blah
print longestCommonStartStr("aaadaab;aaaaa;aaaaac;",listWithoutStartStr=blah)
print blah
end
July 19, 2019 at 06:21 pm - Permalink
In reply to If sampleID is always the… by gsb
That indeed seems to do what I want.
Thanks.
July 19, 2019 at 09:45 pm - Permalink
If you have the name as a string wave you could sort the wave alphabetically with Sort and then find the longest common string of the first and last strings in the wave. That way you only have to compare two strings and not all of them.
July 21, 2019 at 04:02 am - Permalink
In reply to If you have the name as a… by olelytken
With Igor 7 or later you can use the ListToTextWave function to convert a string list into a text wave. However for the purposes of this question you could also just use SortList (probably on a copy of the string list) without first converting the string list into a text wave.
July 21, 2019 at 06:55 am - Permalink
This incorporates the suggestion from Olelytken and Adam. It avoids an explicit for loop.
string strlist,oend
if (ParamIsDefault(oend))
oend = "_"
endif
string rtstr = "",fstr,lstr
variable npts
npts = ItemsInList(strlist)
strlist = SortList(strlist)
fstr = StringFromList(0,strlist)
lstr = StringFromList(npts-1,strlist)
npts = min(strlen(fstr),strlen(lstr))
make/N=(npts)/FREE rwstr
rwstr = (cmpstr(fstr[p],lstr[p])==0) ? 1 : 0
FindValue/V=0 rwstr
npts = V_value - 1
rtstr = RemoveEnding(fstr[0,npts],oend)
return rtstr
end
I had initially thought to build a version that built a string (wave) as rwstr = ... ? faster[p] : "". As noted in a separate thread, the logical comparison would use SelectString(cmpstr(...),fstr[p],""). I don't know of any advantages unless the string can be built directly rather than using a text wave that must then be converted back to a string.
July 21, 2019 at 11:35 am - Permalink
In reply to This incorporates the… by jjweimer
This avoids an explicit loop, but implicitly iterates p over wave rwstr and strings fstr & lstr. Other than having more compact code is this still more efficient than a loop?
July 21, 2019 at 10:02 pm - Permalink
Using p instead of a for loop is usually faster. You could probably also multithread the p line, although I suspect the slowest bit in the code is the SortList line.
MultiThread rwstr = (cmpstr(fstr[p],lstr[p])==0) ? 1 : 0
July 21, 2019 at 11:05 pm - Permalink
@topchem: How many items do you expect in the string list?
July 22, 2019 at 05:58 am - Permalink
In reply to @topchem: How many items do… by thomas_braun
Not many. Probably a maximum of 20.
With a few tens of sample folders.
I don't expect execution time to be an issue for my use case. I'm just always curious how different coding approaches scale.
July 22, 2019 at 09:26 am - Permalink
By habit, I try to avoid explicit for loops when I can construct implicit ones. They should generally be faster and they are easier for me to debug visually.
When the strings in the list all have at least one character past a common prefix, you can eliminate the list sorting and choose any two of the items in the list (e.g. the first and last).
July 22, 2019 at 09:45 am - Permalink
With only 20 list items I think the solution from jjweimer is fine. With more than a couple of thousands switchting to using textwaves instead could be faster.
July 22, 2019 at 01:13 pm - Permalink