Regex advice

Hello, 

I am trying to filter a 1D text wave, wave1, that looks something like this: 100.0787, C4H7O2, C7H8NO3, 230.0045, C8H8NS, ....

I am looking for a regular expression that extracts only the CHO-containing cells from this wave.

I tried Grep/INDX/Q/E={"[NS]",1} wave1. This removes all N- and S-containing compounds but leaves the numeric entries in place. I need the output to be a purely CHO-containing wave.

Please advise.

Sincerely, 

Peeyush

You can pass multiple regular expressions to Grep, so try these:

May contain only C, H, O and digits, and must contain at least one of C, H or O:

grep/E="^[CHO0-9]*$"/E="[CHO]"

Must contain C, H and O (on its own this pattern does not exclude other elements):

grep/E="(?=.*C)(?=.*H)(?=.*O).+"

 

Hi Tony, 

Thanks for the advice. I tried these but somehow I'm still struggling. Please see the 1D wave attached that I'm trying to filter. 

When "Na" is not at the beginning of a formula it is an adduct ion, so I use ReplaceString to replace it with "" before proceeding. I do this because I'm working with multiple filters, one of which is for CHN species (i.e. compounds containing C, H and N but no other elements in the formula). Here, the "N" from "Na" interferes with extracting the reduced nitrogen-containing species.

Meanwhile, my attempt at the CHO species somehow does not return only the formulas that contain purely C, H and O.

Would be thankful for your advice. 

Sincerely, 

Peeyush

Species_table.pxp (38.7 KB)

Perhaps this will work.

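// Function to remove unwanted entries from a text wave.
// Input: text wave; returns a semicolon-separated string list.
// keepdigits: if omitted, entries that do not start with a letter (e.g. masses) are dropped
// ignorelttrs: if given, entries containing any of these letters (case-insensitive) are dropped
Function/S grep_removestuff(wave strwave, [variable keepdigits, string ignorelttrs])
   
    string rstr
    // write every wave element into a semicolon-separated list
    wfprintf rstr, "%s;", strwave
    if (ParamIsDefault(keepdigits))
        rstr = GrepList(rstr,"^[[:alpha:]]")   
    endif
   
    if (!ParamIsDefault(ignorelttrs))
        string mtchstr
        mtchstr = "(?i)[" + ignorelttrs +"]+"
        rstr = GrepList(rstr,mtchstr,1)   // 1 = keep only items that do NOT match
    endif
    return rstr
end

grep_removestuff(speciesmasstext_out,ignorelttrs="nmi")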

This one

grep/E="^[CHO0-9]*$"/E="[CHO]"

ensures that all characters fall within those specified in the character class in the first regular expression, and at least one character matches those specified in the second. That excludes Na. If you want to catch the formulae with a trailing +, you have to add that to the first expression.

Try

grep/E="^[CHO0-9+]*$"/E="[CHO]" speciesmasstext_out

I don't *think* you have to escape the + in the character class.
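For anyone who wants to check the two patterns outside Igor, here is a minimal sketch in Python. It assumes Python's re module treats these particular patterns (anchors and a simple character class with an unescaped +) the same way Igor's regex engine does. The name keep_cho and the last three sample entries are made up for illustration; the first five entries come from the original question.

```python
import re

# Both conditions must hold, mirroring the two /E flags on Grep.
ONLY_CHO = re.compile(r"^[CHO0-9+]*$")  # every character is C, H, O, a digit or +
HAS_ELEMENT = re.compile(r"[CHO]")      # at least one C, H or O is present


def keep_cho(items):
    """Return the entries that satisfy both patterns."""
    return [s for s in items if ONLY_CHO.search(s) and HAS_ELEMENT.search(s)]


# First five entries are from the question; "C4H7O2+", "C6H12O6Na" and "2300"
# are hypothetical, added to exercise the +, the Na adduct and a bare integer.
sample = ["100.0787", "C4H7O2", "C7H8NO3", "230.0045", "C8H8NS",
          "C4H7O2+", "C6H12O6Na", "2300"]

print(keep_cho(sample))  # ['C4H7O2', 'C4H7O2+']
```

The masses with a decimal point are already rejected by the first pattern because "." is not in the character class. The second pattern is what removes an entry made of digits only, such as the hypothetical "2300".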

If the output must have all three elements, use

grep/E="^[CHO0-9+]*$"/E="(?=.*C)(?=.*H)(?=.*O).+" speciesmasstext_out

or simply (perhaps less simply?)

grep/E="(?=.*C)(?=.*H)(?=.*O)^[CHO0-9+]*$" speciesmasstext_out
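The single-expression version can be checked the same way. This is a sketch in Python, assuming its re module handles lookaheads and anchors like Igor's engine for this pattern; keep_all_three and the sample list are hypothetical and only there to show which entries pass.

```python
import re

# The lookaheads require a C, an H and an O somewhere in the entry; the
# anchored character class then limits the whole entry to C, H, O, digits and +.
ALL_THREE = re.compile(r"(?=.*C)(?=.*H)(?=.*O)^[CHO0-9+]*$")


def keep_all_three(items):
    """Return entries made only of C, H, O, digits and + that contain all three elements."""
    return [s for s in items if ALL_THREE.search(s)]


# Hypothetical sample: C6H6 lacks O, CO2 lacks H, C7H8NO3 contains N.
sample = ["C4H7O2", "C6H6", "CO2", "C7H8NO3", "C4H7O2+", "100.0787"]

print(keep_all_three(sample))  # ['C4H7O2', 'C4H7O2+']
```

Without the anchored character class, the lookaheads alone would also accept C7H8NO3, since it does contain C, H and O.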