Reading and extracting different sections from a file in one go?

Function FindGND_Energy(pathName, filePath, firstDataLine, numDataLines)
	String pathName			// Name of symbolic path or ""
	String filePath			// Name of file or partial path relative to symbolic path
	Variable &firstDataLine	// Pass-by-reference output
	Variable &numDataLines	// Pass-by-reference output

	firstDataLine = -1
	numDataLines = -1

	Variable refNum
	Open/R/P=$pathName refNum as filePath

	String buffer, text
	Variable line = 0
	String targetString = " Total energy (H) ="
	Variable targetStringLength = strlen(targetString)

	// Find first line
	do
		FReadLine refNum, buffer
		if (strlen(buffer) == 0)
			Close refNum
			return -1	// The expected keyword was not found in the file
		endif
		text = buffer[0,targetStringLength-1]
		if (CmpStr(text,targetString) == 0)
			firstDataLine = line
			break	// This is the first data line
		endif
		line += 1
	while(1)

	// Find last line
	targetString = " Nuc-nuc energy (H) ="
	targetStringLength = strlen(targetString)
	do
		FReadLine refNum, buffer
		if (strlen(buffer) == 0)
			// Ran out of lines - assume this is the last line of data
			line += 1
			break
		endif
		text = buffer[0,targetStringLength-1]
		if (CmpStr(text,targetString) != 0)	// Line does not start with targetString?
			// This is the line after the last data line
			break
		endif
		line += 1
	while(1)

	numDataLines = line - firstDataLine + 1
	// Print firstDataLine, numDataLines	// For debugging only

	Close refNum
	return 0	// Success
End

Function extractGND_En(pathName, filePath, extension, cAtom)
	String pathName		// Name of symbolic path or "" to display dialog.
	String filePath		// Name of file or "" to display dialog. Can also be full or partial path relative to symbolic path.
	String extension
	Variable cAtom

	Variable refNum

	// Possibly display Open File dialog.
	if ((strlen(pathName)==0) || (strlen(filePath)==0))
		Open /D /R /P=$pathName /T=(extension) refNum as filePath
		filePath = S_fileName	// S_fileName is set by Open/D
		if (strlen(filePath) == 0)	// User cancelled?
			return -1
		endif
		// filePath is now a full path to the file.
	endif

	Variable firstDataLine, numLines
	Variable GNDE = FindGND_Energy(pathName, filePath, firstDataLine, numLines)
	if (GNDE != 0)
		Printf "No data found in file %s\r", filePath
		return -1
	endif

	String DFTData = RemoveEnding(filePath,"gnd.out")

	String columnInfoStr = ""	// Prepare parameter for /B flag
	columnInfoStr += "N='_skip_',W=1;"
	columnInfoStr += "N='_skip_',W=20;"
	columnInfoStr += "N='_skip_',W=3;"
	columnInfoStr += "N='GND_En_',W=16;"
	columnInfoStr += "N='_skip_',W=40;"

	LoadWave /F={6, 11, 0} /B=columnInfoStr /D /O /K=0 /L={0,firstDataLine,numLines,0,0} /A /P=$pathName filePath

	Wave GND_En_
	String currentGNDname = "GND_En_" + num2str(cAtom)
	Duplicate/O GND_En_,$currentGNDname
	KillWaves/Z GND_En_
	SetDataFolder root:Packages:DFTClustering
	return 0
End

Function LoadAllGND_En(pathName,filePath,nFiles)	//Loads all .out files from StoBe containing info necessary to build tensor
	String pathName	//Name of symbolic path of folder containing files to be loaded. You can also type "" to create a new symbolic path.
	String filePath	//Name of file to be loaded as defined in LoadDFTData procedure
	Variable nFiles

	Variable i = 0	//Used to index files.
	Variable result

	if (strlen(pathName)==0)
		NewPath/O tempPath
		if (V_flag !=0)
			return -1
		endif
		pathName = "tempPath"
	endif

	i=0
	for(i=0;i<nFiles;i+=1)	//Loops through every file in folder with a .out extension
		filePath = SortList(IndexedFile($pathName,-1,".out"),";",16)
		String currentFile = StringFromList(i,filePath)
		currentFile = RemoveEnding(currentFile,";")
		if (strlen(filePath) ==0)	//if there are no more files
			break	//end loop
		endif
		result = extractGND_En(pathName, currentFile, ".out",i+1)
	endfor	//The loop ends after there are no more files with a .out extension present

	if (Exists("tempPath"))
		KillPath tempPath
	endif

	print filePath
	String GNDList = WaveList("GND_En*",";","")
	Make/O/N=(nFiles) GND
	for(i=0;i<nFiles;i+=1)
		String cGNDname = StringFromList(i,GNDList)
		Wave w = $cGNDname
		GND[i] = w[0]
		KillWaves/Z w
	endfor
	return 0
End

chozo

I assume the sections are marked by the lines you define as targetString, right? If not, how would you tell the different sections of a file apart? From a quick look at your functions, the easiest approach seems to be to wrap another loop around the line search in FindGND_Energy(), so that instead of a single firstDataLine and numDataLines you build a list with one entry per section (if there is more than one). You could, for example, replace the variables with a wave that is extended as more sections are found; a wave with a single entry is then the previous case with only one section. Then, within extractGND_En(), loop through this list to load all sections successively. An alternative would be to merge both functions and refactor the code so that a section's data is loaded as soon as both firstDataLine and numDataLines are set, repeating this in a loop until all sections are processed.
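The idea of replacing the two variables with a wave that grows per section could look roughly like this. It is only a sketch: AppendSection, DemoSectionList, and the wave name sections are hypothetical names not found in the original code, and the line numbers in the demo are made up.

```igor
// Hypothetical helper: append one (firstDataLine, numDataLines) pair
// as a new row of a 2-column wave.
Function AppendSection(sections, firstDataLine, numDataLines)
	Wave sections
	Variable firstDataLine, numDataLines

	Variable n = DimSize(sections, 0)
	Redimension/N=(n+1, 2) sections	// Add one row; existing rows are kept
	sections[n][0] = firstDataLine
	sections[n][1] = numDataLines
End

// Hypothetical demo: build the list, then walk through it.
Function DemoSectionList()
	Make/O/N=(0, 2) sections
	AppendSection(sections, 12, 1)	// Made-up line numbers for illustration
	AppendSection(sections, 40, 5)

	Variable i
	for (i = 0; i < DimSize(sections, 0); i += 1)
		Printf "Section %d: first line %d, %d line(s)\r", i, sections[i][0], sections[i][1]
	endfor
End
```

A wave with a single row reproduces the current one-section behaviour, so the loading code only needs one loop over the rows.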

July 8, 2020 at 02:32 am - Permalink

vmmr5596

That's correct. The different sections have different values for targetString, and each has its own read and load functions. Your suggestion makes sense. I'll see if I can implement it. Thanks!

July 8, 2020 at 02:42 am - Permalink

jjweimer

Create a wave of file paths.

wave/T filepaths

Use the first function to loop through all files in this wave. Store the target results in a matrix wave.

wave/N=(nfiles,2) startline, nlines

Use the second function with that wave as input to loop again through all files to get the values.

This modular approach is easier to debug in a step-wise manner. Also, you might consider passing the start and end strings as inputs to the first function rather than hard coding them in the function. This change will make it easier to re-use the code to parse through the file set for the locations of other start/end tags.
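As a rough sketch of the first step, assuming a text wave with one file name per row and the FindGND_Energy function from the question left unchanged. The names FindAllGND_Energy, fileNames, and sectionInfo are hypothetical.

```igor
// Hypothetical driver: fileNames holds one file name per row.
// Column 0 of sectionInfo is the first data line, column 1 the line count.
Function FindAllGND_Energy(pathName, fileNames)
	String pathName
	Wave/T fileNames

	Variable nFiles = numpnts(fileNames)
	Make/O/N=(nFiles, 2) sectionInfo = NaN

	Variable i, firstDataLine, numDataLines
	for (i = 0; i < nFiles; i += 1)
		if (FindGND_Energy(pathName, fileNames[i], firstDataLine, numDataLines) == 0)
			sectionInfo[i][0] = firstDataLine
			sectionInfo[i][1] = numDataLines
		endif	// Rows for files without the keyword stay NaN
	endfor
End
```

The second step would then loop over the same rows and pass each pair of values to the loading function, which keeps locating and loading separately testable.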

July 8, 2020 at 05:16 am - Permalink

aclight

The previous suggestions are both good ones.

First, your FindGND_Energy function loops through the file twice. There's no point in doing this, and until Igor 9 (in beta soon), FReadLine is a lot slower than it could be. I would refactor the function to loop through once and test each line for the start and end strings. Even better, if you can rely on the fact that the end string won't be before the start string, and/or that the start string will only be present once, then you can start by testing each line for only the start string. Once you find it, you test each line for only the end string.
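A minimal sketch of that single-pass idea, with the start and end strings passed in as parameters. FindSection, startTag, and endTag are hypothetical names. Note one assumption that differs from the posted code: here the section is taken to run from the line that begins with startTag through the line that begins with endTag, and both tags are assumed to be non-empty.

```igor
Function FindSection(pathName, filePath, startTag, endTag, firstDataLine, numDataLines)
	String pathName, filePath
	String startTag, endTag		// Assumed non-empty
	Variable &firstDataLine		// Pass-by-reference output
	Variable &numDataLines		// Pass-by-reference output

	firstDataLine = -1
	numDataLines = -1

	Variable refNum
	Open/R/P=$pathName refNum as filePath

	String buffer
	String currentTag = startTag	// Only one tag is tested per line
	Variable line = 0
	do
		FReadLine refNum, buffer
		if (strlen(buffer) == 0)
			break	// End of file
		endif
		if (CmpStr(buffer[0, strlen(currentTag)-1], currentTag) == 0)
			if (firstDataLine < 0)
				firstDataLine = line
				currentTag = endTag	// From here on, test only for the end tag
			else
				numDataLines = line - firstDataLine + 1
				break
			endif
		endif
		line += 1
	while(1)

	Close refNum
	if (numDataLines < 0)
		return -1	// Start or end tag not found
	endif
	return 0
End
```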

An alternative to using FReadLine in a loop would be to use Grep/Q/INDX. Grep will basically do the same thing you're doing now (run through the file a line at a time, attempting to match each line to a regular expression), but it will execute as C++ code instead of an Igor loop, which might be substantially faster. Note that since Grep uses the same code under the hood as FReadLine to read the file, it too is slower than it could be (until IP9).

If you use the Grep approach, you have two options. The first is to call Grep twice, once with an expression for the start string and once for the end string. This will result in you looping through the file twice, but coming up with the regular expression is simpler and your code might be more reliable. The other option would be to create a regular expression that matches either the start or the end string, and call Grep only one time. The W_Index wave would then have 2 entries (presumably), and you would treat the first as the start row number and the second as the end row number. If you can make assumptions about your files this may be the best approach.
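The single-call variant might be sketched as follows. This is untested and rests on assumptions: that the start tag appears once and before the end tag, that W_Index is created in the current data folder, and that the section runs from the start-tag line through the end-tag line. FindSectionWithGrep is a hypothetical name.

```igor
Function FindSectionWithGrep(pathName, filePath, firstDataLine, numDataLines)
	String pathName, filePath
	Variable &firstDataLine		// Pass-by-reference output
	Variable &numDataLines		// Pass-by-reference output

	firstDataLine = -1
	numDataLines = -1

	// Match a line that starts with either tag. The parentheses are escaped;
	// "\\" in an Igor string literal yields a single backslash.
	String regExp = "^ (Total|Nuc-nuc) energy \\(H\\) ="

	Grep/Q/INDX/P=$pathName/E=regExp filePath
	Wave/Z W_Index				// Zero-based line numbers of the matches
	if (!WaveExists(W_Index))
		return -1
	endif
	if (numpnts(W_Index) < 2)
		return -1				// Start and/or end tag missing
	endif

	firstDataLine = W_Index[0]
	numDataLines = W_Index[1] - W_Index[0] + 1
	return 0
End
```

Calling Grep twice with one simpler expression each would avoid the alternation in the regular expression at the cost of a second pass through the file.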

I also recommend that you change extractGND_En to return either a wave reference to the output wave (the one created by the line "Duplicate/O GND_En_,$currentGNDname") or a string containing the name of that wave. The reason is that the WaveList command you use later gives you a list of waves in an undefined order. Currently the order is the order in which the waves were added to the data folder, which is probably what you want, but future changes to your code (e.g. loading in multiple threads) could result in the waves being added to the data folder in a different order, which might throw off your downstream code. An alternative would be to just call SortList on the output of WaveList to sort the wave names alphanumerically. That way your GNDList string always has a defined order.
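The SortList alternative could be sketched like this, reusing option 16, the same sort mode the posted loader already applies to the file list. CollectGND is a hypothetical name, and the loop assumes every matching wave has at least one point.

```igor
Function CollectGND()
	// Sort the wave names so the order no longer depends on creation order.
	String GNDList = SortList(WaveList("GND_En_*", ";", ""), ";", 16)
	Variable n = ItemsInList(GNDList)
	Make/O/N=(n) GND

	Variable i
	for (i = 0; i < n; i += 1)
		Wave w = $StringFromList(i, GNDList)
		GND[i] = w[0]	// First (only) value loaded for this file
	endfor
End
```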

July 8, 2020 at 08:48 am - Permalink

vmmr5596

These are all great suggestions!

Making the code more modular is definitely one of the things I'm aiming for so that seems like a good strategy to accomplish that.

I can definitely rely on the start string only happening once and the end string always happening after the start string, so I could refactor the code to only loop through the file once. Thanks for pointing that out! Switching to Grep might be worthwhile since some of the files I'm reading can be quite large at times.

There is one more important thing that I failed to mention regarding the different sections and it's that they have different column structures. Therefore, I will need a different /B flag for the LoadWave command. This is one of the main reasons I ended up structuring my routine the way I did since I wasn't sure if there was a better way. Is there a workaround for this?

Thanks for all the help!

Here are the columnInfoString values I use for the different sections:

Section 1 columnInfoStr

columnInfoStr += "N='_skip_',W=1;"
columnInfoStr += "N='_skip_',W=20;"
columnInfoStr += "N='_skip_',W=3;"
columnInfoStr += "N='GND_En_',W=16;"
columnInfoStr += "N='_skip_',W=40;"

Section 2 columnInfoStr

columnInfoStr += "N='_skip_',W=2;"
columnInfoStr += "N='_skip_',W=4;"
columnInfoStr += "N='LUMO_En_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=14;"
columnInfoStr += "N='_skip_',W=14;"
columnInfoStr += "N='_skip_',W=14;"

Section 3 columnInfoStr

columnInfoStr += "N='_skip_',W=1;"
columnInfoStr += "N='LUMO_POS_',W=7;"
columnInfoStr += "N='OCCU_',W=7;"
columnInfoStr += "N='_skip_',W=15;"
columnInfoStr += "N='_skip_',W=6;"
columnInfoStr += "N='_skip_',W=6;"
columnInfoStr += "N='_skip_',W=7;"
columnInfoStr += "N='_skip_',W=15;"
columnInfoStr += "N='_skip_',W=6;"
columnInfoStr += "N='_skip_',W=6;"

July 8, 2020 at 04:29 pm - Permalink

aclight

Your different columnInfoStr strings could be input parameters to the common routine that loads a section of the file.
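One way this could look, as a sketch: the LoadWave flags are copied from the posted extractGND_En, while LoadSection and Section1Columns are hypothetical names.

```igor
// Hypothetical common loader: everything section-specific arrives as a parameter.
Function LoadSection(pathName, filePath, firstDataLine, numDataLines, columnInfoStr)
	String pathName, filePath
	Variable firstDataLine, numDataLines
	String columnInfoStr	// Column description for the LoadWave /B flag

	LoadWave/F={6, 11, 0}/B=columnInfoStr/D/O/K=0/L={0, firstDataLine, numDataLines, 0, 0}/A/P=$pathName filePath
	return V_flag			// Number of waves loaded
End

// Column description for section 1, copied from the post above.
Function/S Section1Columns()
	String s = ""
	s += "N='_skip_',W=1;"
	s += "N='_skip_',W=20;"
	s += "N='_skip_',W=3;"
	s += "N='GND_En_',W=16;"
	s += "N='_skip_',W=40;"
	return s
End
```

A call such as LoadSection(pathName, filePath, firstDataLine, numLines, Section1Columns()) would then load section 1, and the other sections only need their own column-description function.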

July 23, 2020 at 07:15 am - Permalink

vmmr5596

Thank you so much everyone for the suggestions! I got it working via a combination of the different approaches mentioned in the thread.

July 23, 2020 at 09:56 am - Permalink