Reading and extracting different sections from a file in one go?
Hi there,
I have a routine that I use to read the contents from a file and place them into waves. The read and load function works fine, however, as of now, the process I use requires me to load the same file multiple times to extract the contents I want. For example, let's say I have some file called File A, and this file has 3 different sections that I would like to extract.
File A would look something like this:
stuff
...
beginning of section 1
...
end of section 1
stuff
beginning of section 2
...
end of section 2
stuff
beginning of section 3
...
end of section 3
stuff
The function I use calls FReadLine in a loop, looking for one targetString that identifies the beginning of the section and another that identifies the end of the section. This read function is called from another function that loads the data from that file into the appropriate waves in Igor, using the /B flag of LoadWave to define the widths of the various columns. As a result, I end up calling this function with different target strings for each section I want, which has gotten to the point where it's somewhat annoying to use. What's a good strategy to accomplish this in a single pass? Any nudge in the appropriate direction would be appreciated. Thanks!
Here are examples of the functions I use.
This is the function that reads the document and looks for the target strings identifying the current section I want
Function FindGND_Energy(pathName, filePath, firstDataLine, numDataLines)
	String pathName // Name of symbolic path or ""
	String filePath // Name of file or partial path relative to symbolic path
	Variable &firstDataLine // Pass-by-reference output
	Variable &numDataLines // Pass-by-reference output

	firstDataLine = -1
	numDataLines = -1

	Variable refNum
	Open/R/P=$pathName refNum as filePath

	String buffer, text
	Variable line = 0
	String targetString = " Total energy (H) ="
	Variable targetStringLength = strlen(targetString)

	// Find first line
	do
		FReadLine refNum, buffer
		if (strlen(buffer) == 0)
			Close refNum
			return -1 // The expected keyword was not found in the file
		endif
		text = buffer[0,targetStringLength-1]
		if (CmpStr(text,targetString) == 0)
			firstDataLine = line
			break // This is the first data line
		endif
		line += 1
	while(1)

	// Find last line
	targetString = " Nuc-nuc energy (H) ="
	targetStringLength = strlen(targetString)
	do
		FReadLine refNum, buffer
		if (strlen(buffer) == 0)
			// Ran out of lines - assume this is the last line of data
			line += 1
			break
		endif
		text = buffer[0,targetStringLength-1]
		if (CmpStr(text,targetString) != 0) // Line does not start with the second target string
			// This is the line after the last data line
			break
		endif
		line += 1
	while(1)

	numDataLines = line - firstDataLine + 1
	// Print firstDataLine, numDataLines // For debugging only
	Close refNum
	return 0 // Success
End
This is the function used to load an individual file using the read function for a particular section of the document.
Function extractGND_En(pathName, filePath, extension, cAtom)
	String pathName // Name of symbolic path or "" to display dialog.
	String filePath // Name of file or "" to display dialog. Can also be full or partial path relative to symbolic path.
	String extension
	Variable cAtom

	Variable refNum

	// Possibly display Open File dialog.
	if ((strlen(pathName)==0) || (strlen(filePath)==0))
		Open /D /R /P=$pathName /T=(extension) refNum as filePath
		filePath = S_fileName // S_fileName is set by Open/D
		if (strlen(filePath) == 0) // User cancelled?
			return -1
		endif
		// filePath is now a full path to the file.
	endif

	Variable firstDataLine, numLines
	Variable GNDE = FindGND_Energy(pathName, filePath, firstDataLine, numLines)
	if (GNDE != 0)
		Printf "No data found in file %s\r", filePath
		return -1
	endif

	String DFTData = RemoveEnding(filePath,"gnd.out")
	String columnInfoStr = "" // Prepare parameter for /B flag
	columnInfoStr += "N='_skip_',W=1;"
	columnInfoStr += "N='_skip_',W=20;"
	columnInfoStr += "N='_skip_',W=3;"
	columnInfoStr += "N='GND_En_',W=16;"
	columnInfoStr += "N='_skip_',W=40;"
	LoadWave /F={6, 11, 0} /B=columnInfoStr /D /O /K=0 /L={0,firstDataLine,numLines,0,0} /A /P=$pathName filePath

	// Rename the loaded wave so that the next file does not overwrite it
	Wave GND_En_
	String currentGNDname = "GND_En_" + num2str(cAtom)
	Duplicate/O GND_En_,$currentGNDname
	KillWaves/Z GND_En_
	SetDataFolder root:Packages:DFTClustering
	return 0
End
This is the function that takes the above functions and uses them to load multiple files:
	String pathName //Name of symbolic path of folder containing files to be loaded. You can also type "" to create a new symbolic path.
	String filePath //Name of file to be loaded as defined in LoadDFTData procedure
	Variable nFiles
	Variable i = 0 //Used to index files.
	Variable result

	if (strlen(pathName)==0)
		NewPath/O tempPath
		if (V_flag !=0)
			return -1
		endif
		pathName = "tempPath"
	endif

	// Build the sorted list of .out files once, before the loop
	filePath = SortList(IndexedFile($pathName,-1,".out"),";",16)
	for(i=0;i<nFiles;i+=1) //Loops through the first nFiles files with a .out extension
		String currentFile = StringFromList(i,filePath)
		currentFile = RemoveEnding(currentFile,";")
		if (strlen(currentFile) ==0) //if there are no more files
			break //end loop
		endif
		result = extractGND_En(pathName, currentFile, ".out",i+1)
	endfor //The loop ends after nFiles files or when there are no more files with a .out extension

	// Exists() does not report symbolic paths, so use /Z to kill tempPath only if it exists
	KillPath/Z tempPath
	print filePath

	String GNDList = WaveList("GND_En*",";","")
	Make/O/N=(nFiles) GND
	for(i=0;i<nFiles;i+=1)
		String cGNDname = StringFromList(i,GNDList)
		Wave w = $cGNDname
		GND[i] = w[0]
		KillWaves/Z w
	endfor
	return 0
End
I assume the sections are marked with the lines you are defining as targetString, right? If not, how would you discern the different sections of a file? A quick look at your functions gives me the impression that the easiest approach would be to build another loop around the line-searching part of FindGND_Energy(), so that instead of a single value of firstDataLine and numDataLines you build a list with one entry per section (if there is more than one). You could, for example, replace the variables with a wave that gets extended as more sections are found. A single-entry wave is then the previous case with only one section. Then, within extractGND_En(), loop through this list to load all sections successively. An alternative approach would be to merge both functions and refactor the code so that the section data is loaded as soon as both firstDataLine and numDataLines are set, and then repeat this process in a loop until all sections are processed.
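A minimal sketch of the first idea, assuming the sections appear in the file in the same order as their markers. The function name FindAllSections, the text waves startTags and endTags, and the two-column output wave sectionInfo are all hypothetical names, not part of the posted code. Here the line count of a section runs from its start marker to its end marker inclusive, which may differ from what the posted FindGND_Energy reports.

```igor
// Hypothetical helper: scans the file once and records every section.
// startTags[i] and endTags[i] are the marker strings of section i.
// Assumes the sections occur in the file in the same order as the tags.
Function FindAllSections(pathName, filePath, startTags, endTags, sectionInfo)
	String pathName			// Name of symbolic path or ""
	String filePath			// Name of file or partial path relative to symbolic path
	Wave/T startTags, endTags
	Wave sectionInfo		// Output: column 0 = first line, column 1 = number of lines

	Variable nSections = numpnts(startTags)
	Redimension/N=(nSections, 2) sectionInfo
	sectionInfo = -1		// -1 marks "not found"

	Variable refNum
	Open/R/P=$pathName refNum as filePath

	String buffer, tag
	Variable line = 0, section = 0, inSection = 0
	do
		if (section >= nSections)
			break			// All sections found
		endif
		FReadLine refNum, buffer
		if (strlen(buffer) == 0)
			break			// End of file
		endif
		// Only one marker is tested per line: the start marker while searching
		// for a section, the end marker while inside one.
		if (inSection)
			tag = endTags[section]
		else
			tag = startTags[section]
		endif
		if (CmpStr(buffer[0, strlen(tag) - 1], tag) == 0)
			if (inSection)
				sectionInfo[section][1] = line - sectionInfo[section][0] + 1
				inSection = 0
				section += 1
			else
				sectionInfo[section][0] = line
				inSection = 1
			endif
		endif
		line += 1
	while (1)

	Close refNum
	return section			// Number of complete sections found
End
```

The caller would create the three waves first (for example with Make/O/T for the two marker waves and Make/O for sectionInfo) and then loop over the rows of sectionInfo, passing each pair of values to the /L flag of LoadWave.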
July 8, 2020 at 02:32 am - Permalink
In reply to I assume the sections are… by chozo
That's correct. The different sections have different values for targetString and have their own read and load functions associated with it. Your suggestion makes sense. I'll see if I can implement it. Thanks!
July 8, 2020 at 02:42 am - Permalink
Create a wave of file paths.
Make/T/N=(nfiles) filepaths
Use the first function to loop through all files in this wave. Store the target results in a matrix wave.
Make/N=(nfiles,2) startline, nlines
Use the second function with that wave as input to loop again through all files to get the values.
This modular approach is easier to debug in a step-wise manner. Also, you might consider passing the start and end strings as inputs to the first function rather than hard coding them in the function. This change will make it easier to re-use the code to parse through the file set for the locations of other start/end tags.
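The two-step approach might look roughly like the following sketch. ScanOutFiles is a hypothetical name, FindGND_Energy is the search function posted at the top of this thread (so the sketch compiles only alongside it), and for simplicity the results go into two 1D waves with one row per file rather than matrix waves.

```igor
// Hypothetical driver for the first step: locate the section in every file.
// Relies on FindGND_Energy as posted earlier in this thread.
Function ScanOutFiles(pathName)
	String pathName		// Name of an existing symbolic path

	// Collect the names of all .out files in a text wave
	String fileList = SortList(IndexedFile($pathName, -1, ".out"), ";", 16)
	Variable nFiles = ItemsInList(fileList)
	Make/O/T/N=(nFiles) filepaths
	filepaths = StringFromList(p, fileList)

	// One row per file; -1 means the section was not found
	Make/O/N=(nFiles) startline, nlines
	startline = -1
	nlines = -1

	Variable i, first, count
	for (i = 0; i < nFiles; i += 1)
		// Wave elements cannot be passed by reference, so use local variables
		if (FindGND_Energy(pathName, filepaths[i], first, count) == 0)
			startline[i] = first
			nlines[i] = count
		endif
	endfor
	return nFiles
End
```

A second function can then loop over filepaths, startline, and nlines and call LoadWave for each row, which keeps the searching and the loading separately testable.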
July 8, 2020 at 05:16 am - Permalink
The previous suggestions are both good ones.
First, your FindGND_Energy function loops through the file twice. There's no point in doing this, and until Igor 9 (in beta soon), FReadLine is a lot slower than it could be. I would refactor the function to loop through once and test each line for the start and end strings. Even better, if you can rely on the fact that the end string won't be before the start string, and/or that the start string will only be present once, then you can start with testing each line for only the start string. Once you find it, then you test each line for only the end string.
An alternative to using FReadLine in a loop would be to use Grep/Q/INDX. Grep will basically do the same thing you're doing now (run through the file a line at a time, attempting to match each line to a regular expression), but it will execute as C++ code instead of an Igor loop, which might be substantially faster. Note that since Grep uses the same code under the hood as FReadLine to read the file, it too is slower than it could be (until IP9).
If you use the Grep approach, you have two options. The first is to call Grep twice, once with an expression for the start string and once for the end string. This will result in you looping through the file twice, but coming up with the regular expression is simpler and your code might be more reliable. The other option would be to create a regular expression that matches either the start or the end string, and call Grep only one time. The W_Index wave would then have 2 entries (presumably), and you would treat the first as the start row number and the second as the end row number. If you can make assumptions about your files this may be the best approach.
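A rough sketch of the single-call option, untested and assuming each marker occurs exactly once per file. FindSectionWithGrep is a hypothetical name, and the regular expression is only a guess built from the two marker strings shown in the original post. Here numDataLines is the span from the start marker to the end marker inclusive.

```igor
// Hypothetical replacement for the FReadLine loops, using Grep/Q/INDX.
Function FindSectionWithGrep(pathName, fileName, firstDataLine, numDataLines)
	String pathName			// Name of symbolic path
	String fileName			// Name of file relative to the symbolic path
	Variable &firstDataLine		// Pass-by-reference output
	Variable &numDataLines		// Pass-by-reference output

	firstDataLine = -1
	numDataLines = -1

	// Matches lines that begin with either marker.
	// The parentheses around H are escaped so they are matched literally.
	String regExp = "^ (Total energy|Nuc-nuc energy) \\(H\\) ="
	Grep/Q/INDX/P=$pathName/E=regExp fileName

	// W_Index holds the zero-based line numbers of the matching lines
	Wave/Z W_Index
	if (!WaveExists(W_Index) || numpnts(W_Index) < 2)
		return -1		// One or both markers not found
	endif

	firstDataLine = W_Index[0]
	numDataLines = W_Index[1] - W_Index[0] + 1
	return 0
End
```

If a marker can occur more than once, W_Index will have more than two points and the assumption above breaks, in which case two separate Grep calls are probably the safer choice.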
I also recommend that you change extractGND_En to return either a wave reference to the output wave (what is created by this line: "Duplicate/O GND_En_,$currentGNDname") or a string containing the name of the wave created by that same line. The reason I recommend this is that your WaveList command that you use later will give you a list of waves with undefined order. Currently the order is the order in which the wave was added to the data folder, which is probably what you want, but it is possible that future changes to your code (e.g. loading in multiple threads) could result in the waves being added to the data folder in a different order, which might throw off your downstream code. An alternative would be to just call SortList on the output of WaveList to sort the wave names alphanumerically. That way your GNDList string always has a defined order.
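Both options could be sketched as below. RenameLoadedWave and SortedGNDList are hypothetical helper names; the first mirrors the last lines of the posted extractGND_En but hands back the name of the wave it created, and the second applies SortList with the same option (16) the original code already uses for the file list.

```igor
// Hypothetical variant of the tail of extractGND_En: returns the name of the
// wave it created, or "" if LoadWave did not produce GND_En_.
Function/S RenameLoadedWave(cAtom)
	Variable cAtom

	Wave/Z GND_En_
	if (!WaveExists(GND_En_))
		return ""
	endif
	String currentGNDname = "GND_En_" + num2str(cAtom)
	Duplicate/O GND_En_, $currentGNDname
	KillWaves/Z GND_En_
	return currentGNDname
End

// Alternative: keep WaveList but give the result a defined order.
// Option 16 is a case-insensitive alphanumeric sort, so GND_En_9
// comes before GND_En_10.
Function/S SortedGNDList()
	return SortList(WaveList("GND_En*", ";", ""), ";", 16)
End
```

With the first variant, the calling loop can append each returned name to its own list, so the order is exactly the order in which the files were processed.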
July 8, 2020 at 08:48 am - Permalink
These are all great suggestions!
Making the code more modular is definitely one of the things I'm aiming for so that seems like a good strategy to accomplish that.
I can definitely rely on the start string only happening once and the end string always happening after the start string, so I could refactor the code to only loop through the file once. Thanks for pointing that out! Switching to Grep might be worthwhile since some of the files I'm reading can be quite large at times.
There is one more important thing that I failed to mention regarding the different sections and it's that they have different column structures. Therefore, I will need a different /B flag for the LoadWave command. This is one of the main reasons I ended up structuring my routine the way I did since I wasn't sure if there was a better way. Is there a workaround for this?
Thanks for all the help!
Here are the columnInfoString values I use for the different sections:
Section 1 columnInfoStr
columnInfoStr += "N='_skip_',W=20;"
columnInfoStr += "N='_skip_',W=3;"
columnInfoStr += "N='GND_En_',W=16;"
columnInfoStr += "N='_skip_',W=40;"
Section 2 columnInfoStr
columnInfoStr += "N='_skip_',W=4;"
columnInfoStr += "N='LUMO_En_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=14;"
columnInfoStr += "N='_skip_',W=14;"
columnInfoStr += "N='_skip_',W=14;"
Section 3 columnInfoStr
columnInfoStr += "N='LUMO_POS_',W=7;"
columnInfoStr += "N='OCCU_',W=7;"
columnInfoStr += "N='_skip_',W=15;"
columnInfoStr += "N='_skip_',W=6;"
columnInfoStr += "N='_skip_',W=6;"
columnInfoStr += "N='_skip_',W=7;"
columnInfoStr += "N='_skip_',W=15;"
columnInfoStr += "N='_skip_',W=6;"
columnInfoStr += "N='_skip_',W=6;"
July 8, 2020 at 04:29 pm - Permalink
Your different columnInfoStr strings could be input parameters to the common routine that loads a section of the file.
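As a sketch of that idea: ColumnInfoForSection and LoadSection are hypothetical names, and the column strings are copied from the lists posted above. The numColumns parameter is an assumption, added because the first value in /F={6, 11, 0} of the original call probably has to change with the section layout.

```igor
// Hypothetical helper: returns the /B column description for one section.
Function/S ColumnInfoForSection(section)
	Variable section

	String s = ""
	switch (section)
		case 1:
			s += "N='_skip_',W=20;N='_skip_',W=3;"
			s += "N='GND_En_',W=16;N='_skip_',W=40;"
			break
		case 2:
			s += "N='_skip_',W=4;N='LUMO_En_',W=11;"
			s += "N='_skip_',W=11;N='_skip_',W=11;N='_skip_',W=11;N='_skip_',W=11;"
			s += "N='_skip_',W=14;N='_skip_',W=14;N='_skip_',W=14;"
			break
		case 3:
			s += "N='LUMO_POS_',W=7;N='OCCU_',W=7;"
			s += "N='_skip_',W=15;N='_skip_',W=6;N='_skip_',W=6;N='_skip_',W=7;"
			s += "N='_skip_',W=15;N='_skip_',W=6;N='_skip_',W=6;"
			break
	endswitch
	return s
End

// Hypothetical common loader: the column layout is an input, not hard-coded.
Function LoadSection(pathName, filePath, columnInfoStr, numColumns, firstDataLine, numLines)
	String pathName, filePath, columnInfoStr
	Variable numColumns		// Assumption: number of fixed-width columns in this section
	Variable firstDataLine, numLines

	LoadWave /F={numColumns, 11, 0} /B=columnInfoStr /D /O /K=0 /L={0,firstDataLine,numLines,0,0} /A /P=$pathName filePath
	return V_flag			// Number of waves loaded
End
```

The per-section read-and-load functions then collapse into one loop that looks up the line range and column string for each section and calls LoadSection.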
July 23, 2020 at 07:15 am - Permalink
Thank you so much everyone for the suggestions! I got it working via a combination of the different approaches mentioned in the thread.
July 23, 2020 at 09:56 am - Permalink