Importing Data from DFT Calculations

Hi there,

I'm trying to import data from DFT calculations I've been running. The problem is that the data I want is not always located on the same line, and the files are massive. The beginning of the data can be identified by a couple of keywords (e.g. Rigid spectral shift, NEXAFS), as shown in this excerpt from the original file:

Xray absorption (XAS, NEXAFS) calculation
Core hole found (by occ.) in alpha space, orbital # 6
Core hole located at center C1

Orbital energy core hole = -10.69511 H ( -291.03103 eV)
Rigid spectral shift = 0.00000 eV
Ionization potential = 291.03103 eV

Core -> unocc. excitations (X-ray absorption), dipole only :

E (eV) OSCL oslx osly oslz osc(r2)
---------------------------------------------------------------------------------------
# 1 287.0055 0.007305 0.014191 0.028940 0.000089 0.0000 72.7191
# 2 287.9669 0.000221 0.002439 0.005033 0.000011 0.0000 126.2999
# 3 288.4817 0.004180 -0.010701 -0.021838 -0.000099 0.0000 147.8532
# 4 289.1079 0.000177 0.002519 0.004267 -0.000622 0.0000 128.5743

My current workaround is to copy and paste the portion of the data I want (the first 6 columns below the dashed line) into a separate text file and import that into IGOR. This is tedious, since the original files can run to millions of lines, and I would like to automate the process if possible. An example of the pasted file:

E (eV) OSCL oslx osly oslz osc(r2)
---------------------------------------------------------------------------------------
1 287.0055 0.007305 0.014191 0.028940 0.000089 0.0000 72.7191
2 287.9669 0.000221 0.002439 0.005033 0.000011 0.0000 126.2999
3 288.4817 0.004180 -0.010701 -0.021838 -0.000099 0.0000 147.8532
4 289.1079 0.000177 0.002519 0.004267 -0.000622 0.0000 128.5743

Out of those millions of lines, I only need about 1000 that contain my data. Here is the current import procedure I'm using for my workaround:

#pragma TextEncoding = "UTF-8"
#pragma rtGlobals=3		// Use modern global access method and strict wave access.

Function LoadDFT(DFTdata,pathName)

	String DFTdata	//Desired name of file
	String pathName	//Symbolic path where desired file is present
	String DFTFolder=GetDataFolder(1)		
	String foldername= "root:"+RemoveEnding(DFTData,".out")	//Names folder by taking the file name and removing the .out, necessary for proper file parsing
	String DFTData2=RemoveEnding(DFTData,".out")			//Names the files by taking the file name and removing the .out ending
	String columnInfoStr = " "									//Contains set of names for each column in the .out file
	columnInfoStr += "C=1,F=0,W=3,N='_skip_';"
	columnInfoStr += "C=1,F=0,W=11,N=EnergyH_"+DFTdata2+";"
	columnInfoStr += "C=1,F=0,W=11,N=OS_"+DFTdata2+";"
	columnInfoStr += "C=1,F=0,W=11,N=TDMx_"+DFTdata2+";"
	columnInfoStr += "C=1,F=0,W=11,N=TDMy_"+DFTdata2+";"
	columnInfoStr += "C=1,F=0,W=11,N=TDMz_"+DFTdata2+";"
	columnInfoStr += "C=1,F=0,W=18,N='_skip_';"
	columnInfoStr += "C=1,F=0,W=18,N='_skip_';"
	
	NewDataFolder/O/S $foldername							//Makes a data folder based on the file name and makes it the current folder
	print DFTData											//prints the loaded files
	LoadWave/J/B=columnInfoStr/D/W/E=0/K=0/V={"\t, "," $",1,1}/F={6,1,0}/N/O/Q/P=$pathName DFTData
End


I've attached a much smaller version of the DFT output that I'm trying to import. I'm using Igor 8 in case that helps. Any suggestions? Thanks for your help in advance!

Best,

Vic
example_1.txt (246.45 KB)
Do you know the exact line where loading should start? If yes, you can use the /L flag of LoadWave. If not, you can try to find it via Open, FReadLine, etc. If FReadLine is too slow and the file fits in memory, you can also use FBinRead to read it in one shot.
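For the read-it-in-one-shot route, a minimal sketch might look like the following. The function name FindFirstDataLineInMemory is hypothetical, and the sketch assumes that lines end in LF or CRLF, that the first data row begins with " #   1", and that the data row is not the very first line of the file. It returns the zero-based line number of the first data row, or -1 if the marker is not found.

```igor
Function FindFirstDataLineInMemory(pathName, filePath)
	String pathName		// Name of symbolic path
	String filePath		// Name of file or partial path relative to symbolic path

	Variable refNum
	Open/R/P=$pathName refNum as filePath

	FStatus refNum										// Sets V_logEOF to the file size in bytes
	String contents = PadString("", V_logEOF, 0x20)		// Preallocate a string as long as the file
	FBinRead refNum, contents							// Reads strlen(contents) bytes in one call
	Close refNum

	// Assumption: the marker is anchored to the start of a line by the preceding "\n"
	Variable pos = strsearch(contents, "\n #   1", 0)
	if (pos < 0)
		return -1										// Marker not found
	endif

	// Every line before the match ends in "\n", so the number of
	// "\n"-terminated items up to pos is the zero-based line number
	return ItemsInList(contents[0,pos], "\n")
End
```

The returned value could then be passed as the first-line parameter of /L in LoadWave. Whether this is actually faster than an FReadLine loop depends on the file size and available memory.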
So what you need to know is the start (and perhaps end) position of the data in the file.

You could use Open to open the file, FReadLine in a loop with a counter to find the start position by text comparison, and Close to close the file, then construct a LoadWave command to skip the unneeded lines and load the data.

Alternatively, Grep can be used to find a text marker in the file and return the line through V_startParagraph.
Hmmm, so I tried to modify hrodstein's example for my data set, but I keep getting the message "No data found in file". For some reason, it's not finding the keyword I specify. I've tried using different keywords but none are being registered/found. I think the problem might be due to the buffer or the text strings. Any thoughts?


#pragma rtGlobals=3		// Use modern global access method and strict wave access.

Function FindFirstDataLine(pathName, filePath)
	String pathName		// Name of symbolic path or ""
	String filePath			// Name of file or partial path relative to symbolic path.
 
	Variable refNum
 
	Open/R/P=$pathName refNum as filePath
	
	String buffer, text
	Variable line = 0
 
	do
		FReadLine refNum, buffer
		if (strlen(buffer) == 0)
			Close refNum
			//print "Can't find keyword"
			return -1						// The expected keyword was not found in the file
		endif
		text = buffer[0,1]
		if (CmpStr(text," # ") == 0)		
			Close refNum
			return line + 1					// Success: The next line is the first data line.
			print "Success!"
		endif
		line += 1
	while(1)
 
	return -1		// We will never get here
End

Function LoadDataFile(pathName, filePath, extension)
	String pathName		// Name of symbolic path or "" to display dialog.
	String filePath			// Name of file or "" to display dialog. Can also be full or partial path relative to symbolic path.
	String extension			// e.g., ".dat" for .dat files. "????" for all files.
 
	Variable refNum
 
	// Possibly display Open File dialog.
	if ((strlen(pathName)==0) || (strlen(filePath)==0))
		Open /D /R /P=$pathName /T=(extension) refNum as filePath
		filePath = S_fileName			// S_fileName is set by Open/D
		if (strlen(filePath) == 0)		// User cancelled?
			return -1
		endif
		// filePath is now a full path to the file.
	endif
 
	Variable firstDataLine = FindFirstDataLine(pathName, filePath)
 
	if (firstDataLine < 0)
		Printf "No data found in file %s\r", filePath
		return -1
	endif
 
	LoadWave /J /D /O /E=1 /K=0 /L={0,firstDataLine,1000,2,5} /P=$pathName filePath
 
	return 0
End
The first problem is that this:

text = buffer[0,1]


needs to be changed to this:


text = buffer[0,2]


since you are comparing to " # " (your target string) which is three bytes.
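To see the difference, here is a small sketch that can be run from the command line. The function name DemoSubrange and the sample line are made up for illustration. String subranges in Igor are inclusive at both ends, so [0,1] yields two bytes and [0,2] yields three.

```igor
Function DemoSubrange()
	String buffer = " #   1   287.0055"		// Hypothetical data line for illustration

	Print strlen(buffer[0,1])				// Length of the two-byte subrange
	Print CmpStr(buffer[0,1], " # ")		// Nonzero: a 2-byte string cannot equal a 3-byte target
	Print CmpStr(buffer[0,2], " # ")		// 0: bytes 0 through 2 match the target exactly
End
```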

The next problem is that your target string appears in this line (line 2021, zero-based), which is before your data:

 Core hole found (by occ.) in alpha space, orbital #   6


I changed your FindFirstDataLine function to add this:

String targetString = " #   1"
Variable targetStringLength = strlen(targetString)


Then I changed this:

return line+1					// Success: The next line is the first data line.

to this:

// Print line	// For debugging only
return line					// Success: This is the first data line.


Now it prints the correct line number: 2032 (zero-based)

The next problem is that your file is space-delimited and the LoadWave operation defaults to comma and tab as delimiters. I fixed this by adding a /V flag:

LoadWave /J /D /O /E=1 /K=0 /L={0,firstDataLine,1000,2,5} /V={" ", "", 0, 0} /P=$pathName filePath


With that, it seems to do the right thing. That is, it loads this:
1 287.0055 0.007305 0.014191 0.02894 ...

Your next task is to change FindFirstDataLine to FindFirstAndLastDataLines.

But first, there is another problem. Lines 999 and 1000 of the data look like this:
# 999 377.5523 0.000006 -0.000367 0.000177 -0.000682 0.0000 337.5637
#1000 377.7301 0.000005 -0.000327 0.000165 -0.000623 0.0000 151.3746

Because space is a delimiter, and there is no space after the # character in line 1000, that line appears to LoadWave to have one fewer column than line 999. This causes the wrong data to be loaded starting at line 1000. I will give some thought to how to fix this.
Your file is actually a FORTRAN-style fixed field file and so can be loaded using LoadWave/F instead of LoadWave/J.

I made this change. It also requires using the /B flag to specify the width in bytes of each column of data. /B also lets you name each column and skip whatever columns you don't want to load.

I also morphed FindFirstDataLine into FindFirstLineAndNumLines.

The result successfully loads your example file:


#pragma TextEncoding = "UTF-8"
#pragma rtGlobals=3		// Use modern global access method and strict wave access.

Function FindFirstLineAndNumLines(pathName, filePath, firstDataLine, numDataLines)
	String pathName		// Name of symbolic path or ""
	String filePath			// Name of file or partial path relative to symbolic path
	Variable &firstDataLine	// Pass-by-reference output
	Variable &numDataLines	// Pass-by-reference output

	firstDataLine = -1
	numDataLines = -1
 
	Variable refNum
 
	Open/R/P=$pathName refNum as filePath
 
	String buffer, text
	Variable line = 0
	
	String targetString = " #   1"
	Variable targetStringLength = strlen(targetString)
	
	// Find first line
	do
		FReadLine refNum, buffer
		if (strlen(buffer) == 0)
			Close refNum
			return -1						// The expected keyword was not found in the file
		endif
		text = buffer[0,targetStringLength-1]
		if (CmpStr(text,targetString) == 0)	
			firstDataLine = line
			break							// This is the first data line
		endif
		line += 1
	while(1)
	
	// Find last line
	targetString = " #"
	targetStringLength = strlen(targetString)
	do
		FReadLine refNum, buffer
		if (strlen(buffer) == 0)
			// Ran out of lines - the previous line was the last line of data,
			// and line already holds its number
			break
		endif
		text = buffer[0,targetStringLength-1]
		if (CmpStr(text,targetString) != 0)	// Line does not start with "<space>#"?
			// This is the line after the last data line
			break	
		endif
		line += 1
	while(1)
	
	numDataLines = line - firstDataLine + 1
	
	// Print firstDataLine, numDataLines		// For debugging only

	Close refNum

	return 0		// Success
End
 
Function LoadDataFile(pathName, filePath, extension)
	String pathName		// Name of symbolic path or "" to display dialog.
	String filePath			// Name of file or "" to display dialog. Can also be full or partial path relative to symbolic path.
	String extension			// e.g., ".dat" for .dat files. "????" for all files.
 
	Variable refNum
 
	// Possibly display Open File dialog.
	if ((strlen(pathName)==0) || (strlen(filePath)==0))
		Open /D /R /P=$pathName /T=(extension) refNum as filePath
		filePath = S_fileName			// S_fileName is set by Open/D
		if (strlen(filePath) == 0)		// User cancelled?
			return -1
		endif
		// filePath is now a full path to the file.
	endif
 
	Variable firstDataLine, numLines 
	Variable result = FindFirstLineAndNumLines(pathName, filePath, firstDataLine, numLines)
	if (result != 0)
		Printf "No data found in file %s\r", filePath
		return -1
	endif

	// Example Data:
	// #   1   287.0055  0.007305   0.014191   0.028940   0.000089      0.0000       72.7191
	// # 999   377.5523  0.000006  -0.000367   0.000177  -0.000682      0.0000      337.5637
	// #1000   377.7301  0.000005  -0.000327   0.000165  -0.000623      0.0000      151.3746
	
	String columnInfoStr = ""		// Prepare parameter for /B flag
	columnInfoStr += "N='_skip_',W=2;"
	columnInfoStr += "N='Column1',W=4;"
	columnInfoStr += "N='Column2',W=11;"
	columnInfoStr += "N='Column3',W=10;"
	columnInfoStr += "N='Column4',W=11;"
	columnInfoStr += "N='Column5',W=11;"
	columnInfoStr += "N='_skip_',W=11;"
	columnInfoStr += "N='_skip_',W=12;"
	columnInfoStr += "N='_skip_',W=14;"
	
	LoadWave /F={9, 11, 0} /B=columnInfoStr /D /O /E=1 /K=0 /L={0,firstDataLine,numLines,0,0} /P=$pathName filePath
 
	return 0
End

Function Test()
	LoadDataFile("home", "Sample.txt", ".txt")
End
That worked beautifully! Thank you very much hrodstein for the detailed explanation.

Best wishes,

Vic