HTML table to text and numeric waves
ajleenheer
Currently I'm doing the following workflow which works for one specific webpage but is pretty kludgy and probably not very general.
S_ServerResponse=ReplaceString("<td>",S_ServerResponse,";")
or S_ServerResponse=ReplaceString(" align=\"center\"",S_ServerResponse,"")
etc. etc.StringFromList(j,StringFromList(i,S_ServerResponse,"|"),";")
)Now I have a 2-D text wave with the HTML table contents. Next I'd like to convert each column to the appropriate kind of wave, either text, numeric, or date/time. Currently I'm exporting the 2D table to a CSV file using Save, then re-importing it using LoadWave such that Igor automatically determines the column type. Is there a cleaner way to do that without the save/load?
I can't think of any other automatic way to do this. I think your approach is actually pretty clever.
Is there any chance you can get the data from the web server as JSON?
October 6, 2016 at 09:30 am - Permalink
//Example usage: ParseHTMLTable(FetchURL("http://myurlwithtable.com"))
//This attempts to load the last table in a webpage into IGOR waves. Doesn't work with nested tables.
string htmltext
htmltext=ReplaceString("\n",htmltext,"")//delete any line breaks
//Get the last table: create a list where each "<table" entry is separated by ^
htmltext=ReplaceString("<table",htmltext,"^<table")
htmltext=StringFromList(ItemsInList(htmltext,"^")-1,htmltext,"^") //Assuming the table of interest is the last table on the webpage
//Determine if the table has headers: any <th> tags?
variable headerexists=StringMatch(htmltext,"*<th*")
//Create a list based on tr and td and th; rows are separated by ~ and columns are separated by `
htmltext=ReplaceString("</tr>",htmltext,"~")
htmltext=ReplaceString("</td>",htmltext,"`")
htmltext=ReplaceString("</th>",htmltext,"`")
variable nrows =ItemsInList(htmltext,"~")-1 //subtract 1 because there's always a dummy row containing </table>
variable ncols=ItemsInList(StringFromList(0,htmltext,"~"),"`")
Print "Importing HTML table with "+num2str(nrows)+" rows and "+num2str(ncols)+" columns"
//Put the (messy with HTML code) items into a text wave
Make/O/T/N=(nrows,ncols) htmlTable=StringFromList(q,StringFromList(p,htmltext,"~"),"`") //each table row goes into a row of textWave
//Clean up each item of the text wave in turn, given that each entry is proceeded by >
variable i,j
variable arrowposition=0
string cellstring=""
for (i=0;i<nrows;i+=1)
for(j=0;j<ncols;j+=1)
cellstring=htmlTable[i][j]
arrowposition=strsearch(cellstring,">",inf,1)
htmlTable[i][j]=cellstring[arrowposition+1,strlen(cellstring)]
endfor
endfor
//Create a string containing the wave names based on the headers if they exist
string wavenames=""
for(i=0;i<ncols;i+=1)
if(headerexists)
wavenames=wavenames+"N='"+CleanupName(htmltable[0][i],1)+"';"
else
wavenames=wavenames+"N='htmlColumn"+num2str(i+1)+"';"
endif
endfor
if(headerexists)
Print "Getting wave names from table header."
DeletePoints/M=0 0,1,htmlTable //remove the header row now that we know the names
endif
Edit htmltable //display the text table
//Rather than messing about with column types, let's just save & reload to let Igor worry about column types
Save/J/O/P=IgorUserFiles htmlTable as "HTMLtable.txt" //delimiter is tab; assumes table has no tabs embedded
LoadWave/E=1/Q/O/A/J/V={"\t"," $",0,0}/D/B=waveNames/K=0/P=IgorUserFiles "HTMLtable.txt" //set E=0 if the loaded waves should not be displayed in a table
end
October 10, 2016 at 10:43 am - Permalink
--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAH
October 10, 2016 at 04:30 pm - Permalink
You might get better responses if you include a link to the source HTML that you're processing, and if you clarified exactly what it is that you want the output to look like. Reading only your code, it's hard to get an idea of what the source data looks like.
October 11, 2016 at 05:59 am - Permalink
I've attached an example HTML file similar to what I'm processing that contains some nonsense data. For output, I want each column loaded into Igor as the appropriate kind of 1D wave. My code works relatively well on this example, other than the column with values separated by commas. I've only tested it on a couple other webpages. To load the example file, run from the command line:
string buffer
Open/R refNumHTML as ""
FReadLine/T="" refNumHTML, buffer
Close/A
ParseHTMLTable(buffer)
October 11, 2016 at 10:48 am - Permalink
October 11, 2016 at 12:03 pm - Permalink
October 11, 2016 at 02:39 pm - Permalink