Following on from part 1, here is an example of using HTQL to pull data from a table on a webpage.
We’ll use the Wikipedia list of most expensive football transfers as our source web page. You can check out the list here. On viewing the page and the HTML source you’ll see that the first row of the table is a header row and that the “player”, “from” and “to” columns contain quite a bit of HTML in order to provide a link to the player/team and a graphical link to their country. Our HTQL will need to cut through this to just get the data that we want.
Using the HTQL browser object
In part 1 we just queried some HTML in a variable. In this case we’ll use the browser object to get the HTML from the actual web page and run a query against that.
- Create a new python file – e.g. htql-2-wiki.py
- Edit the file in IDLE and enter the following code:
import htql; # init a browser and go to the wikipedia page b=htql.Browser(); p=b.goUrl("http://en.wikipedia.org/wiki/List_of_most_expensive_association_football_transfers"); # setup the HTQL pattern to query that page pattern="<table>1.<tr>2-0{" pattern+="player=<td>2.<a>2:TX;" pattern+="from=<td>3<a>2:TX;" pattern+="to=<td>4<a>2:TX;" pattern+="fee=<td>5:TX;" pattern+="year=<td>7.<a>1:TX}" # run the query against the page and loop thru the results for r in htql.HTQL(p[0], pattern): print(r);
- A brief run through of the code:
- The first part of the code instantiates a browser object and uses the goUrl method to go to the page
- The second part specifies the query that we’re going to run against the page. The query is built up over multiple lines and says: find me the first table on the page (<table>1); then for the second to the last row of that table (.<tr>2-0); create tuples of the data (indicated by the curly braces) where player = the text within the second hyperlink in the second column (player=<td>2.<a>2:TX); and so on.
- More on the query syntax in a later post!
- The final part of the code runs the query against the page, loops over the tuples returned and prints them to screen.
- If you run the code (F5) you should see the following output:
Outputting the results to a CSV
As well as outputting the results to screen, we can output to a CSV.
- Amend the final part of your Python program as follows:
: # open a file to put the data in file = open("transfers.csv", "wb") # run the query against the page and loop thru the results for r in htql.HTQL(p[0], pattern): print(r); file.write("\"" + r[0] + "\",") file.write("\"" + r[1] + "\",") file.write("\"" + r[2] + "\",") file.write("\"" + r[3] + "\",") file.write("\"" + r[4] + "\"\n") #close the file file.close()
- You’ll note the file open and close and the addition of file.write(…) lines within the loop.
- Run this version and see what you get! You should see a transfers.csv file in the same folder as the Python program.