HTQL | Bitfern

Using the HTQL browser object

In part 1 we just queried some HTML in a variable. In this case we’ll use the browser object to get the HTML from the actual web page and run a query against that.

Create a new Python file – e.g. htql-2-wiki.py

Edit the file in IDLE and enter the following code:

import htql;

# init a browser and go to the wikipedia page
b=htql.Browser(); 
p=b.goUrl("http://en.wikipedia.org/wiki/List_of_most_expensive_association_football_transfers");

# setup the HTQL pattern to query that page
pattern="<table>1.<tr>2-0{"
pattern+="player=<td>2.<a>2:TX;"
pattern+="from=<td>3.<a>2:TX;"
pattern+="to=<td>4.<a>2:TX;"
pattern+="fee=<td>5:TX;"
pattern+="year=<td>7.<a>1:TX}"

# run the query against the page and loop thru the results
for r in htql.HTQL(p[0], pattern):
    print(r);

A brief run through of the code:

- The first part of the code instantiates a browser object and uses the goUrl method to go to the page.
- The second part specifies the query to run against the page. It is built up over several lines and says: find the first table on the page (<table>1); for the second through to the last row of that table (.<tr>2-0), create a tuple of data (indicated by the curly braces) in which player is the text within the second hyperlink in the second column (player=<td>2.<a>2:TX), and so on for the other fields.
- More on the query syntax in a later post!
- The final part of the code runs the query against the page, loops over the tuples returned and prints them to screen.
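To make the shape of the query easier to see, here is a small sketch that assembles a query of the same form from a list of (name, path) pairs. The helper name build_pattern is made up for this illustration and is not part of HTQL; it only does string joining, so it runs without HTQL installed. Every field here uses the dotted tag.tag form.

```python
def build_pattern(row_selector, fields):
    """Join (name, path) pairs into one HTQL tuple query string.

    build_pattern is a hypothetical helper for illustration only.
    """
    parts = ["%s=%s" % (name, path) for name, path in fields]
    # Fields are separated by ';' and wrapped in curly braces.
    return row_selector + "{" + ";".join(parts) + "}"


fields = [
    ("player", "<td>2.<a>2:TX"),
    ("from", "<td>3.<a>2:TX"),
    ("to", "<td>4.<a>2:TX"),
    ("fee", "<td>5:TX"),
    ("year", "<td>7.<a>1:TX"),
]

pattern = build_pattern("<table>1.<tr>2-0", fields)
print(pattern)
```

Keeping the fields in a list makes it easy to add or reorder columns without retyping the separators.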

If you run the code (F5) you should see one tuple printed per table row, each holding the player, from, to, fee and year values.

Outputting the results to a CSV

As well as outputting the results to screen, we can output to a CSV.

Amend the final part of your Python program as follows:


# open a file to put the data in
file = open("transfers.csv", "wb")

# run the query against the page and loop thru the results
for r in htql.HTQL(p[0], pattern):
    print(r);
    file.write("\"" + r[0] + "\",")
    file.write("\"" + r[1] + "\",")
    file.write("\"" + r[2] + "\",")
    file.write("\"" + r[3] + "\",")
    file.write("\"" + r[4] + "\"\n")

#close the file    
file.close()

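You’ll note the file open and close, and the addition of the file.write(…) lines within the loop. Each write wraps one value in double quotes; the first four are followed by a comma and the last by a newline. Note that this does not escape any double quotes inside the values themselves.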

Run this version and see what you get! You should see a transfers.csv file in the same folder as the Python program.
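As an alternative sketch, Python's standard csv module can take care of the quoting, including any double quotes inside a value. The helper names open_csv_for_writing and write_rows are made up for this illustration, and the single row below is a placeholder rather than real transfer data; in the real program the rows would come from htql.HTQL(p[0], pattern). The file is opened differently on Python 2 and Python 3 because the csv module expects binary mode on the former and text mode with newline="" on the latter.

```python
import csv
import os
import sys
import tempfile


def open_csv_for_writing(path):
    """Open path the way the csv module expects on this Python version."""
    if sys.version_info[0] == 2:
        return open(path, "wb")          # Python 2: binary mode
    return open(path, "w", newline="")   # Python 3: text mode, no newline translation


def write_rows(path, rows):
    """Write every row to path with all values quoted."""
    out = open_csv_for_writing(path)
    try:
        writer = csv.writer(out, quoting=csv.QUOTE_ALL)
        for row in rows:
            writer.writerow(row)
    finally:
        out.close()


# Placeholder row for illustration only, not real data.
sample_rows = [("Player A", "Club X", "Club Y", 'Fee with "quotes"', "2000")]

# Written to the temp folder here; use "transfers.csv" in the real program.
sample_path = os.path.join(tempfile.gettempdir(), "transfers_sample.csv")
write_rows(sample_path, sample_rows)
```

With csv.QUOTE_ALL every value is wrapped in double quotes, as in the hand-written version, and embedded quotes are doubled so the file can be read back cleanly.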

HTQL – Hyper-Text Query Language – is a language for querying and extracting content from HTML pages. If SQL is a language to get data from tables within a database, then HTQL is a language to get data from webpages on the internet. It is useful when you need to pull data from the web and there is no web service available to use. An example might be to pull population statistics from Wikipedia.

Note that the example below uses Python 2.7.3 and HTQL for Python 2.7. I’ve done this because I’m ultimately deploying code to a Microsoft Azure website, which only supported Python 2.7.3 or 3.4.0 at the time of writing, and unfortunately there isn’t a version of HTQL for Python 3.4.0! You may want to consider developing with Python 3.3.5 and HTQL for Python 3.3.

Download and install Python

Download Python (I downloaded the 32-bit MSI installer):
https://www.python.org/downloads, or directly:
https://www.python.org/ftp/python/2.7.3/python-2.7.3.msi
Run the installer, letting it add Python to your path if prompted (if you are not prompted for this then you probably have an existing Python install and may want to consider whether or not to update your path manually).

Check Python

If you want to double check that Python is working then you can create and run a simple program to output the Python version as follows:

Set up a folder for development
Create a new file in that folder called test.py
Right click that file and you should be able to select “Edit with IDLE” (IDLE is a basic IDE for Python development). You should see a blank editor window.

Enter the following code:

import sys
sys.stdout.write("hello from Python %s\n" % (sys.version,))

Then hit F5 to run the program (or select Run > Run Module from the menu)
A second window should open – the Python Shell – and your program should run, printing the greeting followed by the Python version string.

Download and install HTQL

Next we’ll download and install HTQL for Python:

Download the HTQL Python library:
http://htql.net/, or directly:
http://htql.net/Python27/htql.zip
You’ll probably want to grab the HTQL manual and the HTQL Python manual whilst you’re there. The first of these describes HTQL itself, the second describes using HTQL within a Python program.
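There is only one file in the zip – htql.pyd – extract this file and place a copy in <install location>\Python27\Libs (where <install location> is the location in which you installed Python in the steps above). Python can only import the module if the folder holding htql.pyd is on its module search path (sys.path).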
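If you want to confirm that Python can find the library before going further, a quick check along these lines may help. This is a sketch: the function name htql_available is made up for illustration, and the script only relies on the standard import mechanism and sys.path.

```python
import sys


def htql_available():
    """Return True if `import htql` succeeds in this interpreter."""
    try:
        import htql  # noqa: F401 (only testing that the import works)
    except ImportError:
        return False
    return True


if htql_available():
    import htql
    # Extension modules normally record the file they were loaded from.
    print("htql found at %s" % getattr(htql, "__file__", "(location unknown)"))
else:
    print("htql not found; Python searched these folders:")
    for folder in sys.path:
        print("  %s" % folder)
```

If the import fails, the printed list shows where Python looked, which should make it clearer where htql.pyd needs to go.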

Run a simple HTQL script

You can check that HTQL is working using the basic example from the HTQL Python manual, as described below. Note that this example doesn’t go out to a webpage; it just parses a chunk of HTML that is set up in a local variable.

Create a new Python file – e.g. htql-1.py

Edit the file in IDLE and enter the following code:

import htql;
page="<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>";
query="<a>:href,tx";
for url, text in htql.HTQL(page, query):
     print(url, text);

The code above does the following:
- Pulls in the HTQL library – import htql;
- Sets up a dummy HTML page with three links / anchors.
- Specifies an HTQL query which says “for all of the anchor <a> tags in the HTML, give me the url / href and the text within the anchor” (tx = the text between the start and end tags).
- Loops through the results of running that query against the dummy page and prints out each url and text found.
If you run the code (hit F5) you should see each of the three URLs printed alongside its link text.
Success!
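For comparison, here is a sketch of the same extraction done with Python's standard-library HTML parser instead of HTQL. It is not how HTQL works internally; it only illustrates what the query <a>:href,tx is asking for, namely the href and the text of each anchor. The class name AnchorCollector is made up for this example.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag fed to the parser."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self._in_anchor = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a> and </a>.
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text)))
            self._in_anchor = False


page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
collector = AnchorCollector()
collector.feed(page)
collector.close()

for url, text in collector.links:
    print("%s %s" % (url, text))
```

The HTQL version expresses all of this in the single query string, which is the main appeal of the library.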


Intro to HTQL with Python (2)