
11. The art of Web-scraping: Parsing HTML with Beautiful Soup

Web-scraping is a technique for extracting data from Web pages. If everyone on the World Wide Web used valid XHTML, this would be easy. However, in the real world, the vast majority of Web pages use something you could call tag soup—theoretically HTML, but in reality often an unstructured mixture of tags and text.

Fortunately, the lxml.html package includes a module named soupparser, which uses the BeautifulSoup library to translate tag soup into a tree just as if it came from a valid XHTML page. Naturally this process is not perfect, but there is a very good chance that the resulting tree will have enough predictable structure to allow automated extraction of the information in it.

Import the soupparser module like this:

from lxml.html import soupparser

There are two functions in this module.

soupparser.parse(input)

The input argument specifies a Web page's HTML source, either as a file name or as a file-like object. The return value is an ElementTree instance whose root is an Element instance representing the html element.
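
Here is a minimal sketch of using parse() on a saved page. The file name "page.html" is hypothetical; any HTML file, however messy, should do. Standard ElementTree navigation then works on the result.

from lxml.html import soupparser

tree = soupparser.parse("page.html")   # "page.html" is a hypothetical file name
root = tree.getroot()                  # the root is an Element for the html element
print(root.tag)                        # prints 'html'

# Example of extracting data: list the target and text of every link.
for a in root.iter("a"):
    print(a.get("href"), a.text)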

soupparser.fromstring(s)

The s argument is a string containing tag soup. The return value is a tree of nodes representing s; the root of this tree is always an Element instance representing an html element.
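
As a sketch, here is fromstring() applied to a made-up fragment of invalid markup. Even though the input has unclosed tags, the result comes back as a well-formed tree rooted at an html element.

from lxml import etree
from lxml.html import soupparser

soup = "<p>Unclosed paragraph<b>bold text<p>Another paragraph"
root = soupparser.fromstring(soup)

print(root.tag)              # prints 'html'
print(etree.tostring(root))  # serializes the repaired, well-formed tree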