
10. XPath processing

One of the most significant advantages of the lxml package over the other ElementTree-style packages is its support for the full XPath language. XPath expressions give you a much more powerful mechanism for selecting and retrieving parts of a document than the relatively simple “path” syntax used in Section 8.1, “ElementTree.find()”.

If you are not familiar with XPath, see these sources:

Keep in mind that every XPath expression is evaluated using three items of context: the context node, the context position, and the context size.

You can evaluate an XPath expression s by using the .xpath(s) method on either an Element instance or an ElementTree instance. See Section 9.21, “Element.xpath(): Evaluate an XPath expression” and Section 8.6, “ElementTree.xpath(): Evaluate an XPath expression”.
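
As a minimal sketch of the two entry points, the following evaluates one expression on an Element and another on an ElementTree. The sample document and the variable names are made up for illustration; only etree.parse(), .getroot(), and .xpath() come from lxml.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = b"<doc><title>Intro</title><para>One</para><para>Two</para></doc>"

tree = etree.parse(BytesIO(xml))   # an ElementTree instance
root = tree.getroot()              # the root Element, <doc>

# Evaluated on an Element, a relative path starts at that element.
paras_from_element = root.xpath("para")

# Evaluated on an ElementTree, an absolute path starts at the document root.
paras_from_tree = tree.xpath("/doc/para")

print([p.text for p in paras_from_element])
print([p.text for p in paras_from_tree])
```

Both calls select the same two para elements here; which one is more convenient depends on whether you already hold the tree or one of its elements.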

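Depending on the XPath expression you use, these .xpath() methods may return one of several kinds of values: a Boolean (True or False), a float for numeric results, a string, or a list of items such as Element instances and strings.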
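
The sketch below shows how the type of the result follows from the expression. The sample document is hypothetical; count() and string() are standard XPath 1.0 functions.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
root = etree.fromstring("<list><item n='1'>a</item><item n='2'>b</item></list>")

elements = root.xpath("item")            # node-set of elements: list of Element
count = root.xpath("count(item)")        # numeric result: a Python float
has_two = root.xpath("count(item) = 2")  # Boolean result: True or False
first = root.xpath("string(item[1])")    # string result: a str (sub)class
texts = root.xpath("item/text()")        # text nodes: list of strings
attrs = root.xpath("item/@n")            # attribute values: list of strings

for value in (elements, count, has_two, first, texts, attrs):
    print(type(value).__name__, value)
```

Note that a numeric result is a float even when the value is a whole number, so convert it with int() if you need an integer.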

For further information on lxml's XPath features, see XML Path Language (XPath).

10.1. An XPath example

Here is an example of a situation where an XPath expression can save you a lot of work. Suppose you have a document with an element called para that represents a paragraph of text. Further suppose that your para has a mixed-content model, so its content is a free mixture of text and several kinds of inline markup. Your application, however, needs to extract just the text in the paragraph, discarding any and all tags.

Using just the classic ElementTree interface, this would require you to write some kind of function that recursively walks the para element and its subtree, extracting the .text and .tail attributes at each level and eventually gluing them all together.
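
For comparison, here is one way such a recursive function might look. This is only a sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for the classic interface, and the name collect_text and the sample paragraph are invented for illustration.

```python
import xml.etree.ElementTree as ET

def collect_text(elem):
    """Return all text inside elem, in document order, without tags."""
    # Text that precedes the first child (or all the text, if no children).
    parts = [elem.text or ""]
    for child in elem:
        # Everything inside the child, gathered recursively...
        parts.append(collect_text(child))
        # ...followed by the text between this child and the next sibling.
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical mixed-content paragraph.
para = ET.fromstring(
    "<para>Some <b>bold</b> and <i>nested <b>bold</b></i> text.</para>")
print(collect_text(para))   # Some bold and nested bold text.
```

The "or ''" guards are needed because .text and .tail are None when there is no text in that position.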

However, there is a relatively simple XPath expression that does all this for you:

descendant-or-self::text()

The “descendant-or-self::” is an axis selector that limits the search to the context node, its children, their children, and so on out to the leaves of the tree. The “text()” node test selects only text nodes, discarding any elements, comments, and other non-textual content. The return value is a list of strings.

Here's an example of this expression in practice.

>>> from lxml import etree
>>> node = etree.fromstring('''<a>
...   a-text <b>b-text</b> b-tail <c>c-text</c> c-tail
... </a>''')
>>> alltext = node.xpath('descendant-or-self::text()')
>>> alltext
['\n  a-text ', 'b-text', ' b-tail ', 'c-text', ' c-tail\n']
>>> clump = "".join(alltext)
>>> clump
'\n  a-text b-text b-tail c-text c-tail\n'
>>>
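
To illustrate the point that comments are discarded, here is a small sketch with a hypothetical para element that contains a comment. It also compares the joined result with the standard XPath string() function, which returns the concatenated text of the context node directly.

```python
from lxml import etree

# Hypothetical paragraph with inline markup and a comment.
para = etree.fromstring(
    "<para>Keep <!-- drop this --><em>this</em> text.</para>")

# Collect every text node at or below para and glue them together.
pieces = para.xpath("descendant-or-self::text()")
joined = "".join(pieces)

# string() yields the same text as a single string instead of a list.
as_string = para.xpath("string()")

print(joined)
print(joined == as_string)
```

The list form is useful when you want to process the pieces individually; string() is shorter when all you need is the combined text.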