One of the most significant advantages of the lxml package
over the other ElementTree-style packages is its support for the
full XPath language. XPath expressions give you a much
more powerful mechanism for selecting and retrieving parts
of a document, compared to the relatively simple
“path” syntax used in Section 8.1, “ElementTree.find()”.
If you are not familiar with XPath, see these sources:
XSLT reference, specifically the section entitled “XPath reference”.
The standard, XML Path Language (XPath), Version 1.0.
Keep in mind that every XPath expression is evaluated using three items of context:
The context node is the starting point for any operations whose meaning is relative to some point in the tree.
The context size is the number of nodes in the node-set currently being evaluated. For a simple child step, this is the context node plus whichever of its siblings the step selects.
The context position is the context node's position within that node-set, counting the first node as position 1.
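As a rough illustration of context position and context size, the sketch below uses the standard XPath 1.0 functions position() and last() inside predicates. The sample document and the variable names are made up for this example.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
root = etree.fromstring(
    "<list><item>a</item><item>b</item><item>c</item></list>"
)

# position() yields the context position of each candidate node (1-based).
second = root.xpath("item[position() = 2]")

# last() yields the context size, so this predicate keeps the final item.
final = root.xpath("item[position() = last()]")

second_text = second[0].text
final_text = final[0].text
print(second_text, final_text)
```

Here the root element is the context node for the expression as a whole, while each item element in turn becomes the context node when the predicate is evaluated.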
You can evaluate an XPath expression by using the .xpath() method on
either an Element instance or an ElementTree instance. See Section 9.21, “Element.xpath(): Evaluate an XPath
expression” and Section 8.6, “ElementTree.xpath(): Evaluate an
XPath expression”.
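A minimal sketch of both calls follows. The sample document is hypothetical; the point is only that the same query can be issued from an Element (relative to that element) or from an ElementTree (here with an absolute path).

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
root = etree.fromstring("<doc><item>one</item><item>two</item></doc>")
tree = etree.ElementTree(root)

# Element.xpath(): a relative expression uses the element as the context node.
items_from_element = root.xpath("item")

# ElementTree.xpath(): an absolute path evaluated against the whole document.
items_from_tree = tree.xpath("/doc/item")

element_texts = [item.text for item in items_from_element]
tree_texts = [item.text for item in items_from_tree]
print(element_texts)
print(tree_texts)
```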
Depending on the XPath expression you use, these .xpath() methods may return one of several kinds
of values:
For expressions that return a Boolean value, the .xpath() method will return True or False.
Expressions with a numeric result will return a Python
float (never an
int).
Expressions with a string result will return a Python
str (string) or unicode
value.
Expressions that produce a list of values, such as
node-sets, will return a Python list.
Elements of this list may in turn be any of several
types:
Elements, comments, and processing instructions
will be represented as lxml Element,
Comment, and ProcessingInstruction instances.
Text content and attribute values are returned as
Python str (string) instances.
Namespace declarations are returned as a two-tuple
(prefix, namespaceURI).
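The return types listed above can be checked with a short experiment. This is a sketch against a made-up document; the namespace URI and attribute value are placeholders, and the order of the namespace tuples is not assumed.

```python
from lxml import etree

# Hypothetical sample document; the namespace URI is a placeholder.
root = etree.fromstring(
    '<doc xmlns:x="http://example.com/ns" id="d1">'
    "<!-- note --><item>one</item><item>two</item></doc>"
)

has_many = root.xpath("count(item) > 1")    # Boolean result
how_many = root.xpath("count(item)")        # numeric result (a float)
first_text = root.xpath("string(item[1])")  # string result

items = root.xpath("item")                  # list of Element instances
ids = root.xpath("@id")                     # list of attribute value strings
comments = root.xpath("comment()")          # list of Comment instances
namespaces = root.xpath("namespace::*")     # list of (prefix, URI) tuples

print(type(has_many), type(how_many), first_text)
print([item.tag for item in items], ids, comments[0].text)
print(namespaces)
```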
For further information on lxml's XPath features, see XML Path Language
(XPath).
Here is an example of a situation where an XPath
expression can save you a lot of work. Suppose you have a
document with an element called para that
represents a paragraph of text. Further suppose that your
para has a mixed-content model, so its
content is a free mixture of text and several kinds of
inline markup. Your application, however, needs to extract
just the text in the paragraph, discarding any and all
tags.
Using just the classic ElementTree interface, this would require
you to write some kind of function that recursively walks
the para element and its subtree, extracting
the .text and .tail
attributes at each level and eventually gluing them all
together.
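For comparison, here is one way such a recursive function might look, sketched with the standard library's xml.etree.ElementTree. The function name collect_text and the sample para content are invented for this example.

```python
import xml.etree.ElementTree as ET


def collect_text(elem):
    """Return all text inside elem, in document order, with tags discarded."""
    parts = []
    if elem.text:
        parts.append(elem.text)          # text before the first child
    for child in elem:
        parts.append(collect_text(child))  # text inside the child's subtree
        if child.tail:
            parts.append(child.tail)     # text following the child's end tag
    return "".join(parts)


# Hypothetical mixed-content paragraph.
para = ET.fromstring(
    "<para>Plain <b>bold</b> and <i>italic <b>both</b></i> text.</para>"
)
text = collect_text(para)
print(text)
```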
However, there is a relatively simple XPath expression that does all this for you:
descendant-or-self::text()
The “descendant-or-self::” is an
axis selector that limits the search to the context node,
its children, their children, and so on out to the leaves
of the tree. The “text()”
node test selects only text nodes, discarding any elements,
comments, and other non-textual content. The return
value is a list of strings.
Here's an example of this expression in practice.
>>> from lxml import etree
>>> node = etree.fromstring('''<a>
...  a-text <b>b-text</b> b-tail <c>c-text</c> c-tail
... </a>''')
>>> alltext = node.xpath('descendant-or-self::text()')
>>> alltext
['\n a-text ', 'b-text', ' b-tail ', 'c-text', ' c-tail\n']
>>> clump = "".join(alltext)
>>> clump
'\n a-text b-text b-tail c-text c-tail\n'
>>>
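The same expression also copes with nested inline markup and skips comments. The following sketch applies it to a made-up para element; the element names and content are illustrative only.

```python
from lxml import etree

# Hypothetical mixed-content paragraph with nested markup and a comment.
para = etree.fromstring(
    "<para>Plain <b>bold</b><!-- hidden --> and "
    "<i>italic <b>both</b></i> text.</para>"
)

# Collect every text node at or below para, in document order.
pieces = para.xpath("descendant-or-self::text()")
text = "".join(pieces)
print(text)
```

The comment's own content is not a text node, so it does not appear in the result, while the text that follows the comment is kept.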