2. How ElementTree represents XML

If you have done XML work using the Document Object Model (DOM), you will find that the lxml package represents documents as trees in quite a different way. In the DOM, trees are built out of nodes represented as Node instances. Some nodes are Element instances, representing whole elements. Each Element has an assortment of associated nodes of various types: Element nodes for its element children; Attribute nodes for its attributes; and Text nodes for textual content.

Here is a small fragment of XHTML, and its representation as a DOM tree:

<p>To find out <em>more</em>, see the
<a href="http://www.w3.org/XML">standard</a>.</p>

The above diagram shows the conceptual structure of the XML. The lxml view of an XML document, by contrast, builds a tree of only one node type: the Element.
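For comparison, the DOM view of this fragment can be inspected with Python's standard xml.dom.minidom module. This is only an illustrative sketch: minidom is not part of lxml, and the names FRAGMENT and p are chosen here for the example.

```python
from xml.dom import minidom

# The XHTML fragment from the text, including its embedded newline.
FRAGMENT = (
    '<p>To find out <em>more</em>, see the\n'
    '<a href="http://www.w3.org/XML">standard</a>.</p>'
)

doc = minidom.parseString(FRAGMENT)
p = doc.documentElement

# In the DOM, text lives in separate Text nodes that are children of p,
# interleaved with the Element nodes for em and a.
for child in p.childNodes:
    if child.nodeType == child.TEXT_NODE:
        print("Text   ", repr(child.data))
    else:
        print("Element", child.tagName)
```

The p element ends up with both Text and Element children, which is the mixture of node types that the lxml view avoids.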

The main difference between the ElementTree view used in lxml and the classical DOM view is how text is associated with elements: lxml has no separate text nodes, and instead stores text in attributes of the Element instances themselves.

An instance of lxml's Element class contains these attributes:

.tag

The name of the element, such as "p" for a paragraph or "em" for emphasis.

.text

The text inside the element, if any, up to the first child element. This attribute is None if the element is empty or has no text before the first child element.

.tail

The text following the element, up to the next element or the end of the parent; None if there is no such text. This is the most unusual departure from the DOM. In the DOM, any text following an element E is associated with the parent of E; in lxml, that text is considered the “tail” of E.

.attrib

A dictionary-like object mapping the element's XML attribute names to their corresponding values. For example, for the element “<h2 class="arch" id="N15">”, the contents of that element's .attrib would be equivalent to the dictionary “{"class": "arch", "id": "N15"}”.

(element children)

To access sub-elements, treat an element as a list. For example, if node is an Element instance, node[0] is the first sub-element of node. If node doesn't have any sub-elements, this operation will raise an IndexError exception.

You can find out the number of sub-elements using the len() function. For example, if node has five children, len(node) will return a value of 5.
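The attributes and list-style access described above can be tried out with a short sketch like the following. The content of the h2 element is made up for illustration, and the fallback to the standard library's xml.etree.ElementTree (which offers the same attributes for this purpose) is an assumption made so the example runs even where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree

# Hypothetical element: only the start tag comes from the text above.
node = etree.fromstring('<h2 class="arch" id="N15">Arches <em>and</em> vaults</h2>')

print(node.tag)            # the element's name
print(dict(node.attrib))   # attribute names mapped to values
print(len(node))           # number of sub-elements
print(node[0].tag)         # name of the first sub-element

# Indexing an element that has no sub-elements raises IndexError.
em = node[0]
try:
    em[0]
    has_children = True
except IndexError:
    has_children = False
print(has_children)
```

Checking len(node) before indexing is a simple way to avoid the IndexError.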

One advantage of the lxml view is that a tree is now made of only one type of node: each node is an Element instance. Here is our XML fragment again, and a picture of its representation in lxml.

<p>To find out <em>more</em>, see the
<a href="http://www.w3.org/XML">standard</a>.</p>

Notice that in the lxml view, the text ", see the\n" (which includes the newline) is contained in the .tail attribute of the em element, not associated with the p element as it would be in the DOM view. Also, the "." at the end of the paragraph is in the .tail attribute of the a (link) element.
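The placement of text can be checked directly, as in this minimal sketch. The names FRAGMENT, p, em and a are chosen here for the example, and the fallback to xml.etree.ElementTree is an assumption for systems without lxml; both libraries expose .text and .tail in the same way.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library's ElementTree.
    import xml.etree.ElementTree as etree

# The XHTML fragment from the text, including its embedded newline.
FRAGMENT = (
    '<p>To find out <em>more</em>, see the\n'
    '<a href="http://www.w3.org/XML">standard</a>.</p>'
)

p = etree.fromstring(FRAGMENT)
em, a = p[0], p[1]

print(repr(p.text))    # 'To find out '
print(repr(em.text))   # 'more'
print(repr(em.tail))   # ', see the\n'
print(repr(a.text))    # 'standard'
print(repr(a.tail))    # '.'
print(repr(p.tail))    # None: nothing follows the root element

# itertext() walks .text and .tail values in document order.
full_text = "".join(p.itertext())
print(repr(full_text))
```

Joining the pieces in document order recovers the paragraph's full text, which shows that no text is lost even though it is spread across several .text and .tail attributes.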

Now that you know how XML is represented in lxml, we can turn to its three general application areas.