Abstract

Describes the lxml package for reading and writing XML files with the Python programming language.

This publication is available in Web form and also as a PDF document. Please forward any comments to tcc-doc@nmt.edu.

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

Table of Contents

1. Introduction: Python and XML
2. How ElementTree represents XML
3. Reading an XML document
4. Handling multiple namespaces
4.1. Glossary of namespace terms
4.2. The syntax of multi-namespace documents
4.3. Namespace maps
5. Creating a new XML document
6. Modifying an existing XML document
7. Features of the etree module
7.1. The Comment() constructor
7.2. The Element() constructor
7.3. The ElementTree() constructor
7.4. The fromstring() function: Create an element from a string
7.5. The parse() function: build an ElementTree from a file
7.6. The ProcessingInstruction() constructor
7.7. The QName() constructor
7.8. The SubElement() constructor
7.9. The tostring() function: Serialize as XML
7.10. The XMLID() function: Convert text to XML with a dictionary of id values
8. class ElementTree: A complete XML document
8.1. ElementTree.find()
8.2. ElementTree.findall(): Find matching elements
8.3. ElementTree.findtext(): Retrieve the text content from an element
8.4. ElementTree.getiterator(): Make an iterator
8.5. ElementTree.getroot(): Find the root element
8.6. ElementTree.xpath(): Evaluate an XPath expression
8.7. ElementTree.write(): Translate back to XML
9. class Element: One element in the tree
9.1. Attributes of an Element instance
9.2. Accessing the list of child elements
9.3. Element.append(): Add a new element child
9.4. Element.clear(): Make an element empty
9.5. Element.find(): Find a matching sub-element
9.6. Element.findall(): Find all matching sub-elements
9.7. Element.findtext(): Extract text content
9.8. Element.get(): Retrieve an attribute value with defaulting
9.9. Element.getchildren(): Get element children
9.10. Element.getiterator(): Make an iterator to walk a subtree
9.11. Element.getroottree(): Find the ElementTree containing this element
9.12. Element.insert(): Insert a new child element
9.13. Element.items(): Produce attribute names and values
9.14. Element.iterancestors(): Find an element's ancestors
9.15. Element.iterchildren(): Find all children
9.16. Element.iterdescendants(): Find all descendants
9.17. Element.itersiblings(): Find other children of the same parent
9.18. Element.keys(): Find all attribute names
9.19. Element.remove(): Remove a child element
9.20. Element.set(): Set an attribute value
9.21. Element.xpath(): Evaluate an XPath expression
10. XPath processing
10.1. An XPath example
11. The art of Web-scraping: Parsing HTML with Beautiful Soup
12. Automated validation of input files
12.1. Validation with a Relax NG schema
12.2. Validation with an XSchema (XSD) schema
13. etbuilder.py: A simplified XML builder module
13.1. Using the etbuilder module
13.2. CLASS(): Adding class attributes
13.3. FOR(): Adding for attributes
13.4. subElement(): Adding a child element
13.5. addText(): Adding text content to an element
14. Implementation of etbuilder
14.1. Features differing from Lundh's original
14.2. Prologue
14.3. CLASS(): Helper function for adding CSS class attributes
14.4. FOR(): Helper function for adding XHTML for attributes
14.5. subElement(): Add a child element
14.6. addText(): Add text content to an element
14.7. class ElementMaker: The factory class
14.8. ElementMaker.__init__(): Constructor
14.9. ElementMaker.__call__(): Handle calls to the factory instance
14.10. ElementMaker.__handleArg(): Process one positional argument
14.11. ElementMaker.__getattr__(): Handle arbitrary method calls
14.12. Epilogue
14.13. testetbuilder: A test driver for etbuilder
15. rnc_validate: A module to validate XML against a Relax NG schema
15.1. Design of the rnc_validate module
15.2. Interface to the rnc_validate module
15.3. rnc_validate.py: Prologue
15.4. RelaxException
15.5. class RelaxValidator
15.6. RelaxValidator.validate()
15.7. RelaxValidator.__init__(): Constructor
15.8. RelaxValidator.__makeRNG(): Find or create an .rng file
15.9. RelaxValidator.__getModTime(): When was this file last changed?
15.10. RelaxValidator.__trang(): Translate .rnc to .rng format
16. rnck: A standalone script to validate XML against a Relax NG schema
16.1. rnck: Prologue
16.2. rnck: main()
16.3. rnck: checkArgs()
16.4. rnck: usage()
16.5. rnck: fatal()
16.6. rnck: message()
16.7. rnck: validateFile()
16.8. rnck: Epilogue

1. Introduction: Python and XML

As both Python and XML have grown in popularity, many packages have appeared that help you read, generate, and modify XML files from Python scripts. Compared to most of them, the lxml package has two big advantages:

  • Performance. Reading and writing even fairly large XML files takes an almost imperceptible amount of time.

  • Ease of programming. The lxml package is based on ElementTree, which Fredrik Lundh invented to simplify and streamline XML processing.

lxml is similar in many ways to two other, earlier packages:

  • Fredrik Lundh continues to maintain his original version of ElementTree.

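  • xml.etree.ElementTree is part of the Python standard library. There is a C-language version called cElementTree which may be even faster than lxml for some applications. In Python 3.3 and later, xml.etree.ElementTree uses that C implementation automatically when it is available, and the separate cElementTree module was removed in Python 3.9.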

However, the author prefers lxml for providing a number of additional features that make life easier. In particular, support for XPath makes it considerably easier to manage more complex XML structures.
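
As a rough illustration of how much the two packages have in common, and of where lxml goes further, here is a minimal sketch. The sample document, its element names, and the variable names are all invented for this example. The import falls back to the standard library's ElementTree when lxml is not installed, and the XPath call is used only when the element actually offers an xpath() method.

```python
# Prefer lxml when it is installed; otherwise fall back to the standard
# library's ElementTree, which shares the core calls used below.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical sample document, invented for this illustration.
SAMPLE = """<library>
  <book year="2001"><title>First</title></book>
  <book year="2009"><title>Second</title></book>
</library>"""

root = etree.fromstring(SAMPLE)

# These calls behave the same way in lxml and in xml.etree.ElementTree.
books = root.findall("book")
titles = [book.findtext("title") for book in books]
years = [book.get("year") for book in books]

if hasattr(root, "xpath"):
    # lxml only: one XPath expression selects the titles of newer books.
    recent = [str(t) for t in root.xpath("book[@year > 2005]/title/text()")]
else:
    # Standard library fallback: the same selection written out by hand.
    recent = [book.findtext("title") for book in books
              if int(book.get("year")) > 2005]

print(titles, years, recent)
```

With lxml the selection is a single expression. Without it, the filter has to be written as explicit Python, and that gap widens as documents grow more complex.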