
4. Literate exposition of the litsource program itself

The litsource program is worth study as an example not only of literate programming but also of how easy it is to process XML files in Python.

4.1. Design notes

An earlier version of this script used the Document Object Model (DOM) to build a tree representation of the entire DocBook document. It then used XPath to pull from this tree the set of programlisting elements that had a role attribute whose value started with "outFile:". The code was straightforward and quite short. See the literate exposition of the DOM version.
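For comparison, the tree-based approach can be sketched roughly as follows. This is not the code of the earlier DOM version. It is a minimal illustration that uses the standard library's xml.dom.minidom and a plain Python filter in place of XPath. The names PREFIX, find_fragments, and SAMPLE are hypothetical.

```python
from xml.dom import minidom

PREFIX = "outFile:"   # assumed role-attribute prefix, as described above


def find_fragments(xml_text):
    """Return (filename, text) pairs, one per matching programlisting."""
    doc = minidom.parseString(xml_text)   # builds the whole tree in memory
    result = []
    for node in doc.getElementsByTagName("programlisting"):
        role = node.getAttribute("role")  # "" when the attribute is absent
        if role.startswith(PREFIX):
            # Collect only the direct text children; markup nested inside
            # the listing is ignored in this sketch.
            text = "".join(
                child.data for child in node.childNodes
                if child.nodeType in (child.TEXT_NODE,
                                      child.CDATA_SECTION_NODE))
            result.append((role[len(PREFIX):], text))
    return result


SAMPLE = """<article>
<para>Prose.</para>
<programlisting role='outFile:hello.py'>print("hi")
</programlisting>
<programlisting>not extracted</programlisting>
</article>"""

fragments = find_fragments(SAMPLE)
```

Note that the entire document is parsed into a tree before any fragment is examined, which is where the cost comes from on large files.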

However, for large DocBook files, the DOM technique became time-consuming. For example, a 5400-line DocBook file took about 60 seconds to process. The current version, which uses Python's SAX interface (Simple API for XML), processes the same file in 0.11 seconds, a better than 500-fold speedup.

SAX is a completely different approach to XML processing. It is a serial, event-based technique. The SAX interface reads through the XML and classifies each bit as a start tag, end tag, chunk of text, comment, and so on. The programmer defines a set of “handlers” that are called whenever specific types of content are encountered.
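To make the event-based model concrete, here is a small sketch that records the events the parser reports for a tiny document. It is an illustration only: the class name EventLogger and the sample input are made up, and only the start-tag, end-tag, and text handlers of xml.sax.ContentHandler are overridden.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record each SAX event as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # The parser may deliver one run of text in several pieces;
        # whitespace-only pieces are skipped to keep the log short.
        if content.strip():
            self.events.append(("text", content))


logger = EventLogger()
xml.sax.parseString(b"<a><b>hi</b></a>", logger)
for event in logger.events:
    print(event)
```

The handlers are called in document order as the parser reads the input, so no tree is ever built.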

Because the litsource script cares only about the text inside selected programlisting elements, there is no need to build a tree of the entire DocBook file. All we need is a SAX interface with three handlers:

  1. One handler observes each start tag that goes by. When it sees a <programlisting role='outFile:filename'> tag, it notes that we are now inside a code fragment, remembers the filename, and opens an output file by that name.

  2. Another handler is called whenever the SAX interface sees text content. If we are currently inside a code fragment, that text content is written to the current output file.

  3. A third handler observes each end tag. If it is a </programlisting> tag, we note that we're no longer inside a code fragment.
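The three handlers above can be sketched as a single xml.sax.ContentHandler subclass. This is a minimal sketch, not the litsource source itself. The names FragmentHandler, extract, and out_dir are hypothetical. It also assumes that when several fragments name the same output file, they are written to that file in document order.

```python
import os
import xml.sax

PREFIX = "outFile:"   # role-attribute prefix that marks a code fragment


class FragmentHandler(xml.sax.ContentHandler):
    """Copy the text of selected programlisting elements to files."""

    def __init__(self, out_dir="."):
        super().__init__()
        self.out_dir = out_dir
        self.files = {}       # filename -> open file object
        self.current = None   # file for the current fragment, else None

    def startElement(self, name, attrs):
        # Handler 1: notice the start of a code fragment.
        role = attrs.get("role", "")
        if name == "programlisting" and role.startswith(PREFIX):
            filename = role[len(PREFIX):]
            if filename not in self.files:
                path = os.path.join(self.out_dir, filename)
                self.files[filename] = open(path, "w", encoding="utf-8")
            self.current = self.files[filename]

    def characters(self, content):
        # Handler 2: copy text only while inside a code fragment.
        if self.current is not None:
            self.current.write(content)

    def endElement(self, name):
        # Handler 3: leaving a programlisting ends the fragment.
        if name == "programlisting":
            self.current = None

    def endDocument(self):
        # Close the output files once the whole document has been read.
        for f in self.files.values():
            f.close()


def extract(xml_path, out_dir="."):
    """Process one DocBook file; return the output file names, sorted."""
    handler = FragmentHandler(out_dir)
    xml.sax.parse(xml_path, handler)
    return sorted(handler.files)
```

With these assumptions, a call such as extract("manual.xml") would write each named file into the current directory. A production version would also want to close the output files if parsing fails partway through.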

Here are some good resources for learning about Python's XML libraries:

  • See the Python web site for information and downloads for the Python language.

  • Python & XML, by Christopher A. Jones and Fred L. Drake, Jr. (O'Reilly Press, 2002, ISBN 0-596-00128-2) is an excellent overview of the major approaches to Python XML processing, with copious examples.

  • In the online Python Library Reference, see the documentation for the xml.sax module.

This program was written using the Cleanroom or zero-defect methodology. The best introduction to the method is given in Stavely, Allan M., Toward Zero-defect Programming, Addison-Wesley, 1999, ISBN 0-201-38595-3. Also see my Cleanroom pages for a discussion of how I practice the methodology.