The litsource program is worth study as an example not
only of literate programming but also of how easy it is to
process XML files in Python.
An earlier version of this script used the Document
Object Model (DOM) to build a tree representation of the
entire DocBook document. It then used XPath to pull from
this tree the set of
programlisting elements that had a
role attribute whose value started
with "outFile:". The code was
straightforward and quite short. See the literate exposition of the DOM version.
However, for large DocBook files, the DOM technique became time-consuming. For example, a 5400-line DocBook file took about 60 seconds to process. The current version, which uses Python's SAX interface (Simple API for XML), processed the same file in 0.11 seconds, a speedup of more than 500-fold.
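For comparison, the tree-based approach can be sketched as follows. This is not the earlier litsource code: the standard library's xml.dom.minidom has no XPath support, so this sketch swaps in getElementsByTagName plus an ordinary attribute test, and the names text_of and fragments_by_file are invented for the illustration.

```python
from xml.dom import minidom

ROLE_PREFIX = "outFile:"


def text_of(node):
    """Concatenate all text found beneath a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)


def fragments_by_file(xml_text):
    """Map each output filename to the list of code fragments destined for it."""
    # The whole document is parsed into a tree before any selection happens.
    doc = minidom.parseString(xml_text)
    fragments = {}
    for listing in doc.getElementsByTagName("programlisting"):
        role = listing.getAttribute("role")  # empty string if absent
        if role.startswith(ROLE_PREFIX):
            name = role[len(ROLE_PREFIX):].strip()
            fragments.setdefault(name, []).append(text_of(listing))
    return fragments
```

Building the full tree first is what makes this style convenient for small inputs and costly for large ones.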
SAX is a completely different approach to XML processing. It is a serial, event-based technique. The SAX interface reads through the XML and classifies each bit as a start tag, end tag, chunk of text, comment, and so on. The programmer defines a set of “handlers” that are called whenever specific types of content are encountered.
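As a rough illustration of the event model, the sketch below uses the standard xml.sax module to record the start-tag, end-tag, and text events for a small document. The EventLogger class and the log_events function are names invented for this example; they are not part of litsource.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX event in the order the parser reports it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # The parser may deliver one run of text in several chunks,
        # so a handler should not assume one call per text node.
        if content.strip():
            self.events.append(("text", content.strip()))


def log_events(xml_text):
    """Parse xml_text and return the list of recorded events."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```

Calling log_events("<para>Hello <emphasis>world</emphasis></para>") yields the events in document order, without any tree ever being built.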
Because the litsource script cares only about the text
inside selected programlisting
elements, there is no need to build a tree of the entire
DocBook file. All we need is a SAX interface with three
handlers:
One handler observes each start tag that goes by.
When it sees a <programlisting
role='outFile:filename'> tag, it remembers that we are now
inside a code fragment, and it also remembers the
filename, and opens an output file by that
name.
Another handler is called whenever the SAX interface sees text content. If we are currently inside a code fragment, that text content is written to the current output file.
A third handler observes each end tag. If it is a
</programlisting> tag,
we note that we're no longer inside a code fragment.
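The three handlers described above might look roughly like this. It is a minimal sketch, not the actual litsource source: the FragmentExtractor class, the extract_fragments function, and the out_dir parameter are hypothetical, and the decision to append when the same filename appears more than once in a document is an assumption.

```python
import os
import xml.sax

ROLE_PREFIX = "outFile:"


class FragmentExtractor(xml.sax.ContentHandler):
    """Copies the text of selected programlisting elements to files."""

    def __init__(self, out_dir="."):
        super().__init__()
        self.out_dir = out_dir
        self.out_file = None   # open file while inside a code fragment
        self.seen = set()      # filenames already written during this run

    def startElement(self, name, attrs):
        # Handler 1: notice the start of a code fragment and open its file.
        role = attrs.get("role", "")
        if name == "programlisting" and role.startswith(ROLE_PREFIX):
            file_name = role[len(ROLE_PREFIX):].strip()
            # Assumption: later fragments for the same file are appended.
            mode = "a" if file_name in self.seen else "w"
            self.seen.add(file_name)
            path = os.path.join(self.out_dir, file_name)
            self.out_file = open(path, mode, encoding="utf-8")

    def characters(self, content):
        # Handler 2: copy text only while inside a code fragment.
        if self.out_file is not None:
            self.out_file.write(content)

    def endElement(self, name):
        # Handler 3: leaving the fragment, so close the output file.
        if name == "programlisting" and self.out_file is not None:
            self.out_file.close()
            self.out_file = None


def extract_fragments(xml_text, out_dir="."):
    """Write all fragments found in xml_text; return the filenames used."""
    handler = FragmentExtractor(out_dir)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return sorted(handler.seen)
```

The only state carried between events is the currently open file, which is why no tree of the whole DocBook file is needed.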
Here are some good resources for learning about Python's XML libraries:
See the Python web site for information and downloads for the Python language.
Python & XML, by Christopher A. Jones and Fred L. Drake, Jr. (O'Reilly Press, 2002, ISBN 0-596-00128-2) is an excellent overview of the major approaches to Python XML processing, with copious examples.
In the online Python Library Reference, see the
documentation for the xml.sax module.
This program was written using the Cleanroom or zero-defect methodology. The best introduction to the method is given in Stavely, Allan M., Toward Zero-defect Programming, Addison-Wesley, 1999, ISBN 0-201-38595-3. Also see my Cleanroom pages for a discussion of how I practice the methodology.