What happens to your application if you read a file that does not conform to the schema? There are two ways to deal with error handling.
If you are a careful and defensive programmer, you will always check for the presence and validity of every part of the XML document you are reading, and issue an appropriate error message. If you aren't careful or defensive enough, your application may crash.
It can make your application a lot simpler if you mechanically validate the input file against the schema that defines its document type.
With the lxml module, the latter approach is inexpensive both in programming effort and in runtime. You can validate a document against a schema written in any of the schema languages that lxml supports, including Relax NG, which is described here.
The lxml module can validate a document, in the form of an ElementTree, against a schema expressed in the Relax NG notation. For more information about Relax NG, see Relax NG Compact Syntax (RNC).
A Relax NG schema can use two forms: the compact syntax (RNC), or an XML document type (RNG). If your schema uses RNC, you must translate it to RNG format. The trang utility does this conversion for you. Use a command of this form:
    trang file.rnc file.rng
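If the conversion needs to happen from inside a Python script, the same command can be run through the standard subprocess module. The following is only a sketch: the function names trang_command and convert_rnc_to_rng are hypothetical, and it assumes that a trang executable is installed on the PATH and that it reports failure through a nonzero exit status.

```python
import shutil
import subprocess


def trang_command(rnc_path, rng_path):
    """Build the argument list equivalent to: trang file.rnc file.rng"""
    return ["trang", rnc_path, rng_path]


def convert_rnc_to_rng(rnc_path, rng_path):
    """Run trang to translate an RNC schema into an RNG file.

    Raises FileNotFoundError if no trang executable is on the PATH,
    and subprocess.CalledProcessError if trang exits with a nonzero
    status (an assumption about how trang reports failure).
    """
    if shutil.which("trang") is None:
        raise FileNotFoundError("trang executable not found on PATH")
    subprocess.run(trang_command(rnc_path, rng_path), check=True)
```

Calling convert_rnc_to_rng("file.rnc", "file.rng") would then leave file.rng ready to be parsed as described in the steps that follow.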
Once you have the schema available as an .rng file, use these steps to validate an element tree ET.

1. Parse the .rng file into its own ElementTree, as described in Section 6.3, “The ElementTree() constructor”.

2. Use the constructor etree.RelaxNG(S) to convert that tree into a “schema instance,” where S is the ElementTree instance, containing the schema, from the previous step. If the tree is not a valid Relax NG schema, the constructor will raise an etree.RelaxNGParseError exception.

3. Use the .validate(ET) method of the schema instance to validate ET. This method returns a true value (1) if ET validates against the schema, or a false value (0) if it does not. If the method returns a false value, the schema instance has an attribute named .error_log containing all the errors detected by the schema instance. You can print .error_log.last_error to see the most recent error detected.
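The steps above can be sketched with an in-memory schema and two tiny documents. This is a minimal illustration, not part of the original program: the schema, the element names greeting and farewell, and the helper names make_schema and check are all hypothetical, and the schema is parsed from a byte string rather than from an .rng file.

```python
from io import BytesIO

from lxml import etree

# Hypothetical Relax NG schema in XML (RNG) form: the document must be
# a single <greeting> element containing only text.
RNG_SCHEMA = b"""\
<element name="greeting" xmlns="http://relaxng.org/ns/structure/1.0">
  <text/>
</element>
"""


def make_schema(rng_bytes):
    """Steps 1 and 2: parse the schema, then build a schema instance."""
    schema_tree = etree.parse(BytesIO(rng_bytes))
    return etree.RelaxNG(schema_tree)


def check(schema, xml_bytes):
    """Step 3: return None if the document is valid, else the last error."""
    doc = etree.parse(BytesIO(xml_bytes))
    if schema.validate(doc):
        return None
    return schema.error_log.last_error


schema = make_schema(RNG_SCHEMA)
good = check(schema, b"<greeting>hello</greeting>")
bad = check(schema, b"<farewell>bye</farewell>")
print("good document:", good)
print("bad document:", bad)
```

The first document matches the schema, so check() returns None. The second uses the wrong root element, so check() returns the most recent entry from the schema instance's .error_log.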
Here is a small program that takes as command line arguments an RNG schema file name followed by one or more XML file names. It validates each XML file in turn against that schema.
#!/usr/bin/env python3
#================================================================
# valrelax:  Validate files against Relax NG
#----------------------------------------------------------------
# Command line arguments:
#   valrelax SCHEMA.rng file1.xml file2.xml ...
#----------------------------------------------------------------
import sys
from lxml import etree

def main():
    # [ schemaFile  :=  first command line argument
    #   fileList  :=  remaining command line arguments ]
    schemaFile = sys.argv[1]
    fileList = sys.argv[2:]

    # [ schema  :=  an etree.RelaxNG instance that represents
    #               schemaFile ]
    schemaDoc = etree.parse(schemaFile)
    try:
        schema = etree.RelaxNG(schemaDoc)
    except etree.RelaxNGParseError as details:
        print("*** %s is not a Relax NG schema: %s" %
              (schemaFile, details), file=sys.stderr)
        raise SystemExit(1)

    # [ sys.stdout  +:=  messages about files in fileList,
    #       including failure reports for files that don't
    #       validate against schema ]
    for fileName in fileList:
        print("=== Validating", fileName)
        validate(schema, fileName)

def validate(schema, fileName):
    """Validate one file against the schema.

      [ (schema is an etree.RelaxNG instance) and
        (fileName is a string) ->
          if fileName names a readable, well-formed XML file that
          validates against schema ->
            I  (nothing is printed)
          else ->
            sys.stdout  +:=  failure report ]
    """
    # [ if fileName names a readable, well-formed XML file ->
    #     doc  :=  an etree.ElementTree representing that file
    #   else ->
    #     sys.stdout  +:=  failure report
    #     return ]
    try:
        doc = etree.ElementTree(file=fileName)
    except etree.XMLSyntaxError as detail:
        print("*** Not well-formed: %s" % detail)
        return
    except OSError as detail:
        # IOError is an alias of OSError in Python 3.
        print("*** I/O error reading '%s': %s" % (fileName, detail))
        return

    # [ if doc is valid by schema ->
    #     I  (nothing is printed)
    #   else ->
    #     sys.stdout  +:=  failure report ]
    result = schema.validate(doc)
    if not result:
        print("*** %s not valid: %s" % (fileName, schema.error_log))

#================================================================
if __name__ == '__main__':
    main()
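The program above prints the whole .error_log in one piece. When a report needs one line per problem, the log can be iterated instead, since each entry carries attributes such as .line and .message. The sketch below assumes a hypothetical helper named validation_errors and a made-up schema for a <greeting> element; it is an illustration of the idea rather than part of valrelax.

```python
from lxml import etree

# Hypothetical schema: a <greeting> element that contains only text.
DEMO_SCHEMA = etree.RelaxNG(etree.fromstring(b"""\
<element name="greeting" xmlns="http://relaxng.org/ns/structure/1.0">
  <text/>
</element>
"""))


def validation_errors(schema, doc):
    """Return a list of (line, message) pairs; empty if doc is valid."""
    if schema.validate(doc):
        return []
    return [(entry.line, entry.message) for entry in schema.error_log]


valid_doc = etree.fromstring(b"<greeting>hello</greeting>")
invalid_doc = etree.fromstring(b"<greeting><extra/></greeting>")

# Report each problem on its own line.
for line, message in validation_errors(DEMO_SCHEMA, invalid_doc):
    print("line %s: %s" % (line, message))
```

Returning a list keeps the reporting format out of the validation step, so the same helper could feed either a console report or a log file.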