
10. Automated validation of input files

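What happens to your application if you read a file that does not conform to the schema? There are two ways to deal with error handling: check each part of the document by hand as your code reads it, or validate the whole file against its schema before you process it.

With the lxml module, the latter approach is inexpensive both in programming effort and in runtime. You can validate a document against either of two major schema languages, Relax NG or XML Schema (XSD).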

10.1. Validation with a Relax NG schema

The lxml module can validate a document, in the form of an ElementTree, against a schema expressed in the Relax NG notation. For more information about Relax NG, see Relax NG Compact Syntax (RNC).

A Relax NG schema can use two forms: the compact syntax (RNC), or an XML document type (RNG). If your schema uses RNC, you must translate it to RNG format. The trang utility does this conversion for you. Use a command of this form:

trang file.rnc file.rng

Once you have the schema available as an .rng file, use these steps to validate an element tree ET.

  1. Parse the .rng file into its own ElementTree, as described in Section 6.3, “The ElementTree() constructor”.

  2. Use the constructor etree.RelaxNG(S) to convert that tree into a “schema instance,” where S is the ElementTree instance, containing the schema, from the previous step.

    If the tree is not a valid Relax NG schema, the constructor will raise an etree.RelaxNGParseError exception.

  3. Use the .validate(ET) method of the schema instance to validate ET.

    This method returns True if ET validates against the schema, or False if it does not.

    If the method returns False, the schema instance has an attribute named .error_log containing all the errors detected by the schema instance. You can print .error_log.last_error to see the most recent error detected.
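
The three steps above can be sketched entirely in memory. This is only a minimal illustration, not part of the original program: the schema and the sample documents are invented for the example, and io.BytesIO stands in for real .rng and .xml files.

```python
import io
from lxml import etree

# Hypothetical schema in RNG form: a <note> element containing only text.
RNG_SOURCE = b"""\
<element name="note" xmlns="http://relaxng.org/ns/structure/1.0">
  <text/>
</element>
"""

# Step 1: parse the .rng document into its own ElementTree.
schema_tree = etree.parse(io.BytesIO(RNG_SOURCE))

# Step 2: convert that tree into a schema instance.
schema = etree.RelaxNG(schema_tree)

# Step 3: validate documents against the schema instance.
good_doc = etree.parse(io.BytesIO(b"<note>Remember the milk</note>"))
bad_doc = etree.parse(io.BytesIO(b"<memo>Wrong root element</memo>"))

good_result = schema.validate(good_doc)
bad_result = schema.validate(bad_doc)

print("good_doc valid? %s" % good_result)
print("bad_doc valid? %s" % bad_result)
if not bad_result:
    # .error_log holds the errors detected by the schema instance.
    print("last error: %s" % schema.error_log.last_error)

# A tree that is not a Relax NG schema makes the constructor raise.
parse_failed = False
try:
    etree.RelaxNG(etree.parse(io.BytesIO(b"<not-a-schema/>")))
except etree.RelaxNGParseError as detail:
    parse_failed = True
    print("not a schema: %s" % detail)
```

Building the schema instance once and reusing it for every document, as this sketch does, avoids reparsing the schema for each file.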

Here is a small program that takes as command line arguments an RNG schema file name followed by one or more XML file names. It validates each XML file in turn against that schema.

#!/usr/bin/env python
#================================================================
# valrelax:  Validate files against Relax NG
#----------------------------------------------------------------
# Command line arguments:
#   valrelax SCHEMA.rng file1.xml file2.xml ...
#----------------------------------------------------------------

import sys
from lxml import etree

def main():
    # [ schemaFile  :=  first command line argument
    #   fileList  :=  remaining command line arguments ]
    schemaFile  =  sys.argv[1]
    fileList  =  sys.argv[2:]

    # [ schema  :=  an etree.RelaxNG instance that represents
    #               schemaFile ]
    schemaDoc = etree.parse ( schemaFile )
    try:
        schema = etree.RelaxNG ( schemaDoc )
    except etree.RelaxNGParseError as details:
        sys.stderr.write ( "*** %s is not a Relax NG schema: %s\n" %
                           (schemaFile, details) )
        raise SystemExit ( 1 )

    # [ sys.stdout  +:=  messages about files in fileList,
    #                    including failure reports for files that
    #                    don't validate against schema ]
    for  fileName in fileList:
        print ( "=== Validating %s" % fileName )
        validate ( schema, fileName )

def validate ( schema, fileName ):
    """Validate one file against the schema.

      [ (schema is an etree.RelaxNG instance) and
        (fileName is a string) ->
          if  fileName names a readable, well-formed XML file that
          validates against schema ->
            I
          else ->
            sys.stdout  +:=  failure report ]
    """

    # [ if fileName names a readable, well-formed XML file ->
    #     doc  :=  an etree.ElementTree representing that file
    #   else ->
    #     sys.stdout  +:=  failure report
    #     return ]
    try:
        doc  =  etree.ElementTree ( file=fileName )
    except etree.XMLSyntaxError as detail:
        print ( "*** Not well-formed: %s" % detail )
        return
    except IOError as detail:
        print ( "*** I/O error reading '%s': %s" % (fileName, detail) )
        return

    # [ if doc is valid by schema ->
    #     I
    #   else ->
    #     sys.stdout  +:=  failure report ]
    result  =  schema.validate ( doc )
    if  not result:
        print ( "*** %s not valid: %s" % (fileName, schema.error_log) )
      

#================================================================
if __name__ == '__main__':
    main()
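
The program above prints the whole .error_log at once. As a variation, the log can also be walked entry by entry to build one message per error. The sketch below is an illustration under stated assumptions: describe_errors is a hypothetical helper name, and the schema and document are invented samples. It relies on the .line and .message attributes that lxml log entries carry.

```python
import io
from lxml import etree

def describe_errors(schema):
    """Return one 'line N: message' string per entry in schema.error_log."""
    return ["line %d: %s" % (entry.line, entry.message)
            for entry in schema.error_log]

# Hypothetical schema: a <list> containing one or more <item> elements.
RNG_SOURCE = b"""\
<element name="list" xmlns="http://relaxng.org/ns/structure/1.0">
  <oneOrMore>
    <element name="item"><text/></element>
  </oneOrMore>
</element>
"""
schema = etree.RelaxNG(etree.parse(io.BytesIO(RNG_SOURCE)))

# Invalid sample: <entry> appears where only <item> is allowed.
XML_SOURCE = b"<list>\n  <item>a</item>\n  <entry>b</entry>\n</list>"
doc = etree.parse(io.BytesIO(XML_SOURCE))

is_valid = schema.validate(doc)
report = describe_errors(schema)
for message in report:
    print(message)
```

Reporting the line number alongside each message makes it easier to locate the offending element in a large input file.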