
4. archx.py: The script

The actual archx.py script follows.

4.1. Prologue

The script starts with a comment block pointing back here to the documentation, and two variables for the program name and version number.

archx.py
#!/usr/local/bin/python
#================================================================
# archx.py: Index one archive directory of bird images.
#   For documentation, see:
#    http://www.nmt.edu/~john/slides/archx/
#----------------------------------------------------------------
PROGRAM_NAME      =  "archx.py"
EXTERNAL_VERSION  =  "0.0"

4.2. Imports

Next come the imports, starting with standard Python modules. We need sys for the command line arguments and standard streams, and os to read directories.

archx.py
#================================================================
# Imports
#----------------------------------------------------------------

import sys
import os

The Python Imaging Library deals with images: it can size an image and make a thumbnail. For more information, see Python Imaging Library (PIL).

archx.py
import Image

Next we'll need the author's module for generating XML. For sources and documentation, see Python and the XML Document Object Model (DOM) with 4Suite.

archx.py
import xml4create as xc

The birdimages.py module is an interface to the birdimages.xml file that allows us to look up catalog numbers.

archx.py
import birdimages

The next import needs a little explanation. In the code that refers to XML, we don't want to use string constants for element or attribute names like image and cat-no. Preferred practice is to declare a global, manifest constant for each element or attribute name, so that if the schema changes, we can rapidly locate all references to the changed name. So we use a tool named pyrang that extracts all the element and attribute names from a schema, and generates Python assignment statements for each one. See pyrang: A single-sourcing tool for Python-XML applications. The generated file is named rnc_archx.py, and the generated names are prefixed with the string “RNC_”, and suffixed with “_N” for element names and “_A” for attribute names. For example, the name of the image element is RNC_IMAGE_N, and the name of the cat-no attribute is RNC_CAT_NO_A.

archx.py
from rnc_archx import *
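
To illustrate the naming convention just described, here is a minimal sketch of how a schema name could map to a constant name. The helper constant_name() is hypothetical and is not part of pyrang; it only reproduces the two examples given above, and pyrang's actual logic may differ.

```python
def constant_name(schema_name, kind):
    """Return the RNC_ constant name for an element or attribute.

    Hypothetical helper for illustration only.  kind must be
    "element" or "attribute".
    """
    suffix = {"element": "_N", "attribute": "_A"}[kind]
    # Hyphens are not legal in Python identifiers, so map them to
    # underscores and uppercase the rest.
    return "RNC_" + schema_name.upper().replace("-", "_") + suffix


# Assignments of the kind the generated file would contain:
for name, kind in [("image", "element"), ("cat-no", "attribute")]:
    print('%s = "%s"' % (constant_name(name, kind), name))
```

If the schema renames cat-no, only the generated assignment changes its string value; a search for RNC_CAT_NO_A finds every place the script depends on it.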

4.3. Manifest constants

This section defines constants used throughout the script.

archx.py
#================================================================
# Manifest constants
#----------------------------------------------------------------

4.3.1. ARCHIVE_PREFIX

This string is prefixed to the archive number to get the name of the archive directory.

archx.py
ARCHIVE_PREFIX  =  "bird-"

4.3.2. BIRD_CATALOG_NAME: Name of the bird images catalog

This is the name of the XML file representing the catalog of bird images, which we check to ensure that all archived images are cataloged.

archx.py
BIRD_CATALOG_NAME  =  "birdimages.xml"

4.3.3. INDEX_DIR: Archive index directory name

Name of the subdirectory where the archive index files are written.

archx.py
INDEX_DIR  =  "indices/"

4.3.4. THUMB_DIR

Name of the subdirectory where thumbnail images are written.

archx.py
THUMB_DIR  =  "thumb/"

4.3.5. THUMB_WIDE: Width of thumbnails

Maximum width of a thumbnail image in pixels.

archx.py
THUMB_WIDE  =  200

4.3.6. THUMB_HIGH: Height of thumbnails

Maximum height of a thumbnail image in pixels.

archx.py
THUMB_HIGH  =  200

4.3.7. THUMB_EXTENSION: File type for thumbnail images

This is the filename extension that determines what image file type is written in the thumbnail directory.

archx.py
THUMB_EXTENSION  =  ".jpg"

4.4. Main

Note

This script is written under a blanket precondition that we have write access to the thumbnail directory, THUMB_DIR, and the index directory, INDEX_DIR.

Here is the main program, including an intended function for the script as a whole.

archx.py
#================================================================
# Functions and classes
#----------------------------------------------------------------


# - - -   m a i n   - - -

def main():
    """Main program.

      [ let
            archive-set == set of archive directories named
                in command line arguments
            file-set == set of image files in archive directories
                named in command line arguments
        in:
            sys.stdout  +:=  report listing files in file-set
                that appear to be bird images but are not in
                the bird image catalog
            thumbnail directory  :=  thumbnail directory with
                thumbnail images made from file-set
            index directory  :=  index directory with arch-NNN.xml
                files added describing archive-set ]
    """

First we must read the bird image catalog, which will be used to verify that all the archived images are properly cataloged. The .readFile() method is a static method in class ImageCatalog that reads the XML serialization of the image catalog and returns it as an ImageCatalog object.

archx.py
    #-- 1 --
    # [ BIRD_CATALOG_NAME is a readable file valid against
    #   birdimages.rnc ->
    #       imageCatalog  :=  an ImageCatalog object representing
    #           BIRD_CATALOG_NAME ]
    imageCatalog  =  birdimages.ImageCatalog.readFile (
                         BIRD_CATALOG_NAME )

All that remains is to step through the archive names given as command line arguments, and process each one. See Section 4.5, “processArchive(): Process the contents of one archive”.

archx.py
    #-- 2 --
    for  archNo in sys.argv[1:]:
        #-- 2 body --
        # [ sys.stdout  +:=  report listing image files in
        #       archive (archNo) that appear to be bird images
        #       but are not in the bird image catalog
        #   thumbnail directory  :=  thumbnail directory with
        #       thumbnail images made from image files in archive (archNo)
        #   index directory  :=  index directory with arch-(archNo).xml
        #       files added describing image files in archive (archNo) ]
        processArchive ( imageCatalog, archNo )

4.5. processArchive(): Process the contents of one archive

This function does all the processing for one archive directory full of images.

archx.py
# - - -   p r o c e s s A r c h i v e   - - -

def processArchive ( imageCatalog, archNo ):
    """Process one archive directory.

      [ (imageCatalog is a birdimages.ImageCatalog instance) and
        (archNo is the numeric part of an archive directory name) ->
          sys.stdout  +:=  report listing image files in
              archive (archNo) that appear to be bird images
              but are not in imageCatalog
          thumbnail directory  :=  thumbnail directory with
              thumbnail images made from image files in archive (archNo)
          index directory  :=  index directory with an
              arch-(archNo).xml file added describing image files
              in archive (archNo) ]
    """

The archNo argument is the archive number, which must be appended to ARCHIVE_PREFIX to get the name of the archive directory. We also allocate an empty list imagexList that will accumulate descriptions of each bird image as a sequence of Imagex objects; see Section 4.8, “class Imagex: An object to describe one image”.

archx.py
    #-- 1 --
    # [ archDir  :=  directory name for archive (archNo )
    #   imagexList  :=  a new, empty list ]
    archDir  =  ARCHIVE_PREFIX + archNo
    imagexList  =  []
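
To make the naming concrete, the following sketch traces one command line argument through to the archive directory name and to the index file name that the script later derives from it in writeIndex(). The constants are copied from Section 4.3; the function archive_names() and the archive number "001" are made up for this illustration.

```python
ARCHIVE_PREFIX = "bird-"
INDEX_DIR = "indices/"


def archive_names(arch_no):
    """Return (archive directory, index file path) for one archive number.

    Hypothetical helper: the script computes these two values in
    processArchive() and writeIndex() respectively.
    """
    arch_dir = ARCHIVE_PREFIX + arch_no
    index_path = "%s%s.xml" % (INDEX_DIR, arch_dir)
    return arch_dir, index_path


arch_dir, index_path = archive_names("001")
print(arch_dir)      # bird-001
print(index_path)    # indices/bird-001.xml
```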

We use the standard library os.listdir() function to get a list of all the files in that directory. Just for neatness, we'll then sort it.

archx.py
    #-- 2 --
    # [ fileList  :=  list of files in directory (archDir), sorted ]
    fileList  =  os.listdir ( archDir )
    fileList.sort()
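
As a small self-contained sketch of this step, the code below builds a throwaway directory and lists it. It is written for Python 3, and the file names are invented. Note that os.listdir() returns bare file names in arbitrary order, not full paths, which is why the script joins each name to archDir later.

```python
import os
import tempfile


def sorted_listing(dir_path):
    """Return the names of the entries in dir_path, sorted."""
    names = os.listdir(dir_path)   # order is arbitrary
    names.sort()
    return names


# Build a temporary stand-in for an archive directory.
with tempfile.TemporaryDirectory() as arch_dir:
    for name in ["19990523.07.jpg", "n19990523.08.jpg", "19990101.01.jpg"]:
        open(os.path.join(arch_dir, name), "w").close()
    listing = sorted_listing(arch_dir)

print(listing)
```

Because digits sort before lowercase letters, the bird images come out ahead of the nonbird images whose names start with “n”.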

For each file name in fileList, we call processFile() to perform the processing steps for that file, including the generation of an Imagex object that will hold the information we need to write the index file; see Section 4.6, “processFile(): Check one image file”.

archx.py
    #-- 3 --
    # [ imagexList  +:=  Imagex objects representing the
    #       files in fileList that are valid bird images and
    #       indexed in imageCatalog
    #   sys.stdout  +:=  report of bird images in fileList that are
    #       not in imageCatalog ]
    for  fileName in fileList:
        #-- 3 loop --
        # [ if (archDir+fileName) is a bird image in imageCatalog ->
        #       imagexList  +:=  an Imagex object representing
        #                        that image
        #       thumbnail directory  +:=  a thumbnail of that image
        #   else if (archDir+fileName) is a bird image but not
        #   in imageCatalog ->
        #       sys.stdout  +:=  (message about uncataloged image)
        #   else -> I ]

The body of the loop has three steps. First we build the relative path to the image file. Then we call processFile(), which returns an Imagex object if everything went okay; otherwise it returns None. Finally, if the return value is not None, we append it to imagexList. See Section 4.6, “processFile(): Check one image file”.

archx.py
        #-- 3.1 --
        # [ pathName  :=  archDir + fileName ]
        pathName  =  os.path.join ( archDir, fileName )

        #-- 3.2 --
        # [ if pathName is a bird image in imageCatalog ->
        #       thumbnail directory  +:=  a thumbnail of that
        #           image
        #       result  :=  an Imagex object representing that
        #           image
        #   else if pathName looks like a bird image but is not in
        #   imageCatalog ->
        #       sys.stdout  +:=  error message
        #       result  :=  None
        #   else ->
        #       result  :=  None ]
        result  =  processFile ( imageCatalog, pathName )

        #-- 3.3 --
        if  result is not None:
            imagexList.append ( result )
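
The three-step loop body can be tried in isolation with a stand-in for processFile(). In this sketch, fake_process() is a hypothetical substitute that accepts any name starting with a digit; the real function also consults the catalog and reads the image.

```python
import os


def collect_results(arch_dir, file_list, process):
    """Apply process() to each path and keep the non-None results."""
    results = []
    for file_name in file_list:
        path_name = os.path.join(arch_dir, file_name)
        result = process(path_name)
        if result is not None:
            results.append(result)
    return results


def fake_process(path_name):
    """Hypothetical stand-in for processFile()."""
    base = os.path.basename(path_name)
    return base if base[:1].isdigit() else None


kept = collect_results("bird-001", ["19990523.07.jpg", "notes.txt"],
                       fake_process)
print(kept)
```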

All that remains is to write the index file for this archive. The writeIndex() function needs two items: the archive directory name, and the list imagexList containing the details of the valid, cataloged images. See Section 4.7, “writeIndex(): Output the index for one archive directory”.

archx.py
    #-- 4 --
    # [ imagexList is a list of Imagex objects ->
    #     index directory  :=  index directory with an
    #     (archDir+".xml") file added representing imagexList ]
    writeIndex ( archDir, imagexList )

4.6. processFile(): Check one image file

This function is called to check one file name. If the file isn't a bird image, it gets ignored. If its name resembles that of a bird image, it is checked to make sure it's in the catalog.

archx.py
# - - -   p r o c e s s F i l e   - - -

def processFile ( imageCatalog, pathName ):
    """Check one file and, if valid, return an Imagex object.

      [ (imageCatalog is an ImageCatalog) and
        (pathName is a nonempty string) ->
          if (pathName looks like a bird image name) and
          (pathName is a catalog number in imageCatalog) and
          (pathName names a readable image file) ->
            thumbnail directory  +:=  a thumbnail of that image
            return an Imagex object representing that image
          else if (pathName looks like a bird image name) and
          ((pathName is not a catalog number in imageCatalog) or
           (pathName does not name a readable image file)) ->
             sys.stdout  +:=  error message
             return None
          else ->
            return None ]
    """

First we disassemble the full path name of the image file, saving its file name (minus the extension) in baseName.

archx.py
    #-- 1 --
    # [ baseName  :=  pathName minus its path component and
    #                 file extension ]
    dirPath, fileName  =  os.path.split ( pathName )
    baseName, extension  =  os.path.splitext ( fileName )
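
Here is the same two-step disassembly as a standalone sketch; the sample path is invented. Since os.path.splitext() splits only at the last period, a period inside the catalog number survives in the base name.

```python
import os.path


def base_name_of(path_name):
    """Return path_name minus its directory part and file extension."""
    dir_path, file_name = os.path.split(path_name)
    base_name, extension = os.path.splitext(file_name)
    return base_name


print(base_name_of("bird-001/19990523.07.jpg"))   # 19990523.07
```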

At this writing, all images have one of two file name formats. Bird images have a year-month-day format:

yyyymmddxnn

The nn part is the film frame number. The x character is usually a period, but can be a lowercase letter when there are multiple images on the same day with the same frame number.

Nonbird images are prefixed with the letter “n”:

nyyyymmddxnn        

So at this point in time, the test for whether a file represents a bird image is to see whether it starts with a digit. If not, we can just return None; no error checking is done on nonbird images.

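There is one extra wrinkle: a “hidden file” whose name starts with '.'. In Python versions before 2.6, os.path.splitext() treats the whole name as the extension, so baseName is the empty string; later versions leave the leading period in baseName. Either way the name does not start with a digit, so the file is clearly not a bird image.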

archx.py
    #-- 2 --
    # [ if baseName is nonempty and starts with a digit ->
    #     I
    #   else -> return None ]
    if  ( ( len(baseName) == 0 ) or
          ( not ( baseName[0].isdigit() ) ) ):
        return None
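
The sketch below restates the script's first-character test as a predicate, and alongside it shows a stricter check against the yyyymmddxnn format. The regular expression is our own assumption (in particular, that the frame number is exactly two digits); the script itself does not use it.

```python
import re

# Assumed pattern: 8 digits, then a period or lowercase letter,
# then a 2-digit frame number.  Not part of archx.py.
BIRD_NAME_RE = re.compile(r"^[0-9]{8}[.a-z][0-9]{2}$")


def looks_like_bird_image(base_name):
    """The script's test: nonempty and starting with a digit."""
    return len(base_name) > 0 and base_name[0].isdigit()


def matches_name_format(base_name):
    """Stricter, hypothetical test against the full name format."""
    return BIRD_NAME_RE.match(base_name) is not None


for name in ["19990523.07", "19990523a07", "n19990523.07", ".hidden", ""]:
    print(repr(name), looks_like_bird_image(name), matches_name_format(name))
```

The looser test has the advantage that a misnamed bird image still reaches the catalog check and gets reported there, instead of being silently skipped.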

Next we check to see if the image has been cataloged. If not, we write an error message and return None.

archx.py
    #-- 3 --
    # [ if baseName is a catalog number in imageCatalog ->
    #     I
    #   else ->
    #     sys.stdout  +:=  error message
    #     return None ]
    try:
        orig  =  imageCatalog.getOriginal ( baseName )
    except KeyError:
        print ( "*** Uncataloged: %s" % pathName )
        return None
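
The lookup-or-report pattern can be exercised without the real catalog. In the sketch below, FakeCatalog is a hypothetical stand-in for birdimages.ImageCatalog; the only behavior it borrows from the script is that .getOriginal() raises KeyError for an unknown catalog number.

```python
class FakeCatalog:
    """Hypothetical stand-in for birdimages.ImageCatalog."""

    def __init__(self, cat_nos):
        self._entries = dict((n, "original of " + n) for n in cat_nos)

    def getOriginal(self, cat_no):
        # A plain dictionary lookup raises KeyError when cat_no is missing.
        return self._entries[cat_no]


def is_cataloged(catalog, base_name, path_name):
    """Return True if base_name is cataloged; otherwise report and return False."""
    try:
        catalog.getOriginal(base_name)
    except KeyError:
        print("*** Uncataloged: %s" % path_name)
        return False
    return True


catalog = FakeCatalog(["19990523.07"])
print(is_cataloged(catalog, "19990523.07", "bird-001/19990523.07.jpg"))
print(is_cataloged(catalog, "19990524.01", "bird-001/19990524.01.jpg"))
```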

Next we use the Imagex class constructor to read the image file and extract the width and height. See Section 4.8, “class Imagex: An object to describe one image”.

archx.py
    #-- 4 --
    # [ if pathName names a readable, valid image file ->
    #     result  :=  an Imagex object representing that image
    #   else -> raise IOError ]
    result  =  Imagex ( pathName )

One more task remains: creation of the thumbnail image. The path name of the thumbnail is THUMB_DIR + baseName + THUMB_EXTENSION. Converting the full-sized image to a thumbnail is a single method call on the Image object: the .thumbnail() method takes a 2-tuple specifying the maximum width and height, and preserves the aspect ratio.

Technically, we have broken the encapsulation of the Imagex object by replacing its .image attribute with a different image. However, since that attribute is not used for anything except writing the index file after this, no harm is done. Also, the steps we would need to take to avoid this are computationally expensive: we would need to make a copy of the entire image (many of which run into the tens of megabytes) before reducing it to a thumbnail.

archx.py
    #-- 5 --
    # [ thumbPath  :=  THUMB_DIR + baseName + THUMB_EXTENSION
    #   result     :=  result with its image replaced by a
    #       thumbnail no larger than (THUMB_WIDE, THUMB_HIGH) ]
    thumbPath  =  "%s%s%s" % (THUMB_DIR, baseName, THUMB_EXTENSION)
    result.image.thumbnail ( (THUMB_WIDE, THUMB_HIGH) )
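
To show what “no larger than (THUMB_WIDE, THUMB_HIGH)” means with the aspect ratio preserved, here is a pure-Python sketch of the size arithmetic together with the thumbnail path. The function fitted_size() is our own approximation and does not call the imaging library; the library's rounding may differ by a pixel.

```python
THUMB_DIR = "thumb/"
THUMB_WIDE = 200
THUMB_HIGH = 200
THUMB_EXTENSION = ".jpg"


def thumb_path(base_name):
    """Return the path name of the thumbnail for one catalog number."""
    return "%s%s%s" % (THUMB_DIR, base_name, THUMB_EXTENSION)


def fitted_size(wide, high, max_wide=THUMB_WIDE, max_high=THUMB_HIGH):
    """Approximate the size of a thumbnail that fits in the given box.

    Integer arithmetic only; an image that already fits is not enlarged.
    """
    if wide <= max_wide and high <= max_high:
        return wide, high
    if wide * max_high > high * max_wide:
        # Width is the limiting dimension.
        return max_wide, max(1, high * max_wide // wide)
    # Height is the limiting dimension (or the shapes match exactly).
    return max(1, wide * max_high // high), max_high


print(thumb_path("19990523.07"))    # thumb/19990523.07.jpg
print(fitted_size(4000, 3000))      # (200, 150)
print(fitted_size(3000, 4000))      # (150, 200)
```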

Finally, assuming that we can, we write the thumbnail image. This could fail, but it falls under the blanket precondition that we have write access to the thumbnail directory. Assuming all that works, we can then return the Imagex result to the caller.

archx.py
    #-- 6 --
    # [ if thumbPath names a file that can be created new ->
    #     that file  :=  result.image with its type determined
    #                    by thumbPath's extension
    #   else -> raise IOError ]
    result.image.save ( thumbPath )

    #-- 7 --
    return result

4.7. writeIndex(): Output the index for one archive directory

This function writes an XML file conforming to the archx.rnc schema (see Section 3, “The archx.rnc schema”).

archx.py
# - - -   w r i t e I n d e x   - - -

def writeIndex ( archDir, imagexList ):
    """Generate the XML index file.

      [ (archDir is an archive directory name as a string) and
        (imagexList is a list of Imagex objects) ->
          index directory  :=  index directory with an
              (archDir+".xml") file added representing
              imagexList ]
    """

The XML generation technique uses the xmlcreate.py module; for more information, see the importation of this module in Section 4.2, “Imports”.

We start by creating the document node. No <!DOCTYPE ...> will be attached to this XML file.

archx.py
    #-- 1 --
    # [ doc   :=  a new DOM Document object with root element
    #             of type RNC_ARCHIVE_INDEX_N ]
    doc  =  xc.Document ( RNC_ARCHIVE_INDEX_N )

Next we add child nodes to the root of this document, one for each element of imagexList. The actual generation of these child nodes is done by the .writeNode() method of the Imagex object; see Section 4.10, “Imagex.writeNode(): Translate self to XML”.

archx.py
    #-- 2 --
    # [ imagexList is a list of Imagex objects ->
    #     doc.root  :=  doc.root with nodes added representing
    #                   those objects ]
    for  imagex in imagexList:
        imagex.writeNode ( doc.root )

Finally, we write the resulting XML file to the index directory. The name of the index file is INDEX_DIR + archDir + ".xml".

archx.py
    #-- 3 --
    # [ index directory  :=  index directory with an
    #      (archDir+".xml") file added representing doc ]
    fileName  =  "%s%s.xml" % (INDEX_DIR, archDir)
    try:
        indexFile  =  open ( fileName, "w" )
    except IOError as detail:
        print ( "*** Can't open index file '%s' for writing." %
                fileName )
        return
    doc.write ( indexFile )
    indexFile.close()
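
As a variation on this step, the following Python 3 sketch writes a file with a with statement, so the file is closed even if the write fails. The function write_index_text() and the placeholder content are invented for illustration; the script itself hands the open file to doc.write().

```python
import os
import tempfile


def write_index_text(index_dir, arch_dir, text):
    """Write text to <index_dir>/<arch_dir>.xml.

    Returns the file name on success, or None after printing a
    message if the file cannot be written.
    """
    file_name = os.path.join(index_dir, arch_dir + ".xml")
    try:
        with open(file_name, "w") as index_file:
            index_file.write(text)
    except IOError:
        print("*** Can't write index file '%s'." % file_name)
        return None
    return file_name


with tempfile.TemporaryDirectory() as index_dir:
    good = write_index_text(index_dir, "bird-001", "<placeholder/>")
    wrote_ok = good is not None and os.path.exists(good)
    # A directory that does not exist triggers the error branch.
    bad = write_index_text(os.path.join(index_dir, "missing"), "bird-001", "")

print(wrote_ok, bad)
```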

4.8. class Imagex: An object to describe one image

Each instance of this class holds the information about one image that we have indexed. The class knows how to add that information to a DOM tree for output as XML.

archx.py
# - - - - -   c l a s s   I m a g e x   - - - - -

class Imagex:
    """Represents information about one bird image.

      Exports:
        Imagex ( pathName ):
          [ (pathName is a string) ->
              if pathName names a readable, valid image file ->
                return a new Imagex object representing the image
              else ->
                raise IOError ]
        .pathName:   [ as passed to constructor, read-only ]
        .baseName:
           [ self.pathName, stripped of its directory part and
             extension ]
        .image:      [ the image as an Image.Image object ]
        .wide:       [ width in pixels as an integer ]
        .high:       [ height in pixels as an integer ]
        .writeNode ( parent ):
          [ parent is an xmlcreate.Element ->
              parent  :=  parent with a new RNC_IMAGE_N node added
                  representing self ]
    """

4.9. Imagex.__init__(): Constructor

Given the path name of an image file, we need to find the image size. Here's the constructor interface:

archx.py
# - - -   I m a g e x . _ _ i n i t _ _   - - -

    def __init__ ( self, pathName ):
        """Constructor for Imagex.
        """
        #-- 1 --
        # [ self.pathName  =  pathName
        #   self.baseName  =  pathName, stripped of its directory
        #                     part and extension ]
        self.pathName  =  pathName
        discard, fileName  =  os.path.split ( pathName )
        self.baseName, discard  =  os.path.splitext ( fileName )

The Image module from the Python Imaging Library does all the heavy lifting for us here. For documentation on this module, see Python Imaging Library (PIL).

This module's Image.open() function raises an IOError exception in two different cases: if the file is inaccessible or nonexistent, and if the file does not represent one of the image formats supported by the Image module. In either case, we pass the exception back to our caller. If the file is readable and valid, we get back an Image object.

archx.py
        #-- 2 --
        # [ if pathName names a readable, valid image file ->
        #     self.image  :=  an Image object representing that image
        #   else ->
        #     raise IOError ]
        self.image  =  Image.open ( pathName )

The .size attribute of this object is a 2-tuple (width, height).

archx.py
        #-- 3 --
        # [ self.image.size gives (width,height) in pixels ->
        #     self.wide  :=  that width in pixels
        #     self.high  :=  that height in pixels ]
        self.wide, self.high  =  self.image.size

4.10. Imagex.writeNode(): Translate self to XML

This method adds a representation of itself as an <image> element to the DOM tree.

archx.py
# - - -   I m a g e x . w r i t e N o d e   - - -

    def writeNode ( self, parent ):
        """Write an RNC_IMAGE_N node representing self.

          [ parent is an xmlcreate.Element object ->
              parent  :=  parent with a new RNC_IMAGE_N node added
                  representing self ]
        """

The xc.Element constructor accepts as an optional third argument a dictionary of attribute names and values. We first build up that dictionary, then xc.Element takes care of building the element and its attributes, and attaching it to the parent.

archx.py
        attrs  =  { RNC_CAT_NO_A: self.baseName,
                    RNC_WIDE_A:   str ( self.wide ),
                    RNC_HIGH_A:   str ( self.high ) }
        child  =  xc.Element ( parent, RNC_IMAGE_N, attrs )
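
For readers without the author's XML module, the same idea can be sketched with the standard library's xml.etree.ElementTree, which we substitute here purely for illustration. Only the names image and cat-no come from the text above; the other element and attribute names below are guesses, as are the sample dimensions.

```python
import xml.etree.ElementTree as ET

RNC_ARCHIVE_INDEX_N = "archive-index"   # assumed value
RNC_IMAGE_N = "image"
RNC_CAT_NO_A = "cat-no"
RNC_WIDE_A = "wide"                     # assumed value
RNC_HIGH_A = "high"                     # assumed value


def write_node(parent, base_name, wide, high):
    """Attach a new image element describing one image to parent."""
    attrs = {RNC_CAT_NO_A: base_name,
             RNC_WIDE_A: str(wide),
             RNC_HIGH_A: str(high)}
    # SubElement takes the attribute dictionary as its third argument.
    return ET.SubElement(parent, RNC_IMAGE_N, attrs)


root = ET.Element(RNC_ARCHIVE_INDEX_N)
write_node(root, "19990523.07", 3000, 2000)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Attribute values must be strings, which is why the width and height are passed through str() in both this sketch and the script.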

4.11. Epilogue

The last lines of the script execute the main() function, assuming that the script is being executed (not imported).

archx.py
#================================================================
# Epilogue
#----------------------------------------------------------------

if  __name__ == "__main__":
    main()