3.4. Overall procedure

Here is an outline of the overall program flow.

  1. Read the ranks file and use it to build an instance of the Hier class.

  2. Create an empty Txny instance to hold the taxonomic tree. Create an empty AbTab instance to hold the symbol table for form codes.

  3. Read the .std file. Each line in this file represents at least one taxon. If the rank of that taxon is not found in the Hier instance, we can ignore the line.

    Although no line represents a genus directly, the first species line in each genus also causes creation of a genus-level taxon in the taxonomic tree, so such a line effectively defines two taxa.

    • If the tree is empty, this must be the first line of the .std file; add its taxon as the new root of the taxonomic tree.

    • If the tree is not empty, find the parent taxon of the current line. To do this, traverse the tree from the root, always choosing the last child of each taxon, until we reach a taxon that has no children, or whose children are at the same taxonomic level as the new taxon to be added.

      Having found the appropriate parent taxon, add the new taxon (or taxa, in the case of a new genus) as that parent's new last child.

    • If the current line of the .std file defines a species, add its form code to the symbol table. The form code is derived from the English name by applying the standard rules, unless an explicit disambiguation code follows the scientific name field on the line. The symbol table logic will signal an error if this code duplicates one already added.
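    The parent search in step 3 can be sketched as follows. This is a minimal illustration, not the program's actual code: the Taxon class, the integer depth attribute (0 for the root rank, one more for each lower rank), and the function names are all assumptions made for the example, and the sample taxa are only placeholders.

```python
class Taxon:
    """Hypothetical tree node, not the program's actual Txny classes."""

    def __init__(self, name, depth):
        self.name = name
        self.depth = depth      # assumed: 0 for the root rank, +1 per lower rank
        self.children = []


def find_parent(root, new_depth):
    """Descend the last-child path to the parent for a taxon at new_depth."""
    node = root
    # Stop at a taxon with no children, or whose children already sit
    # at the same level as the taxon being added.
    while node.children and node.children[-1].depth != new_depth:
        node = node.children[-1]
    return node


def add_taxon(root, name, depth):
    """Add a taxon in file order; return the (possibly new) root."""
    taxon = Taxon(name, depth)
    if root is None:            # empty tree: the first line becomes the root
        return taxon
    find_parent(root, depth).children.append(taxon)
    return root


root = None
for name, depth in [("Aves", 0), ("Gaviiformes", 1), ("Gaviidae", 2),
                    ("Podicipediformes", 1), ("Podicipedidae", 2)]:
    root = add_taxon(root, name, depth)
```

    The loop is a literal transcription of the stopping rule above. It looks only at the last child of each taxon, which suffices as long as taxa are added in the order they appear in the file.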

  4. Read the .alt file. Each line defines one form code, which is added to the symbol table after various error checks.

  5. Check every entry in the symbol table for validity. Every referenced symbol must be defined. There must be no cycles in the “references” relation; that is, no cases where, for example, code A is the equivalent of code B which is the equivalent of code C which is the equivalent of code A.
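    One way to carry out the two checks in step 5 is a depth-first search that marks each code as unvisited, in progress, or finished. The sketch below is only an illustration under assumed names: the symbol table is modeled as a plain dictionary mapping each code to the list of codes it references, which is not the AbTab interface.

```python
def check_references(refs):
    """Return a list of error messages for a code -> [referenced codes] dict."""
    errors = []

    # Check 1: every referenced code must itself be defined.
    for code in sorted(refs):
        for target in refs[code]:
            if target not in refs:
                errors.append(f"{code}: undefined reference {target}")

    # Check 2: no cycles. A code reached again while still in progress
    # lies on a cycle.
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {code: UNVISITED for code in refs}

    def visit(code, path):
        state[code] = IN_PROGRESS
        path.append(code)
        for target in refs[code]:
            if target not in refs:
                continue                    # already reported above
            if state[target] == IN_PROGRESS:
                cycle = path[path.index(target):] + [target]
                errors.append("cycle: " + " -> ".join(cycle))
            elif state[target] == UNVISITED:
                visit(target, path)
        path.pop()
        state[code] = DONE

    for code in sorted(refs):
        if state[code] == UNVISITED:
            visit(code, [])
    return errors


cyclic = check_references({"A": ["B"], "B": ["C"], "C": ["A"], "D": []})
missing = check_references({"A": ["Z"]})
clean = check_references({"A": ["B"], "B": []})
```

    For the A, B, C example in the text, the search reports the single cycle A -> B -> C -> A. The recursion is fine for a sketch; a very long reference chain would call for an explicit stack.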

  6. Write the four product files. The .tre file is produced by a depth-first preorder traversal of the taxonomic tree: the root taxon first, then its first child, then its first child's first child, and so forth.

    A single pass through the symbol table is sufficient to produce both the .ab6 and .col files. We visit the entries in ascending order by form code, writing a line to the .ab6 file if it is a valid code or writing a line to the .col file if it is a collision form.

    The XML output file is generated by similar techniques. The four major sections of this file correspond to the Hier instance, the taxonomic tree, the valid entries in the symbol table, and the collision entries in the symbol table. For the XML generation technology, refer to Python XML processing with lxml.
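    The traversal and the single pass described in step 6 can be sketched as below. The Node class, the boolean collision flag, and the one-code-per-line output are assumptions made for illustration only; the real .tre, .ab6, and .col line formats are not reproduced here, and the codes shown are placeholders.

```python
import io


class Node:
    """Hypothetical taxon node with an ordered list of children."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def preorder(node):
    """Yield a node first, then each child's subtree in order."""
    yield node
    for child in node.children:
        yield from preorder(child)


def write_code_files(table, ab6_file, col_file):
    """One pass in ascending code order; table maps code -> is_collision."""
    for code in sorted(table):
        out = col_file if table[code] else ab6_file
        out.write(code + "\n")       # placeholder line format


tree = Node("Aves", [Node("Gaviiformes", [Node("Gaviidae")]),
                     Node("Podicipediformes")])
names = [taxon.name for taxon in preorder(tree)]

ab6, col = io.StringIO(), io.StringIO()
write_code_files({"CODE02": True, "CODE01": False, "CODE03": False}, ab6, col)
```

    Here names lists each taxon ahead of its descendants, and each code lands in exactly one of the two output streams.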