
A system for representing taxonomic nomenclature

John W. Shipman,
Zoological Data Processing
507 Fitch Avenue NW
Socorro, NM 87801
(505) 835-0235


1. Introduction

This document describes a system for representing bird phylogenies, that is, taxonomic arrangements of bird types, as computer files.

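A taxonomic arrangement is represented as a set of three text files:

  1. The ranks file, which specifies the taxonomic ranks of interest.
  2. The standard forms (.std) file, which enumerates the taxa of the standard arrangement.
  3. The alternate forms (.alt) file, which defines the coded forms that are not standard species names.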

The next sections describe the format of these raw files. A final section describes a program to combine these files into a set of product files representing the entire taxonomy and all the codes.

2. Preparing the ranks file

The AOU Check-List defines many more taxonomic ranks than most applications need, so the ranks file lets the application specify which ranks are of interest. To prepare this file, use a text editor to enumerate the ranks in descending order, starting with the rank of the root taxon of the arrangement.

Each line defines one rank. Enter these items in order:

Here is a sample ranks file; the "_" character represents a space.


The numbers in this example allow for up to 99 orders per class, 99 families per order, 9 subfamilies per family, and so on.

The last three lines of this file use codes that do not actually appear in the standard forms file:

Applications that do not wish to track subspecific forms should use a version of the ranks file that does not contain the x line. Omitting the g and s lines is not recommended.

The order is important---always order the ranks from largest to smallest, as in the example above. The program doesn't know anything about taxonomic traditions. If you would like to create your own new ranks, like Infrasupertribes, go right ahead.

3. Preparing the standard forms (.std) file

The first input file you must prepare is the standard forms file. This file enumerates all the taxa defined in your preferred standard arrangement. Give this file a name of the form f.std where f is some name suggesting the name of the authority. For example, a file containing names from the AOU Check-List, 6th ed., including all supplements through the 40th, might be named aou640.std.

Place each taxon on a separate line in the standard forms file. The taxa appear in the order in which they are presented in a checklist. The highest taxon appears first, followed by the first contained taxon, and so on down to the first species. The remaining species in that genus follow; then come the other genera in the family, and so on.

Some taxa are defined implicitly. In particular, there is no separate line for genera, since species are identified by binomials---genera are declared implicitly by their first use in a binomial.
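
As a rough illustration of implicit declaration, the following Python sketch collects genus names from a list of scientific names in the order of their first use. The function name implicit_genera is hypothetical and is not part of the system described here; it assumes the genus is always the first word of the name.

```python
def implicit_genera(scientific_names):
    """Return genus names in order of first appearance.

    Assumes each name starts with the genus, as in a binomial
    (optionally followed by a parenthesized subgenus).
    """
    genera = []
    for name in scientific_names:
        genus = name.split()[0]
        if genus not in genera:
            genera.append(genus)
    return genera


names = [
    "Anas strepera",
    "Anas penelope",
    "Cygnus (Olor) buccinator",
    "Cygnus (Cygnus) olor",
]
print(implicit_genera(names))
```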

There are two types of record in the standard forms file:

Records in the standard forms file start with three fixed columns, with the remainder of the record in a variable-length format:

  1. The first two columns are the code for the taxonomic rank. Any one- or two-character code may appear here, but one-letter codes must be left-justified and padded with a space (_). Here are the codes used for one representation of the AOU Check-List:
            c_   Class
            -c   Subclass
            +o   Superorder
            o_   Order
            -o   Suborder
            +f   Superfamily
            f_   Family
            -f   Subfamily
            t_   Tribe
            __   Species
  2. The third column defines the status of the bird. This column is normally blank, but can contain a status code---a question mark (?) to indicate that the species is of questionable occurrence in this checklist, or a plus sign (+) for species that are extinct.

The exact structure of the ``tail'' of the record (that is, the variable-length part that follows the first three columns) depends on whether the record describes a higher taxon or a species.

3.1 The higher-taxon record

Place each higher taxon on a separate line, following these steps:

  1. Type the two-letter rank code, as defined in the ranks file (see the section above). If the code has only one letter, enter the letter followed by one space.
  2. In the third column, enter the status code. This is usually one space, but encode it as ? for dubious taxa or + for extinct taxa.
  3. Enter the scientific name of the taxon.
  4. Type one slash (/), followed by the English name of the taxon.

Here are some examples of higher-taxon records:

        c  Aves/Birds
        -c Neornithes/True Birds
        +o Neognathae/Typical Birds
        o  Gaviiformes/Loons
        f  Gaviidae/Loons
        -o Pelecani/Boobies, Pelicans, Cormorants and Darters
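
The steps above can be sketched as a small Python parser. This is only an illustration under stated assumptions: the names HigherTaxon and parse_higher_taxon are hypothetical, the line is assumed to start in column one (without the indentation used in the examples above), and a blank status is returned as an empty string.

```python
from typing import NamedTuple


class HigherTaxon(NamedTuple):
    rank_code: str   # e.g. "c", "-c", "+o"
    status: str      # "", "?" or "+"
    scientific: str
    english: str


def parse_higher_taxon(line: str) -> HigherTaxon:
    """Split a higher-taxon record into its four fields."""
    line = line.rstrip("\n")
    rank_code = line[0:2].rstrip()   # one-letter codes are space-padded
    status = line[2:3].strip()       # third column; usually blank
    scientific, _, english = line[3:].partition("/")
    return HigherTaxon(rank_code, status, scientific.strip(), english.strip())


print(parse_higher_taxon("c  Aves/Birds"))
print(parse_higher_taxon("-o Pelecani/Boobies, Pelicans, Cormorants and Darters"))
```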

3.2 The species record

For each species record, enter these fields on one line:

  1. Place two spaces in the first two columns.
  2. Enter the one-character status code in the third column. This is normally blank, but may be ? for dubious or + for extinct.
  3. Enter the scientific name. The taxon is generally a binomial, but it may include a subgenus as it is customarily represented: in parentheses, between the genus and species names.
  4. Enter one slash (/), followed by the English name. You may enter multi-word names either in the conventional order (e.g., ``Wood Duck''), or with the generic part first, followed by a comma and the specific part (e.g., ``Duck, Wood'').
  5. In most cases, you're done. However, if this species is involved in a collision---that is, if it is one of a group of two or more names that abbreviate to the same code according to the rules---enter another slash (/) followed by the disambiguation, that is, the substitute code for this species.

Here are some examples of species lines. The last two show the disambiguation of the collision for code BLAWAR.

___Anas strepera/Gadwall
___Anas penelope/Wigeon, Eurasian
___Haliaeetus pelagicus/Sea-Eagle, Steller's
__+Camptorhynchus labradorius/Labrador Duck
__?Aerodramus vanikorensis/Gray Swiftlet
___Cygnus (Olor) buccinator/Trumpeter Swan
___Cygnus (Cygnus) olor/Swan, Mute
___Dendroica fusca/Warbler, Blackburnian/BKBWAR
___Dendroica striata/Blackpoll Warbler/BKPWAR
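
A species record of this shape could be read with a sketch like the following. The names Species, normalize_english, and parse_species are hypothetical. The sketch assumes real spaces in the first columns (not the "_" placeholders shown above) and assumes the English name contains no slash.

```python
import re
from typing import NamedTuple, Optional


class Species(NamedTuple):
    status: str               # "", "?" or "+"
    genus: str
    subgenus: Optional[str]
    species: str
    english: str              # in conventional order
    code: Optional[str]       # substitute code, if any


# Genus, optional "(Subgenus)", then the species epithet.
_SCI = re.compile(r"^(\S+)(?:\s+\((\S+)\))?\s+(\S+)$")


def normalize_english(name: str) -> str:
    """Turn 'Duck, Wood' into 'Wood Duck'; leave 'Wood Duck' alone."""
    generic, comma, specific = name.partition(",")
    if comma:
        return f"{specific.strip()} {generic.strip()}"
    return name.strip()


def parse_species(line: str) -> Species:
    line = line.rstrip("\n")
    if line[0:2] != "  ":
        raise ValueError("rank columns are not blank: not a species record")
    status = line[2:3].strip()
    fields = line[3:].split("/")
    if len(fields) not in (2, 3):
        raise ValueError("expected scientific/English[/code]")
    match = _SCI.match(fields[0].strip())
    if match is None:
        raise ValueError("unrecognized scientific name: " + fields[0])
    genus, subgenus, species = match.groups()
    code = fields[2].strip() if len(fields) == 3 else None
    return Species(status, genus, subgenus, species,
                   normalize_english(fields[1]), code)


print(parse_species("   Dendroica fusca/Warbler, Blackburnian/BKBWAR"))
```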

4. Preparing the alternate forms (.alt) file

Because field records do not always use the latest names, and because the reported forms are not always standard species, you must prepare an ``alternate forms'' file enumerating all the forms that have a six-letter code but which are not standard species names.

You must prepare an .alt file for each .std file, reflecting the exact lumps, splits, and names of the standard arrangement. The file must be named f.alt, where f is the same prefix as that of the .std file.

For example, if the standard file for the AOU Check-List, 6th ed., including supplements through the 40th, is called aou640.std, the corresponding alternate names file must be called aou640.alt.

In the .alt file you will place several different types of records. Each line starts with the six-letter code being defined, followed by a record type code, and a variable length tail.

4.1 Higher taxon record

For each form above species rank in the hierarchy, enter a line of this format:

  1. Enter the six-letter code. If the code is shorter than six letters (e.g., HAWK), right-pad it to length with spaces.
  2. Enter one space. This signifies that the record is for a higher taxon.
  3. Enter the scientific name of the higher taxon to which this code is referred. This name must be defined in the .std file.
  4. Enter one slash (/), then the English name.
  5. In most cases, you are done. However, if the English name requires some markup to be represented correctly in typeset output, enter another slash, followed by the English name formatted according to the TeX typesetting system.

In the optional TeX name field, two TeX macros are used:

Here are some complete examples of higher-taxon records.

    albatr Diomedeidae/albatross sp.
    accipi Accipiter/Accipiter sp./\sp{Accipiter}
    laracc Accipiter/large Accipiter sp./large \itc{Accipiter}\ sp.
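
As an illustration only, a higher-taxon record in the .alt file might be split as follows. The function name parse_alt_higher is hypothetical, and the line is assumed to start in column one. The optional TeX field is returned unchanged, since it may itself contain meaningful spacing.

```python
def parse_alt_higher(line):
    """Return (code, scientific, english, tex) for a higher-taxon record.

    tex is None when the optional third field is absent.
    """
    line = line.rstrip("\n")
    code = line[0:6].rstrip()        # short codes are right-padded
    if line[6:7] != " ":
        raise ValueError("column 7 is not a space: not a higher-taxon record")
    fields = line[7:].split("/", 2)  # at most three fields
    if len(fields) < 2:
        raise ValueError("expected scientific/English[/TeX]")
    scientific = fields[0].strip()
    english = fields[1].strip()
    tex = fields[2] if len(fields) == 3 else None
    return code, scientific, english, tex


print(parse_alt_higher("albatr Diomedeidae/albatross sp."))
print(parse_alt_higher(r"accipi Accipiter/Accipiter sp./\sp{Accipiter}"))
```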

4.2 Direct equivalent record

For each non-standard code that is the exact equivalent of a standard code, create a record in the alternate forms file with this format:

  1. Enter the non-standard code, left-justified in the first six columns.
  2. Enter an equal sign (=) in the seventh column. This is the record type code for an exact equivalent.
  3. Enter the standard six-letter code for the new name, left-justified in the next six columns.
  4. Enter one space, followed by the English name (for annotation purposes).

Examples of direct-equivalent records:

    amboys=blkoys Oystercatcher, American Black
    amewid=amewig Widgeon, American
    watpip=amepip Pipit, Water

Note: the form after the equal sign must be defined elsewhere in the standard or alternate forms file.
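
The fixed columns of a direct-equivalent record, and the check described in the note, can be sketched like this. Both function names are hypothetical, and the set of defined codes is supplied by the caller; this is not how nombuild itself is known to work.

```python
def parse_equivalent(line):
    """Return (nonstandard_code, standard_code, english) for an '=' record."""
    line = line.rstrip("\n")
    if line[6:7] != "=":
        raise ValueError("column 7 is not '=': not a direct-equivalent record")
    old = line[0:6].rstrip()     # columns 1-6
    new = line[7:13].rstrip()    # columns 8-13
    english = line[14:].strip()  # after the separating space
    return old, new, english


def undefined_targets(records, defined_codes):
    """List target codes that are not defined anywhere else."""
    return [new for _, new, _ in records if new not in defined_codes]


records = [
    parse_equivalent("amboys=blkoys Oystercatcher, American Black"),
    parse_equivalent("amewid=amewig Widgeon, American"),
]
# Suppose only blkoys is known to be defined so far.
print(undefined_targets(records, {"blkoys"}))
```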

4.3 Subspecific forms record

There are several reasons for assigning codes to forms that are a subset of a standard species:

So we use the term ``subspecific form'' loosely, to mean any identifiable form that refers to some subset of a standard species. For each such code, enter a line with this format:

  1. The six-letter code being defined.
  2. A less-than (<) symbol. This is the record type code for a subspecific form record.
  3. The six-letter code of the standard species that contains this form.
  4. One space, followed by the English name of this form.
  5. In most cases you are done. However, if the English name needs TeX markup to appear correctly in typeset output, append a slash, followed by the TeX-encoded English name.

Examples of subspecific form lines:

    agpchi<grpchi Attwater's Greater Prairie-Chicken
    agwtea<gnwtea Teal, American Green-winged
    alcgoo<cangoo (Aleutian) Canada Goose
    axetea<gnwtea teal, (American x European) Green-winged
    blugoo<snogoo Blue Goose
    branth<brant  Brant (hrota)/Brant (\itc{hrota})
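
A subspecific form record has the same fixed columns, with an optional TeX field after a slash. The sketch below uses the hypothetical name parse_subspecific and assumes the line starts in column one and that the plain English name contains no slash.

```python
def parse_subspecific(line):
    """Return (code, parent_code, english, tex) for a '<' record.

    tex is None when no TeX-encoded name is given.
    """
    line = line.rstrip("\n")
    if line[6:7] != "<":
        raise ValueError("column 7 is not '<': not a subspecific form record")
    code = line[0:6].rstrip()
    parent = line[7:13].rstrip()   # short codes such as "brant" are padded
    english, slash, tex = line[14:].partition("/")
    return code, parent, english.strip(), (tex if slash else None)


print(parse_subspecific("blugoo<snogoo Blue Goose"))
print(parse_subspecific(r"branth<brant  Brant (hrota)/Brant (\itc{hrota})"))
```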

4.4 Collision record

In order to record all the known collisions---that is, cases where two or more names encode to the same six-letter abbreviation according to the rules for abbreviation formation---you must add to the alternate forms file one line for each collision. Each such line enumerates all the disambiguations, that is, the substitute form codes that are preferred:

  1. Enter the collision code in the first six columns.
  2. Enter a question mark (?) in the seventh column.
  3. Type all the disambiguations separated by colon (:) characters.

Examples of collision records:

    columb?colba :colbid:colbin

The first example shows that two names collide for the code barowl. The forms are Barred Owl (which is given the substitute code brdowl in the standard forms file) and Barn Owl, which is an obsolete name that is equivalent to cobowl, the code for Common Barn-Owl. The last example shows a three-way collision for code columb between the codes for Columba, family Columbidae, and subfamily Columbinae. Note that a collision record may refer to forms other than standard taxa.

The substitute codes referred to may be defined elsewhere in the .alt file, or defined implicitly in the .std file.
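
For illustration, a collision record can be split into the collision code and its list of substitute codes. The function name parse_collision is hypothetical; the sketch strips the padding from short substitute codes such as "colba ".

```python
def parse_collision(line):
    """Return (collision_code, [substitute codes]) for a '?' record."""
    line = line.rstrip("\n")
    if line[6:7] != "?":
        raise ValueError("column 7 is not '?': not a collision record")
    code = line[0:6].rstrip()
    substitutes = [s.strip() for s in line[7:].split(":")]
    return code, substitutes


print(parse_collision("columb?colba :colbid:colbin"))
```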

5. Building the standard product files

Once you have prepared all the input files, you can compile them into a set of standard product files. These product files are all ``flat files'' that give the same information in a form more immediately usable in database applications.

5.1 The nombuild program

The nombuild program checks the various input files and compiles them into a set of standard product files (described below).

To run this program, change to the directory containing all the input files and type the command:

If there are any problems with the input files, the program writes error messages to the standard output stream and duplicates them in the file nombuild.log.

If there are no problems, all the product files will be written. These files are:

  1. The tree file defines all the taxa in the standard forms file plus all subspecific taxa from the alternate forms file. Its name is the same as the input file, except it has extension .tre. For example, if the input files are aou640.std and aou640.alt, the tree file will be called aou640.tre.
  2. The abbreviations file defines all the six-letter bird codes. This file has extension .ab6.
  3. The collisions file describes every six-letter bird code that is invalid because two or more names would all abbreviate to that code. Its extension is .col.
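
The naming convention for the product files can be expressed in a few lines of Python. This is a sketch of the convention only, not of nombuild; the function name product_files and the dictionary keys are hypothetical.

```python
from pathlib import Path


def product_files(std_path):
    """Map each related file to its path, given the .std file's path."""
    std = Path(std_path)
    return {
        "alternate": std.with_suffix(".alt"),      # input, same prefix
        "tree": std.with_suffix(".tre"),
        "abbreviations": std.with_suffix(".ab6"),
        "collisions": std.with_suffix(".col"),
    }


for role, path in product_files("aou640.std").items():
    print(role, path)
```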

5.2 The tree (.tre) file

The tree file defines all the different scientific names used in the input. Here is the format of that file:

The taxonomic key number can be used to sort records into taxonomic order. It contains one or more digits for each rank (except for the root rank). The number of digits for each rank is determined by the third column in the ranks file.

For example, if your ranks file looks like the example given above (2-digit order, 2-digit family, 1-digit subfamily, 2-digit genus, 2-digit species, and 2-digit form), each taxonomic key number would have these components:

For example, code daejun (Dark-eyed Junco) might have a taxonomic key number of 21 24 3 47 01 00 (the spaces here are for clarity; they are not actually present in the record). This key would mean that this form is in the 21st order, the 24th family within that order, the 3rd subfamily within that family, and the 47th genus within that subfamily, and that it is the first species within that genus.

Other forms that are included within Dark-eyed Junco will have keys 21 24 3 47 01 01, 21 24 3 47 01 02, and so on. Examples of such forms include races such as Gray-headed Junco, hybrids among the different races (e.g., ``Gray-headed x Slate-colored Junco''), and obsolete names (``Northern Junco'').
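
To make the layout concrete, the sketch below splits a key into its per-rank numbers. It assumes the digit widths of the example (2, 2, 1, 2, 2, 2) and assumes keys are stored as zero-padded digit strings without spaces; the names WIDTHS and split_key are hypothetical.

```python
# Digits per rank: order, family, subfamily, genus, species, form.
WIDTHS = (2, 2, 1, 2, 2, 2)


def split_key(key, widths=WIDTHS):
    """Split a taxonomic key number into one integer per rank."""
    if len(key) != sum(widths) or not key.isdigit():
        raise ValueError("key does not match the expected widths")
    parts = []
    pos = 0
    for width in widths:
        parts.append(int(key[pos:pos + width]))
        pos += width
    return parts


print(split_key("21243470100"))  # Dark-eyed Junco in the example above

# Because every key has the same length and is zero-padded, a plain
# string sort puts keys in taxonomic order.
print(sorted(["21243470101", "21243470100", "21000000000"]))
```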

Note that the taxonomic key number can be used to deduce relationships between form codes. For example, to find out what genus a species is in, just construct a key number that is the same as the species' key number, but with its species number set to 00. Continuing the example above, suppose Gray-headed Junco has this key number:

    21 24 3 47 01 01

Then we can deduce all the higher ranks by substituting zeroes in the appropriate fields:

    21 24 3 47 01 00     is the containing species, Junco hyemalis
    21 24 3 47 00 00     is the containing genus, Junco
    21 24 3 00 00 00     is the containing subfamily, Emberizinae
    21 24 0 00 00 00     is the containing family, Emberizidae
    21 00 0 00 00 00     is the containing order, Passeriformes
    00 00 0 00 00 00     is the containing class, Aves
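
The zero-substitution procedure can be sketched in Python as follows. The sketch assumes the same digit widths as the example (2, 2, 1, 2, 2, 2) and keys stored as digit strings without spaces. The name ancestor_keys is hypothetical, and skipping fields that are already zero is an assumption about how unused ranks are keyed.

```python
# Digits per rank: order, family, subfamily, genus, species, form.
WIDTHS = (2, 2, 1, 2, 2, 2)


def ancestor_keys(key, widths=WIDTHS):
    """Return keys of the containing taxa, nearest first.

    Works from the last field toward the first, zeroing each nonzero
    field together with everything to its right.
    """
    if len(key) != sum(widths):
        raise ValueError("key does not match the expected widths")
    starts = []
    pos = 0
    for width in widths:
        starts.append(pos)
        pos += width
    ancestors = []
    current = key
    for start, width in reversed(list(zip(starts, widths))):
        if int(current[start:start + width]) == 0:
            continue  # this rank is already unset; nothing to strip
        current = current[:start] + "0" * (len(key) - start)
        ancestors.append(current)
    return ancestors


for k in ancestor_keys("21243470101"):  # Gray-headed Junco in the example
    print(k)
```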

5.3 The abbreviations (.ab6) file

The .ab6 file defines all the six-letter bird abbreviations. Each abbreviation is specified by its taxon field, which is a link to the tree file. Fields are:

Here are examples of lines from an .ab6 file:

    CACGOOBranta canadensis 2             Cackling Goose
    CALLINCarpodacus mexicanus            California Linnet

The first is for code CACGOO, derived from the name ``Cackling Goose,'' and it is the second subspecific form for Branta canadensis, the Canada Goose. The second line is for code CALLIN, derived from the name ``California Linnet,'' an alternate name for House Finch.

5.4 The collisions (.col) file

The .col file enumerates all the six-letter form codes that are involved in collisions. Each line has this format:

Here is an example showing three records from a .col file. These three lines document the collision between three names for code PASSER. The preferred substitute codes are PASINA (for Passerina), PASINE (for ``passerine''), and PASR (for Passer):


John W. Shipman,
Last updated: 1999/02/28 22:15:16