
## A system for representing taxonomic nomenclature

John W. Shipman, john@nmt.edu
Zoological Data Processing
507 Fitch Avenue NW
Socorro, NM 87801
(505) 835-0235
Homepage: http://www.nmt.edu/~shipman

### 1. Introduction

This document describes a system for representing bird phylogenies, that is, taxonomic arrangements of bird types, as computer files.

• This system is designed to provide a nomenclatural infrastructure for data involving North American birds. It was invented to handle Christmas Bird Count census data, but should apply generally to other kinds of North American bird records work.
• A six-letter code system is supported. These abbreviations are relatively static, so they can be used with reasonable safety in encoding field records, even records from many decades ago.
• A numeric key is also provided that will sort records into phylogenetic order. This numeric key is entirely dependent on the preferred systematic arrangement, and since such arrangements are never static, this key should not be used for encoding field records. It is provided only as a convenience in sorting, and should be discarded immediately after the sorting has been done.

A taxonomic arrangement is represented as a set of three text files:

• The ranks file describes the set of taxonomic ranks (or levels or aggregates) of interest to the applications programs. The AOU Check-List describes a number of taxa such as subclasses and superorders that are not usually of interest, so those aggregates can be omitted from the ranks file. If different applications have different needs, they can provide different versions of the ranks file. For example, some applications might wish to use subfamily rank, while other applications might not.
• The standard forms file describes all the standard taxa, that is, all those in the approved arrangement. A typical source for this file is the AOU Check-List.
• The alternate forms file enumerates all names and identifiable forms other than the standard taxa. Some are obsolete names (e.g., rather than "Cardinal," the preferred name is now "Northern Cardinal"), some are names for proper subsets of species (e.g., races or color morphs like Eurasian Green-winged Teal or Blue Goose), and some are names for larger aggregates (e.g., "hawk spp.").

The next sections describe the format of these raw files. A final section describes a program to combine these files into a set of product files representing the entire taxonomy and all the codes.

### 2. Preparing the ranks file

The AOU Check-List defines a lot more taxonomic ranks than most applications will care about, so the ranks file allows the application to specify which ranks are of interest. To prepare this file, use a text editor to enumerate the ranks in descending order, starting with the rank of the root taxon of the arrangement.

Each line defines one rank. Enter these items in order:

• A two-character code for the taxonomic rank, such as -f for Subfamily. If the code is only one character long (e.g., f for family), it should be placed in the first column, with a space in the second column.
• In the third column, put a space if the rank is mandatory, that is, if every lower taxon must be placed in such a rank. For ranks that are not always used (such as Subfamily in the AOU Check-List), enter a question mark (?) in this column to indicate that the rank is optional.
• In the fourth column, specify the number of digits to be allocated in the taxonomic key for this rank. This value should be 1 if there are never more than 9 of this taxon in the next higher one; 2 if there are no more than 99; and so on.
• On the remainder of the line, enter the name of the taxonomic rank, such as Genus.

Here is a sample ranks file; the "_" character represents a space.

c__1Class
o__2Order
f__2Family
-f?1Subfamily
g__2Genus
s__2Species
x__2Form

The numbers in this example allow for up to 99 orders per class, 99 families per order, 9 subfamilies per family, and so on.
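The fixed column layout described above lends itself to a position-based parser. The following is a minimal Python sketch, not part of the system described here: the names `Rank` and `parse_rank_line` are hypothetical, and the sample lines use real spaces where the listing above shows "_".

```python
from typing import NamedTuple


class Rank(NamedTuple):
    code: str       # one- or two-character rank code, e.g. "f" or "-f"
    optional: bool  # True when column 3 holds "?"
    digits: int     # key digits allocated to this rank (column 4)
    name: str       # rank name, e.g. "Family"


def parse_rank_line(line: str) -> Rank:
    """Split one ranks-file line into its four fields (hypothetical helper)."""
    line = line.rstrip("\n")
    code = line[0:2].strip()      # columns 1-2, one-letter codes are space-padded
    optional = line[2] == "?"     # column 3: space = mandatory, "?" = optional
    digits = int(line[3])         # column 4: a single digit
    name = line[4:].strip()       # remainder of the line
    return Rank(code, optional, digits, name)


SAMPLE = [
    "c  1Class",
    "o  2Order",
    "f  2Family",
    "-f?1Subfamily",
    "g  2Genus",
    "s  2Species",
    "x  2Form",
]

ranks = [parse_rank_line(line) for line in SAMPLE]
for rank in ranks:
    print(rank)
```

Keeping the parsed ranks in file order preserves the largest-to-smallest sequence that the rest of the system depends on.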

The last three lines of this file use codes that do not actually appear in the standard forms file:

• Code g is used for genus.
• Code s is for species.
• Code x is for identifiable forms that are proper subsets of species, such as races or morphs.

Applications that do not wish to track subspecific forms should use a version of the ranks file that does not contain the x line. Omitting the g and s lines is not recommended.

The order is important---always order the ranks from largest to smallest, as in the example above. The program doesn't know anything about taxonomic traditions. If you would like to create your own new ranks, like Infrasupertribes, go right ahead.

### 3. Preparing the standard forms (.std) file

The first input file you must prepare is the standard forms file. This file enumerates all the taxa defined in your preferred standard arrangement. Give this file a name of the form f.std where f is some name suggesting the name of the authority. For example, a file containing names from the AOU Check-List, 6th ed., including all supplements through the 40th, might be named aou640.std.

Place each taxon on a separate line in the standard forms file. The taxa appear in the order in which they are presented in a checklist. The highest taxon appears first, followed by the first contained taxon, and so on down to the first species. The remaining species in that genus follow; then come the other genera in the family, and so on.

Some taxa are defined implicitly. In particular, there is no separate line for genera, since species are identified by binomials---genera are declared implicitly by their first use in a binomial.

There are two types of record in the standard forms file:

• Each higher taxon record represents a taxon above the generic rank. All such records start with a nonblank character.
• Each species record represents a single species in the standard checklist. These records start with a blank.

Records in the standard forms file start with three fixed columns, with the remainder of the record in a variable-length format:

1. The first two columns are the code for the taxonomic rank. Any one- or two-character code may appear here, but one-letter codes must be left-justified and padded with a space (_). Here are the codes used for one representation of the AOU Check-List:

        c_   Class
        -c   Subclass
        +o   Superorder
        o_   Order
        -o   Suborder
        +f   Superfamily
        f_   Family
        -f   Subfamily
        t_   Tribe
        __   Species
2. The third column defines the status of the bird. This column is normally blank, but can contain a status code---a question mark (?) to indicate that the species is of questionable occurrence in this checklist, or a plus sign (+) for species that are extinct.

The exact structure of the "tail" of the record (that is, the variable-length part that follows the first three columns) depends on whether the record describes a higher taxon or a species.

#### 3.1 The higher-taxon record

Place each higher taxon on a separate line, following these steps:

1. Type the two-letter rank code, as defined in the ranks file (see the section above). If the code has only one letter, enter the letter followed by one space.
2. In the third column, enter the status code. This is usually one space, but encode it as ? for dubious taxa or + for extinct taxa.
3. Enter the scientific name of the taxon.
4. Type one slash (/), followed by the English name of the taxon.

Here are some examples of higher-taxon records:

    c  Aves/Birds
    -c Neornithes/True Birds
    +o Neognathae/Typical Birds
    o  Gaviiformes/Loons
    f  Gaviidae/Loons
    -o Pelecani/Boobies, Pelicans, Cormorants and Darters

#### 3.2 The species record

For each species record, enter these fields on one line:

1. Place two spaces in the first two columns.
2. Enter the one-character status code in the third column. This is normally blank, but may be ? for dubious or + for extinct.
3. Enter the scientific name. The taxon is generally a binomial, but it may include a subgenus as it is customarily represented: in parentheses, between the genus and species names.
4. Enter one slash (/), followed by the English name. You may enter multi-word names either in the conventional order (e.g., "Wood Duck"), or with the generic part first, followed by a comma and the specific part (e.g., "Duck, Wood").
5. In most cases, you're done. However, if this species is involved in a collision---that is, if it is one of a group of two or more names that abbreviate to the same code according to the rules---enter another slash (/) followed by the disambiguation, that is, the substitute code for this species.

Here are some examples of species lines. The last two show the disambiguation of the collision for code BLAWAR.

___Anas strepera/Gadwall
___Anas penelope/Wigeon, Eurasian
___Haliaeetus pelagicus/Sea-Eagle, Steller's
__?Aerodramus vanikorensis/Gray Swiftlet
___Cygnus (Olor) buccinator/Trumpeter Swan
___Cygnus (Cygnus) olor/Swan, Mute
___Dendroica fusca/Warbler, Blackburnian/BKBWAR
___Dendroica striata/Blackpoll Warbler/BKPWAR
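Both record types in the standard forms file share the same three fixed columns, so one routine can read either. The sketch below is an illustration in Python under that reading of the format; `StdRecord` and `parse_std_line` are hypothetical names, and the sample lines use real spaces where the listings show "_".

```python
from typing import NamedTuple, Optional


class StdRecord(NamedTuple):
    rank: str                   # rank code, or "" for a species record
    status: str                 # "", "?" (questionable) or "+" (extinct)
    scientific: str             # scientific name, possibly with a subgenus
    english: str                # English name as entered
    substitute: Optional[str]   # disambiguation code, if one was given


def parse_std_line(line: str) -> StdRecord:
    """Split one .std line into fixed columns and slash-separated tail."""
    line = line.rstrip("\n")
    rank = line[0:2].strip()     # two blanks here mean a species record
    status = line[2].strip()     # third column: blank, "?" or "+"
    fields = line[3:].split("/")
    substitute = fields[2] if len(fields) > 2 else None
    return StdRecord(rank, status, fields[0], fields[1], substitute)


higher = parse_std_line("-o Pelecani/Boobies, Pelicans, Cormorants and Darters")
doubtful = parse_std_line("  ?Aerodramus vanikorensis/Gray Swiftlet")
collided = parse_std_line("   Dendroica fusca/Warbler, Blackburnian/BKBWAR")

for record in (higher, doubtful, collided):
    print(record)
```

An empty `rank` distinguishes species records from higher-taxon records, mirroring the rule that species lines start with a blank.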

### 4. Preparing the alternate forms (.alt) file

Because field records do not always use the latest names, and because the reported forms are not always standard species, you must prepare an "alternate forms" file enumerating all the forms that have a six-letter code but which are not standard species names.

You must prepare an .alt file for each .std file, reflecting the exact lumps, splits, and names of the standard arrangement. The file must be named f.alt, where f is the same prefix as that of the .std file.

For example, if the standard file for the AOU Check-List, 6th ed., including supplements through the 40th, is called aou640.std, the corresponding alternate forms file must be called aou640.alt.

In the .alt file you will place several different types of records. Each line starts with the six-letter code being defined, followed by a record type code, and a variable length tail.

#### 4.1 Higher taxon record

For each form above species rank in the hierarchy, enter a line of this format:

1. Enter the six-letter code. If the code is shorter than six letters (e.g., HAWK), right-pad it to length with spaces.
2. Enter one space. This signifies that the record is for a higher taxon.
3. Enter the scientific name of the higher taxon to which this code is referred. This name must be defined in the .std file.
4. Enter one slash (/), then the English name.
5. In most cases, you are done. However, if the English name requires some markup to be represented correctly in typeset output, enter another slash, followed by the English name formatted according to the TeX typesetting system.

In the optional TeX name field, two TeX macros are used:

• The \sp macro takes one argument and formats it in italic followed by "sp." in Roman type. Here is the TeX definition of this macro:
      \def\sp#1{\itc{#1}\ sp.}%
• The \itc macro formats its argument in italics, followed by the italic correction (\/). Here is its definition:
      \def\itc#1{{\it #1\/}}%

Here are some complete examples of higher-taxon records.

    albatr Diomedeidae/albatross sp.
    accipi Accipiter/Accipiter sp./\sp{Accipiter}
    laracc Accipiter/large Accipiter sp./large \itc{Accipiter}\ sp.

#### 4.2 Direct equivalent record

For each non-standard code that is the exact equivalent of a standard code, create a record in the alternate forms file with this format:

1. Enter the non-standard code, left-justified in the first six columns.
2. Enter an equal sign (=) in the seventh column. This is the record type code for an exact equivalent.
3. Enter the standard six-letter code for the new name, left-justified in the next six columns.
4. Enter one space, followed by the English name (for annotation purposes).

Examples of direct-equivalent records:

    amboys=blkoys Oystercatcher, American Black
    amewid=amewig Widgeon, American
    watpip=amepip Pipit, Water

Note: the form after the equal sign must be defined elsewhere in the standard or alternate forms file.

#### 4.3 Subspecific forms record

There are several reasons for assigning codes to forms that are a subset of a standard species:

• Subspecies in the strict taxonomic sense, such as Myrtle Warbler (a subspecies of Yellow-rumped Warbler).
• Color morphs, such as Blue Goose (a morph of Snow Goose).
• Recognizable forms of uncertain taxonomic status, such as Pink-sided Junco (an identifiable form of Dark-eyed Junco).

So we use the term "subspecific form" loosely, to mean any identifiable form that refers to some subset of a standard species. For each such code, enter a line with this format:

1. The six-letter code being defined.
2. A less-than (<) symbol. This is the record type code for a subspecific form record.
3. The six-letter code of the standard species that contains this form.
4. One space, followed by the English name of this form.
5. In most cases you are done. However, if the English name needs TeX markup to appear correctly in typeset output, append a slash, followed by the TeX-encoded English name.

Examples of subspecific form lines:

    agpchi<grpchi Attwater's Greater Prairie-Chicken
    agwtea<gnwtea Teal, American Green-winged
    axetea<gnwtea teal, (American x European) Green-winged
    blugoo<snogoo Blue Goose
    branth<brant  Brant (hrota)/Brant (\itc{hrota})

#### 4.4 Collision record

In order to record all the known collisions---that is, cases where two or more names encode to the same six-letter abbreviation according to the rules for abbreviation formation---you must add to the alternate forms file one line for each collision. Each such line enumerates all the disambiguations, that is, the substitute form codes that are preferred:

1. Enter the collision code in the first six columns.
2. Enter a question mark (?) in the seventh column.
3. Type all the disambiguations separated by colon (:) characters.

Examples of collision records:

    barowl?brdowl:cobowl
    belspa?bldspa:bllspa
    columb?colba :colbid:colbin

The first example shows that two names collide for the code barowl. The forms are Barred Owl (which is given the substitute code brdowl in the standard forms file) and Barn Owl, which is an obsolete name that is equivalent to cobowl, the code for Common Barn-Owl. The last example shows a three-way collision for code columb between the codes for Columba, family Columbidae, and subfamily Columbinae. Note that a collision record may refer to forms other than standard taxa.

The substitute codes referred to may be defined elsewhere in the .alt file, or defined implicitly in the .std file.
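All four record types in the .alt file can be told apart by the character in the seventh column. The following Python sketch shows one way to dispatch on it; `parse_alt_line` and the dictionary keys are hypothetical names chosen for illustration, not anything defined by the system itself.

```python
def parse_alt_line(line: str) -> dict:
    """Classify one .alt line by the record type code in column 7."""
    line = line.rstrip("\n")
    code = line[0:6].strip()   # six-letter code, space-padded if shorter
    kind = line[6]             # " ", "=", "<" or "?"
    tail = line[7:]
    if kind == " ":            # higher taxon: scientific/English[/TeX]
        parts = tail.split("/", 2)   # at most 3 fields; TeX may contain "/"
        tex = parts[2] if len(parts) > 2 else None
        return {"type": "higher", "code": code, "scientific": parts[0],
                "english": parts[1], "tex": tex}
    if kind == "=":            # direct equivalent: target code, space, name
        return {"type": "equivalent", "code": code,
                "target": tail[0:6].strip(), "english": tail[7:]}
    if kind == "<":            # subspecific form: parent code, space, name[/TeX]
        english, _, tex = tail[7:].partition("/")
        return {"type": "subform", "code": code,
                "parent": tail[0:6].strip(), "english": english,
                "tex": tex or None}
    if kind == "?":            # collision: colon-separated substitute codes
        subs = [s.strip() for s in tail.split(":")]
        return {"type": "collision", "code": code, "substitutes": subs}
    raise ValueError(f"unknown record type {kind!r} in line {line!r}")


samples = [
    r"laracc Accipiter/large Accipiter sp./large \itc{Accipiter}\ sp.",
    r"amewid=amewig Widgeon, American",
    r"branth<brant  Brant (hrota)/Brant (\itc{hrota})",
    r"columb?colba :colbid:colbin",
]
records = [parse_alt_line(s) for s in samples]
for record in records:
    print(record)
```

Limiting the higher-taxon split to two slashes keeps the TeX field intact even when it contains the italic correction (\/).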

### 5. Building the standard product files

Once you have prepared all the input files, you can compile them into a set of standard product files. These product files are all "flat files" that give the same information in a form more immediately usable in database applications.

#### 5.1 The nombuild program

The nombuild program checks the various input files and compiles them into a set of standard product files (described below).

To run this program, change to the directory containing all the input files and type the command:

    nombuild

If there are any problems with the input files, the program will produce error messages on the standard output stream, and also produce a duplicate listing of these errors in file nombuild.log.

If there are no problems, all the product files will be written. These files are:

1. The tree file defines all the taxa in the standard forms file plus all subspecific taxa from the alternate forms file. Its name is the same as the input file, except it has extension .tre. For example, if the input files are aou640.std and aou640.alt, the tree file will be called aou640.tre.
2. The abbreviations file defines all the six-letter bird codes. This file has extension .ab6.
3. The collisions file describes every six-letter bird code that is invalid because two or more names would all abbreviate to that code. Its extension is .col.

#### 5.2 The tree (.tre) file

The tree file defines all the different scientific names used in the input. Here is the format of that file:

• var.: The first field on each line is the taxonomic key number. The exact format of this field depends on the content of the ranks file; see the discussion below.
• 6: If this taxon has a standard six-letter bird code, that code appears here; otherwise the field is blank.
• 1: For generally accepted forms, this field is blank. If the form is not in the main AOU Check-List, a question mark (?) appears here.
• 36: The next field is the scientific name of the group to which this form is referred, for example, Junco hyemalis. The field is aligned flush left and padded on the right with spaces. For forms not identified to species, the smallest containing taxon is used, e.g., Aves for "bird sp."
Note. For subspecific forms defined in the alternate forms file, this field contains the scientific name with a space and an integer appended. For example, in the line for the standard species Snow Goose, this field will have the value Chen caerulescens, while Blue Goose will have Chen caerulescens 1, Blue-Snow intergrade Chen caerulescens 2, and so on.
• 56: The English name of the form appears next, aligned flush left and right-padded with spaces. For multi-word names, the generic part comes first, followed by a comma, one space, and the specific part. Examples:
        Dunlin
        Loon, Red-throated
        grebe sp.
        bird sp.
        bird, large sp.
        teal, Blue-winged x Cinnamon
        Junco, (Gray-headed x Slate-colored) Dark-Eyed
• var.: At the end of the record is a variable-length field containing the English name, encoded for typesetting using TeX markup codes. Use this field to get diacritical marks and correct italicization of generic names.
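The field list above can be read as a sequence of fixed widths following a key whose length depends on the ranks file. The Python sketch below illustrates that reading; `parse_tre_line` is a hypothetical helper, the key width must be supplied by the caller, and the sample line is constructed for illustration rather than taken from a real .tre file.

```python
def parse_tre_line(line: str, key_width: int) -> dict:
    """Slice one .tre line using the field widths listed above."""
    line = line.rstrip("\n")
    pos = 0

    def take(width: int) -> str:
        # Return the next `width` characters and advance the cursor.
        nonlocal pos
        field = line[pos:pos + width]
        pos += width
        return field

    return {
        "key": take(key_width),            # taxonomic key number
        "code": take(6).rstrip(),          # six-letter code, may be blank
        "status": take(1).strip(),         # "" or "?"
        "scientific": take(36).rstrip(),   # scientific name, space-padded
        "english": take(56).rstrip(),      # English name, space-padded
        "tex": line[pos:],                 # TeX-encoded name, variable length
    }


# Constructed sample: an 11-digit key matches the sample ranks file.
sample = ("21243470100" + "daejun" + " "
          + "Junco hyemalis".ljust(36)
          + "Junco, Dark-eyed".ljust(56)
          + "Dark-eyed Junco")
record = parse_tre_line(sample, key_width=11)
print(record)
```

The key width is the sum of the digit counts of every rank except the root, so it can be computed once from the ranks file and reused for every line.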

The taxonomic key number can be used to sort records into taxonomic order. It contains one or more digits for each rank (except for the root rank). The number of digits for each rank is determined by the fourth column in the ranks file.

For example, if your ranks file looks like the example given above (2-digit order, 2-digit family, 1-digit subfamily, 2-digit genus, 2-digit species, and 2-digit form), each taxonomic key number would have these components:

• The two-digit serial number of the taxonomic order in which this form is placed, or 00 if the bird is not placed into an order (e.g., "bird sp.").
• The two-digit serial number of the taxonomic family within this order, or 00 for forms not placed within a specific family. Note that the sequence of families starts over at 01 again within each order.
• The one-digit serial number of the subfamily within the family, or 0 if the subfamily is unknown.
• The two-digit serial number of the genus within the family, or 00 if the genus is unknown.
• The two-digit serial number of the species within the genus, or 00 if the species is unknown.
• The two-digit serial number of the form within the species, or 00 if the subspecies is unknown.

For example, code daejun (Dark-eyed Junco) might have a taxonomic key number of 21 24 3 47 01 00 (the spaces here are for clarity; they are not actually present in the record). This key would mean that this form is in the 21st order, the 24th family within that order, the 3rd subfamily within that family, the 47th genus within that subfamily, and is the first species within that genus.

Other forms that are included within Dark-eyed Junco will have keys 21 24 3 47 01 01, 21 24 3 47 01 02, and so on. Examples of such forms include races such as Gray-headed Junco, hybrids among the different races (e.g., "Gray-headed x Slate-colored Junco"), and obsolete names ("Northern Junco").
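Breaking a key into its per-rank components is a matter of slicing by the digit widths from the ranks file. Here is a small Python sketch, assuming the widths of the sample ranks file with the root rank (Class) left out; `split_key` and `WIDTHS` are illustrative names.

```python
def split_key(key: str, widths: list[int]) -> list[str]:
    """Cut a taxonomic key into one digit string per rank."""
    if len(key) != sum(widths):
        raise ValueError(f"key {key!r} does not match widths {widths}")
    parts = []
    pos = 0
    for width in widths:
        parts.append(key[pos:pos + width])
        pos += width
    return parts


# Order, family, subfamily, genus, species, form (root rank omitted).
WIDTHS = [2, 2, 1, 2, 2, 2]

parts = split_key("21243470100", WIDTHS)
print(parts)

# Keys are fixed-width digit strings, so sorting them as text
# gives the same result as sorting them rank by rank.
keys = ["21243470102", "21243470100", "21243470101"]
ordered = sorted(keys)
print(ordered)
```

Because every key has the same length, a plain string sort is enough to put records into taxonomic order.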

Note that the taxonomic key number can be used to deduce relationships between form codes. For example, to find out what genus a species is in, just construct a key number that is the same as the species' key number, but with its species number set to 00. Continuing the example above, suppose Gray-headed Junco has this key number:

    21 24 3 47 01 01

Then we can deduce all the higher ranks by substituting zeroes in the appropriate fields:

    21 24 3 47 01 00     is the containing species, Junco hyemalis
    21 24 3 47 00 00     is the containing genus, Junco
    21 24 3 00 00 00     is the containing subfamily, Emberizinae
    21 24 0 00 00 00     is the containing family, Emberizidae
    21 00 0 00 00 00     is the containing order, Passeriformes
    00 00 0 00 00 00     is the containing class, Aves
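The zero-substitution procedure shown above can be automated by clearing one field at a time from the right. The Python sketch below is one way to do it, assuming the digit widths of the sample ranks file; `ancestor_keys` is a hypothetical name, and skipping fields that are already zero is a choice made here to avoid repeating a key.

```python
def ancestor_keys(key: str, widths: list[int]) -> list[str]:
    """Return the keys of all containing taxa, nearest first."""
    # Slice the key into one digit string per rank.
    parts = []
    pos = 0
    for width in widths:
        parts.append(key[pos:pos + width])
        pos += width

    ancestors = []
    # Walk from the lowest rank upward, zeroing one field at a time.
    for i in range(len(parts) - 1, -1, -1):
        if int(parts[i]) == 0:
            continue  # nothing recorded at this rank, so no new ancestor
        parts[i] = "0" * widths[i]
        ancestors.append("".join(parts))
    return ancestors


WIDTHS = [2, 2, 1, 2, 2, 2]  # order, family, subfamily, genus, species, form

chain = ancestor_keys("21243470101", WIDTHS)
for ancestor in chain:
    print(ancestor)
```

For the Gray-headed Junco key this yields the same six keys as the table above, from the containing species down to the all-zero key of the root class.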

#### 5.3 The abbreviations (.ab6) file

The .ab6 file defines all the six-letter bird abbreviations. Each abbreviation is specified by its taxon field, which is a link to the tree file. Fields are:

• 6: The six-letter bird code, uppercased, aligned flush left in the field and padded on the right with blanks.
• 32: The taxon to which this code is referred; a link to the same field in the .tre file.
• var.: The English name from which this abbreviation was derived.

Here are examples of lines from an .ab6 file:

    CACGOOBranta canadensis 2             Cackling Goose
    CALLINCarpodacus mexicanus            California Linnet

The first is for code CACGOO, derived from the name "Cackling Goose," and it is the second subspecific form for Branta canadensis, the Canada Goose. The second line is for code CALLIN, derived from the name "California Linnet," an alternate name for House Finch.
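Reading an .ab6 line amounts to slicing at columns 6 and 38, then optionally peeling the form number off the taxon field. The following Python sketch illustrates this; `parse_ab6_line` and `split_form_number` are hypothetical helpers, and returning 0 for "no form number" is a convention chosen here, not one defined by the file format.

```python
def parse_ab6_line(line: str) -> tuple[str, str, str]:
    """Slice one .ab6 line into (code, taxon, English name)."""
    line = line.rstrip("\n")
    code = line[0:6].rstrip()      # 6 columns, blank-padded
    taxon = line[6:38].rstrip()    # 32 columns, link to the .tre file
    english = line[38:].rstrip()   # remainder of the line
    return code, taxon, english


def split_form_number(taxon: str) -> tuple[str, int]:
    """Separate a trailing subspecific form number, if there is one."""
    name, _, last = taxon.rpartition(" ")
    if last.isdigit():
        return name, int(last)
    return taxon, 0                # 0 here means "no form number"


# Build the sample lines with explicit padding so the columns are exact.
line1 = "CACGOO" + "Branta canadensis 2".ljust(32) + "Cackling Goose"
line2 = "CALLIN" + "Carpodacus mexicanus".ljust(32) + "California Linnet"

entry1 = parse_ab6_line(line1)
entry2 = parse_ab6_line(line2)
print(entry1, split_form_number(entry1[1]))
print(entry2, split_form_number(entry2[1]))
```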

#### 5.4 The collisions (.col) file

The .col file enumerates all the six-letter form codes that are involved in collisions. Each line has this format:

• 6: The code which is invalid because two or more names would abbreviate to that code by the rules.
• 6: A code which is one of the valid codes substituted for the invalid code in the first field. This code is defined in the .ab6 file.

Here is an example showing three records from a .col file. These three lines document the collision between three names for code PASSER. The preferred substitute codes are PASINA (for Passerina), PASINE (for "passerine"), and PASR (for Passer):

    PASSERPASINA
    PASSERPASINE
    PASSERPASR__
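Since each .col line pairs one invalid code with one substitute, an application will usually want the lines grouped by invalid code. A short Python sketch of that grouping follows; `load_collisions` is a hypothetical name, and the sample lines use real spaces where the listing above shows "_".

```python
from collections import defaultdict


def load_collisions(lines) -> dict[str, list[str]]:
    """Map each invalid code to the list of its substitute codes."""
    table = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        invalid = line[0:6].rstrip()      # first 6 columns: the colliding code
        substitute = line[6:12].rstrip()  # next 6 columns: one valid substitute
        table[invalid].append(substitute)
    return dict(table)


sample = [
    "PASSERPASINA",
    "PASSERPASINE",
    "PASSERPASR  ",
]
collisions = load_collisions(sample)
print(collisions)
```

A lookup in this table lets data-entry code reject an ambiguous abbreviation and offer the valid substitutes instead.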
