
3. Theory of KWIC indexing

The purpose of an index is to help a reader find some word or phrase. A properly constructed book or Web site should have an index that will help the reader find material relevant to a large number of different words or phrases.

However, building a proper index for a book is a tedious process that is best performed by a trained indexer who understands the subject matter. The technique of KWIC indexing arose in the 1960s as an attempt to automate indexing. The basic idea is to identify keywords and present them, in alphabetical order, surrounded by their context.

Here's an example. Suppose you want to build an index of words that appear in a list of film titles. For the film title “Driving Miss Daisy”, there will be three index entries, one for each word; we'll call this the classical style of indexing.

Daisy, Driving Miss
Driving Miss Daisy
Miss Daisy, Driving
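
As a rough illustration of how the three entries above can be derived, here is a minimal sketch. The function name classical_entries is hypothetical and is not part of kwic.py; the sketch also assumes that every whitespace-separated word is a keyword and that sorting ignores case.

```python
def classical_entries(title):
    """Return one index entry per word of title, in alphabetical order.

    Hypothetical helper for illustration; not the kwic.py interface.
    """
    words = title.split()
    entries = []
    for i in range(len(words)):
        head = " ".join(words[i:])   # the keyword and everything after it
        tail = " ".join(words[:i])   # the words that preceded the keyword
        entries.append(f"{head}, {tail}" if tail else head)
    # Assumption: entries are ordered without regard to case.
    return sorted(entries, key=str.lower)


for entry in classical_entries("Driving Miss Daisy"):
    print(entry)
```

Each entry starts with its keyword, and the words that preceded the keyword in the title are moved to the end, after a comma.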

In the original sense, a KWIC index divides the page vertically in two, with the keywords running along the right side of the dividing line in alphabetical order, and the context shown around the keyword, like this:

Driving Miss | Daisy
             | Driving Miss Daisy
     Driving | Miss Daisy

This is called the permuted style because the title is cyclically rotated through the position of each keyword. The kwic.py module can be used to build either the permuted style or the classical style.
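
The permuted layout can be sketched in a few lines. This is an illustration under stated assumptions, not the kwic.py interface: the names permuted_entries and format_permuted are hypothetical, every whitespace-separated word is treated as a keyword, and a "|" character stands in for the dividing line.

```python
def permuted_entries(title):
    """Return (prefix, keyword plus suffix) pairs, ordered by keyword.

    Hypothetical helper for illustration; not the kwic.py interface.
    """
    words = title.split()
    pairs = [(" ".join(words[:i]), " ".join(words[i:]))
             for i in range(len(words))]
    # Assumption: entries are ordered without regard to case.
    return sorted(pairs, key=lambda pair: pair[1].lower())


def format_permuted(pairs):
    """Right-align each prefix so the keywords line up after the divider."""
    if not pairs:
        return []
    width = max(len(prefix) for prefix, _ in pairs)
    return [f"{prefix:>{width}} | {rest}" for prefix, rest in pairs]


for line in format_permuted(permuted_entries("Driving Miss Daisy")):
    print(line)
```

Right-aligning the prefix is what puts every keyword in the same column, so a reader can scan down the dividing line alphabetically.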

Some definitions:

keyword

A contiguous string consisting of one keyword start character followed by zero or more keyword characters.

keyword start character

Any character c for which c.isalpha() is true, or the underbar (_) character.

keyword character

Any keyword start character, digit, or hyphen (-).

stop word

A common word that is not considered significant, such as “a”, “and”, or “the”.

exclusion list

A list of stop words.

prefix

The part of the context that precedes a keyword.

suffix

The part of the context that follows a keyword.
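
The definitions above can be read as a small scanner. The sketch below is one possible interpretation, not the kwic.py implementation: the function names are hypothetical, "digit" is taken to mean any character for which c.isdigit() is true, stop words are matched without regard to case, and the prefix and suffix keep whatever whitespace and punctuation surround the keyword.

```python
def is_keyword_start(c):
    """A keyword start character: alphabetic, or the underbar."""
    return c.isalpha() or c == "_"


def is_keyword_char(c):
    """A keyword character: a start character, a digit, or a hyphen."""
    # Assumption: "digit" means str.isdigit().
    return is_keyword_start(c) or c.isdigit() or c == "-"


def find_keywords(text, exclusion_list=()):
    """Yield (prefix, keyword, suffix) for each keyword that is not a stop word.

    Hypothetical helper for illustration; not the kwic.py interface.
    """
    # Assumption: stop words are compared case-insensitively.
    stop_words = {word.lower() for word in exclusion_list}
    i, n = 0, len(text)
    while i < n:
        if is_keyword_start(text[i]):
            j = i + 1
            while j < n and is_keyword_char(text[j]):
                j += 1
            keyword = text[i:j]
            if keyword.lower() not in stop_words:
                yield text[:i], keyword, text[j:]
            i = j
        else:
            i += 1


for prefix, keyword, suffix in find_keywords("The 39 Steps", ["a", "and", "the"]):
    print(repr(prefix), repr(keyword), repr(suffix))
```

In “The 39 Steps”, “The” is dropped because it is on the exclusion list, and “39” is never a keyword because a digit cannot start one. A title such as “Catch-22” yields the single keyword “Catch-22”, since hyphens and digits may follow a keyword start character.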