4. Using the kwic.py module

Note

All character data handled by this module uses Python's unicode type for full Unicode compatibility.

Any value of type str you provide must use UTF-8 encoding. Any text value provided by this module will have type unicode.
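The str/unicode rule above can be illustrated with a small normalizing helper. This is only a sketch: the name to_text is hypothetical and is not part of kwic.py. It relies on the fact that bytes is the same type as str in Python 2, so the same code also runs under Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings (assumed UTF-8) first."""
    # In Python 2, bytes is an alias for str, so this catches plain str values.
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Anything else is assumed to be text already and is passed through.
    return value


# A UTF-8 encoded byte string and the equivalent text value compare equal
# once both have been normalized.
encoded = b"caf\xc3\xa9"
decoded = to_text(encoded)
same = (decoded == to_text(u"caf\xe9"))
```

Normalizing at the boundary like this means the rest of a program only ever deals with one text type.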

Here is the general procedure for using this module:

  1. Import the module and call the KwicIndex constructor to get an empty instance.

  2. Feed this instance the set of lines (strings) to be indexed. Lines may be either type unicode, or type str encoded as UTF-8. With each line that you feed it, you may also pass a value (such as a URL or page number) that will be associated with that line.

  3. Ask the instance to produce the index entries in alphabetical order as a sequence of instances of class KwicEntry.
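To make the three steps concrete without depending on kwic.py itself, here is a standalone toy version of the same idea. Everything in it is an assumption made for illustration: the function name toy_kwic, the tuple layout of its results, the word-splitting rule, and the case-insensitive stop word test are not taken from the module, which returns KwicEntry instances instead.

```python
import re


def toy_kwic(lines, stop_words=()):
    """Return sorted (keyword, line, user_data) tuples.

    lines is an iterable of (line, user_data) pairs; stop_words is a
    sequence of words to leave out of the index.
    """
    # Assumption: stop words are matched without regard to case.
    stops = set(w.lower() for w in stop_words)
    entries = []
    for line, user_data in lines:
        # Assumption: a "word" is any run of letters, digits or underscores.
        for word in re.findall(r"\w+", line, re.UNICODE):
            key = word.lower()
            if key not in stops:
                entries.append((key, line, user_data))
    # Sort by keyword, then by line; user_data is carried along unsorted.
    entries.sort(key=lambda e: (e[0], e[1]))
    return entries


pages = [
    (u"The art of indexing", u"page 12"),
    (u"Indexing the web", u"page 40"),
]
for keyword, line, where in toy_kwic(pages, stop_words=[u"the", u"of"]):
    print(u"%-10s %s (%s)" % (keyword, line, where))
```

Each line appears once per keyword it contains, and the value passed alongside the line travels with every entry, which is the role the module gives to the associated URL or page number.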

4.1. Using the KwicIndex class

Here is the interface to the KwicIndex class.

KwicIndex(stopList=None)

Returns a new KwicIndex instance with no references in it. The stopList argument must be a sequence of zero or more stop words. The default value is a stop word list given in Section 13, “The default stop_words file”.

.index(s, userData=None)

The argument s is a line of text. All words in s that are not in the stopList are added to the instance. The instance associates the userData with the line so it can be retrieved later when the index is generated.

.genWords(prefix='')

Generates all the unique keywords in the instance that start with the supplied prefix, as a sequence of KwicWord instances, in ascending order by keyword.
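The prefix behavior described here can be sketched over a plain sorted list of strings. This is not the module's implementation: the function name words_with_prefix is hypothetical, and it yields plain strings rather than KwicWord instances. It relies on the fact that, in a sorted list, all words sharing a prefix sit next to each other.

```python
import bisect


def words_with_prefix(sorted_words, prefix=u""):
    """Yield the words that start with prefix, in ascending order.

    sorted_words must already be sorted and free of duplicates.
    """
    # Every word that starts with prefix compares >= prefix, so a binary
    # search finds where the matching run begins.
    start = bisect.bisect_left(sorted_words, prefix)
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break  # past the end of the matching run
        yield word


keywords = sorted(set([u"index", u"indexing", u"web", u"art", u"index"]))
matches = list(words_with_prefix(keywords, u"ind"))
everything = list(words_with_prefix(keywords))
```

With the default empty prefix every keyword matches, which mirrors the default prefix='' in the signature above.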