9. class KwicIndex: The entire index

An instance of this class represents the complete KWIC index to all of the lines submitted to it for indexing. Here is the formal interface.

kwic.py
# - - - - -   c l a s s   K w i c I n d e x

class KwicIndex(object):
    '''Represents a keyword index.

      Exports:
        KwicIndex(stopList=None):
          [ if stopList is a sequence including at least one
            str that is not valid UTF-8 ->
              raise UnicodeEncodeError
            else if stopList is a sequence of unicode or UTF-8
            stop words ->
              return a new, empty KwicIndex using stopList as its
              stop word list
            else if file stop_words is readable ->
              return a new, empty KwicIndex using the keywords
              found in that file as its stop word list
            else ->
              return a new, empty KwicIndex with no stop words ]
        .index(s):
          [ s is a unicode or UTF-8 string ->
              self  :=  self with all keywords in s added ]
        .genWords(prefix=''):
          [ prefix is a unicode or UTF-8 string ->
              generate all the unique keywords in self that start
              with prefix as a sequence of KwicWord instances, in
              ascending order by upshifted keyword ]

      State/Invariants:
        .__stopSet:
          [ a set containing words in the stop list as upshifted
            unicode ]
        .__skip:
          [ an instance of pyskip.SkipList representing the
            keyword occurrences in self as KwicWord instances,
            ordered by the key that KwicWord.getKey returns ]
    '''
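
The contract above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the implementation developed in this document: the class name ToyKwicIndex is hypothetical, it is written for Python 3, it treats a keyword as a run of word characters or apostrophes, it upshifts the prefix before matching, it yields plain strings where the real .genWords() yields KwicWord instances, and it omits the fallback to the stop_words file.

```python
import re


class ToyKwicIndex:
    """Hypothetical stand-in that mimics the documented contract only."""

    def __init__(self, stopList=None):
        # Stop words are stored upshifted, like the .__stopSet invariant.
        self._stop = {w.upper() for w in (stopList or [])}
        self._words = set()

    def index(self, s):
        # Assumption: a keyword is a run of word characters or apostrophes.
        for word in re.findall(r"[\w']+", s):
            if word.upper() not in self._stop:
                self._words.add(word.upper())

    def genWords(self, prefix=''):
        # Assumption: the prefix is upshifted before it is compared.
        wanted = prefix.upper()
        # Yield the unique keywords in ascending order by upshifted keyword.
        for word in sorted(self._words):
            if word.startswith(wanted):
                yield word


toy = ToyKwicIndex(stopList=['the', 'of'])
toy.index('The index of keywords')
toy.index('Keyword in context')
all_words = list(toy.genWords())
key_words = list(toy.genWords('key'))
```

Here all_words holds every keyword that is not a stop word, and key_words holds only those that begin with KEY.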

9.1. KwicIndex.__init__(): Constructor

The constructor has two jobs: set up the empty skip list and set up the stop word list.

kwic.py
# - - -   K w i c I n d e x . _ _ i n i t _ _

    def __init__(self, stopList=None):
        '''Constructor.
        '''
        #-- 1 --
        # [ self.__skip  :=  a new, empty pyskip.SkipList instance
        #       that uses KwicWord.getKey as a key extractor ]
        self.__skip = pyskip.SkipList(keyFun=KwicWord.getKey)
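
To show what a key extractor does, here is a minimal sketch that uses a sorted list in place of a skip list. It does not use pyskip or reproduce its API: the classes KeyedSortedList and Word are hypothetical, the code is written for Python 3, and the assumption that the key is the upshifted keyword is ours.

```python
import bisect


class KeyedSortedList:
    """Hypothetical keyed ordered container (a stand-in, not pyskip)."""

    def __init__(self, keyFun):
        self._keyFun = keyFun
        self._keys = []     # sorted keys, parallel to self._items
        self._items = []

    def insert(self, item):
        # The container never compares items directly, only their keys.
        key = self._keyFun(item)
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._items.insert(pos, item)

    def __iter__(self):
        return iter(self._items)


class Word:
    """Hypothetical miniature of KwicWord: a keyword and its key."""

    def __init__(self, text):
        self.text = text

    def getKey(self):
        # Assumption: the key is the upshifted keyword.
        return self.text.upper()


words = KeyedSortedList(keyFun=Word.getKey)
for text in ["skip", "Index", "kwic"]:
    words.insert(Word(text))
ordered = [w.text for w in words]
```

Passing the method itself as keyFun, as the constructor does with KwicWord.getKey, lets the container order its members without knowing anything else about them.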

Internally, the stop word list is a Python set named self.__stopSet whose members are uppercased Unicode strings. If the stopList argument is None, we try to read the default stop file, stop_words; if that file is not readable, the set is empty. See Section 9.2, “KwicIndex.__makeStopSet(): Build the internal stop list”.

kwic.py
        #-- 2 --
        # [ if (stopList is not None) and (stopList includes at
        #   least one str that is not UTF-8) ->
        #     raise UnicodeEncodeError
        #   else if stopList is not None ->
        #     self.__stopSet  :=  a set made from the elements of
        #         stopList, converted to Unicode and upshifted
        #   else if file stop_words is readable ->
        #     self.__stopSet  :=  a set made from the keywords
        #         found in that file, as upshifted Unicode
        #   else ->
        #     self.__stopSet  :=  an empty set ]
        self.__makeStopSet(stopList)
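
The three-way choice in the intended function above can be sketched as a standalone function. This is not the method from Section 9.2: the name make_stop_set and its stop_file parameter are hypothetical, the code is written for Python 3 (where a byte string that is not valid UTF-8 raises UnicodeDecodeError when decoded), and it assumes the stop file holds whitespace-separated words.

```python
def make_stop_set(stop_list=None, stop_file="stop_words"):
    """Return a set of upshifted stop words (hypothetical helper)."""
    if stop_list is not None:
        # Case 1: an explicit list wins, even if it is empty.
        # bytes items are decoded as UTF-8; invalid bytes raise
        # UnicodeDecodeError in Python 3.
        return {
            (w.decode("utf-8") if isinstance(w, bytes) else w).upper()
            for w in stop_list
        }
    try:
        # Case 2: fall back to the default stop file.
        with open(stop_file, encoding="utf-8") as f:
            # Assumption: words are separated by whitespace.
            return {w.upper() for w in f.read().split()}
    except OSError:
        # Case 3: the file is missing or unreadable, so no stop words.
        return set()


explicit = make_stop_set(["the", b"of"])
missing = make_stop_set(None, stop_file="/nonexistent/stop_words")
```

Note that an empty list is not the same as None: it yields an empty stop set without consulting the file.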