
9.2. KwicIndex.__makeStopSet(): Build the internal stop list

kwic.py
# - - -   K w i c I n d e x . _ _ m a k e S t o p S e t

    def __makeStopSet(self, stopList):
        '''Build the internal stop list.

          [ stopList is a sequence of unicode or UTF-8 string
            values or None ->
              if stopList is not None ->
                if stopList contains at least one str that is not
                valid UTF-8 ->
                  raise UnicodeEncodeError
                else ->
                  self.__stopSet  :=  a set made from the elements of
                      stopList, converted to Unicode and upshifted
              else if file stop_words is readable ->
                self.__stopSet  :=  a set made from the keywords
                    found in that file, as upshifted Unicode
              else ->
                self.__stopSet  :=  an empty set ]
        '''

For the logic that converts UTF-8 encoded strings to Unicode, see Section 9.3, “KwicIndex.__makeUni(): Force Unicode representation”.

kwic.py
        #-- 1 --
        # [ if stopList is not None ->
        #     if stopList contains at least one str that is not
        #     valid UTF-8 ->
        #       raise UnicodeEncodeError
        #     else ->
        #       self.__stopSet  :=  a set made from the elements of
        #           stopList, converted to Unicode and upshifted
        #       return
        #   else -> I ]
        if stopList is not None:
            self.__stopSet = set (
                [ self.__makeUni(s).upper()
                  for s in stopList ] )
            return
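
The effect of this step can be tried outside the class. What follows is a minimal sketch, not the module's own code: it is written for Python 3, where `str` is already Unicode and UTF-8 input arrives as `bytes`, and the helper names `make_uni` and `make_stop_set` are hypothetical stand-ins for the method in Section 9.3 and for this step.

```python
def make_uni(s):
    """Return s as Unicode text; bytes values are assumed to be UTF-8."""
    if isinstance(s, bytes):
        # In Python 3, invalid UTF-8 raises UnicodeDecodeError here.
        return s.decode('utf-8')
    return s


def make_stop_set(stop_list):
    """Build a set of upshifted Unicode stop words from a sequence."""
    return set(make_uni(s).upper() for s in stop_list)


# Mixed text and UTF-8 bytes; duplicates that differ only in case collapse.
stop_set = make_stop_set(
    ['the', b'and', 'The', 'caf\u00e9'.encode('utf-8')])
print(sorted(stop_set))
```

Because every element is upshifted before it enters the set, 'the' and 'The' yield a single member, which is what lets later lookups ignore case.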

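When the caller supplies no explicit stop list, we fall back to the default stop list file. We start with an empty set, so that the method can simply return if that file turns out to be unreadable.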

kwic.py
        #-- 2 --
        # [ self.__stopSet  :=  a new, empty set ]
        self.__stopSet = set()

Note

We assume that the file being read is the one given in Section 13, “The default stop_words file”, which contains only ASCII characters. Nothing prevents a user from substituting their own file under the name stop_words. With a plain open(), however, the unicode(s) conversion in step 4 uses the default ASCII codec and will fail on non-ASCII bytes. If some user someday wants to substitute a file that is encoded as UTF-8, replace the open() call in the following block with this logic, taken from the Python Unicode HOWTO:

import codecs
stopFile = codecs.open(STOP_FILE_NAME, encoding='utf-8')

kwic.py
        #-- 3 --
        # [ if file STOP_FILE_NAME can be opened for reading ->
        #     stopFile  :=  that file, so opened
        #   else -> return ]
        try:
            stopFile = open ( STOP_FILE_NAME )
        except IOError:
            return
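
The UTF-8 variant from the note can keep the same fall-through behaviour as the block above. The sketch below is an illustration under stated assumptions, not part of kwic.py: `open_stop_file` is a hypothetical helper, the temporary directory and file contents are invented for the demonstration, and `io.open` is used because it accepts an `encoding` argument in both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile


def open_stop_file(path, encoding='utf-8'):
    """Return the file at path opened for reading as text, or None."""
    try:
        return io.open(path, encoding=encoding)
    except IOError:
        # Missing or unreadable file: the caller keeps its empty set.
        return None


# Demonstration in a fresh scratch directory (hypothetical file names).
work_dir = tempfile.mkdtemp()
missing = open_stop_file(os.path.join(work_dir, 'absent'))

present_path = os.path.join(work_dir, 'stop_words')
with io.open(present_path, 'w', encoding='utf-8') as f:
    f.write(u'the\nna\u00efve\n')

stop_file = open_stop_file(present_path)
lines = [line.strip() for line in stop_file]
stop_file.close()
```

Returning None instead of raising mirrors the bare `return` in the block above: an absent stop file is treated as an ordinary condition, not as an error.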

Our stop_words file has one word per line and nothing else. However, because the logic to find all the keywords in a line already exists (see Section 9.4, “KwicIndex.__findKeywords(): Find all the keywords in a line”), we can let the stop file have any number of keywords per line and ignore everything else.

kwic.py
        #-- 4 --
        # [ self.__stopSet  +:=  Unicode-converted, upshifted
        #       keywords from stopFile ]
        for line in [ unicode(s)
                      for s in stopFile ]:
            for (start, end) in self.__findKeywords(line):
                self.__stopSet.add(line[start:end].upper())

        #-- 5 --
        stopFile.close()
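
To see why reusing the keyword finder makes the file format so forgiving, here is a small self-contained sketch in Python 3. It rests on an assumption: `find_keywords` and its regular expression are hypothetical stand-ins that treat any run of letters as a keyword, whereas the actual rule is whatever Section 9.4 defines.

```python
import re

# Hypothetical keyword rule: one or more letters (no digits, no underscore).
KEYWORD_RE = re.compile(r"[^\W\d_]+")


def find_keywords(line):
    """Yield (start, end) index pairs, one per keyword found in line."""
    for match in KEYWORD_RE.finditer(line):
        yield match.span()


def add_stop_words(stop_set, lines):
    """Add every keyword in lines to stop_set, upshifted."""
    for line in lines:
        for start, end in find_keywords(line):
            stop_set.add(line[start:end].upper())


stop_set = set()
# Several words per line, stray punctuation, and a blank line are all fine.
add_stop_words(stop_set, ['a an the\n', '  of, to;  in\n', '\n'])
print(sorted(stop_set))
```

Punctuation, extra whitespace, and empty lines never reach the set, because only the slices reported by the keyword finder are added.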