# - - -   K w i c I n d e x . _ _ m a k e S t o p S e t

    def __makeStopSet(self, stopList):
        '''Build the internal stop list.

          [ stopList is a sequence of unicode or UTF-8 string
            values or None ->
              if stopList is not None ->
                if stopList contains at least one str that is not
                valid UTF-8 ->
                  raise UnicodeEncodeError
                else ->
                  self.__stopSet  :=  a set made from the elements of
                      stopList, converted to Unicode and upshifted
              else if file stop_words is readable ->
                self.__stopSet  :=  a set made from the keywords
                    found in that file, as upshifted Unicode
              else ->
                self.__stopSet  :=  an empty set ]
        '''

For the logic that converts UTF-8 encoded strings to
Unicode, see Section 9.3, “KwicIndex.__makeUni(): Force Unicode”.

        #-- 1 --
        # [ if stopList is not None ->
        #     if stopList contains at least one str that is not
        #     valid UTF-8 ->
        #       raise UnicodeEncodeError
        #     else ->
        #       self.__stopSet  :=  a set made from the elements of
        #           stopList, converted to Unicode and upshifted
        #       return
        #   else -> I ]
        if stopList is not None:
            self.__stopSet = set([self.__makeUni(s).upper()
                                  for s in stopList])
            return

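The conversion in step 1 can be tried in isolation. The following is a minimal Python 3 sketch, not the module's own code: `make_uni()` is a hypothetical stand-in for `KwicIndex.__makeUni()` (whose real definition is in Section 9.3), assumed here to decode byte strings as UTF-8 and pass text strings through unchanged, and `make_stop_set()` is likewise a name invented for illustration.

```python
def make_uni(s):
    """Hypothetical stand-in for KwicIndex.__makeUni().

    Assumption: byte strings are UTF-8 encoded; text passes through.
    """
    if isinstance(s, bytes):
        return s.decode("utf-8")
    return s


def make_stop_set(stop_list):
    """Return the elements of stop_list as an upshifted set of text."""
    return {make_uni(s).upper() for s in stop_list}


# Duplicates that differ only in case collapse to one entry.
stop_set = make_stop_set(["the", b"\xc3\xa9t\xc3\xa9", "The"])
```

In Python 3, a byte string that is not valid UTF-8 makes `bytes.decode()` raise `UnicodeDecodeError`, so that is the exception this sketch would surface for a bad element.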
Bereft of an explicit stop list, we next try to read the default stop list file.
        #-- 2 --
        # [ self.__stopSet  :=  a new, empty set ]
        self.__stopSet = set()

We assume that we're reading the file given in Section 13, “The default
stop_words file”, which contains no non-ASCII
characters. There's no reason the user can't
substitute their own file and name it stop_words.
If some user someday wants to substitute a file that
is encoded as UTF-8, replace the following block
with this logic, taken from the Python Unicode HOWTO:

    import codecs
    stopFile = codecs.open(STOP_FILE_NAME, encoding='utf-8')

        #-- 3 --
        # [ if file STOP_FILE_NAME can be opened for reading ->
        #     stopFile  :=  that file, so opened
        #   else -> return ]
        try:
            stopFile = open(STOP_FILE_NAME)
        except IOError:
            return

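For readers on Python 3, the two variants above (plain `open()` and the UTF-8 `codecs.open()`) collapse into one call, because the built-in `open()` accepts an `encoding` argument. The sketch below is an illustration under assumptions, not part of the module: `open_stop_file()` is a hypothetical helper, and the value given to `STOP_FILE_NAME` is assumed from the file name mentioned in the text.

```python
STOP_FILE_NAME = "stop_words"   # assumed value; the real constant is defined elsewhere


def open_stop_file(path=STOP_FILE_NAME):
    """Return the stop file opened as UTF-8 text, or None if unreadable."""
    try:
        return open(path, encoding="utf-8")
    except OSError:              # IOError is an alias of OSError in Python 3
        return None
```

A caller would treat a `None` result the way step 3 treats `IOError`: leave the stop set empty and return.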
Although the default stop_words file has one word per line and nothing
else, we already have the logic to find all the keywords
in a line in Section 9.4, “KwicIndex.__findKeywords(): Find all the
keywords in a line”, so we can
allow the stop file to have any number of keywords in a line,
and ignore anything else.

        #-- 4 --
        # [ self.__stopSet  +:=  Unicode-converted, upshifted
        #       keywords from stopFile ]
        for line in [unicode(s) for s in stopFile]:
            for (start, end) in self.__findKeywords(line):
                self.__stopSet.add(line[start:end].upper())

        #-- 5 --
        stopFile.close()
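
To show how step 4 tolerates several keywords per line and ignores everything else, here is a small Python 3 sketch. It assumes, as the loop above implies, that the keyword finder yields `(start, end)` slice bounds. The pattern used (runs of letters) is a hypothetical stand-in: the real rule is in `KwicIndex.__findKeywords()` (Section 9.4) and may differ. `find_keywords()` and `stop_set_from_lines()` are names invented for this example.

```python
import re

# Hypothetical keyword pattern: one or more letters. The module's own
# definition of a keyword may be different.
KEYWORD_RE = re.compile(r"[^\W\d_]+")


def find_keywords(line):
    """Yield (start, end) slice bounds for each keyword in line."""
    for match in KEYWORD_RE.finditer(line):
        yield (match.start(), match.end())


def stop_set_from_lines(lines):
    """Collect the upshifted keywords from an iterable of text lines."""
    stop_set = set()
    for line in lines:
        for start, end in find_keywords(line):
            stop_set.add(line[start:end].upper())
    return stop_set


# Punctuation, whitespace, and blank lines contribute nothing.
words = stop_set_from_lines(["the and\n", "  of -- a\n", "\n"])
```

Because the function takes any iterable of lines, an open text file can be passed to it directly.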