9.4. KwicIndex.__findKeywords(): Find all the keywords in a line

The purpose of this method is to find every group of characters in a line that has the pattern of a keyword: a keyword start character followed by zero or more keyword characters.
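As a point of comparison, the same pattern can be written as a regular expression. The sketch below is not part of kwic.py. It assumes, purely for illustration, that a keyword start character is an ASCII letter and that a keyword character is a letter, digit, or underscore; the real character sets are whatever the class's own tests accept. The names KEYWORD_PAT and keyword_spans are hypothetical.

```python
import re

# Assumption: start characters are ASCII letters; keyword characters
# are letters, digits, and underscores. Adjust to the real sets.
KEYWORD_PAT = re.compile(r'[A-Za-z][A-Za-z0-9_]*')

def keyword_spans(s):
    '''Generate (start, end) tuples bracketing each keyword in s,
    such that each keyword is found in s[start:end].
    '''
    for match in KEYWORD_PAT.finditer(s):
        # Match.span() returns the (start, end) slice bounds.
        yield match.span()

spans = list(keyword_spans("the quick-brown fox2"))
# spans == [(0, 3), (4, 9), (10, 15), (16, 20)]
```

Under these assumptions the hyphen separates "quick" and "brown" into two keywords, while the trailing digit stays attached to "fox2".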

kwic.py
# - - -   K w i c I n d e x . _ _ f i n d K e y w o r d s

    def __findKeywords(self, s):
        '''Find all the keywords in the given string.

          [ s is a string ->
              generate (start,end) tuples bracketing the keywords
              in s such that each keyword is found in s[start:end] ]
        '''

We will use a simple state machine to process the line. The variable start will be initially set to None, and will mark the starting position of each keyword. We walk through the line, examining each character.

  1. If this is the transition between a non-keyword and a keyword (at a keyword start character), set start to the current position.

  2. If this is the transition between a keyword and a non-keyword, generate the tuple bracketing the keyword, and set start back to None.

kwic.py
        #-- 1 --
        start = None

        #-- 2 --
        # [ if s ends with a keyword ->
        #     start  :=  starting position of that keyword
        #     generate (start, end) tuples bracketing any keywords
        #     that don't end (s)
        #   else ->
        #     generate (start, end) tuples bracketing any keywords
        #     that don't end (s) ]
        for i in range(len(s)):
            #-- 2 body --
            # [ if (start is None) and
            #   (s[i] is a start character) ->
            #     start  :=  i
            #   else if (start is not None) and
            #   (s[i] is not a word character) ->
            #     yield (start, i)
            #     start  :=  None
            #   else -> I ]
            if start is None:
                if self.__isStart(s[i]):
                    start = i
                else:
                    pass
            elif not self.__isWord(s[i]):
                yield (start, i)
                start = None

After inspecting all the characters, if start is not None, the line ended with a keyword; generate the bracketing tuple.

kwic.py
        #-- 3 --
        if start is not None:
            yield (start, len(s))

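        #-- 4 --
        # A plain return ends the generator.  Under PEP 479
        # (Python 3.7+), an explicit "raise StopIteration" inside
        # a generator is turned into a RuntimeError.
        return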
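To try the state machine outside the class, here is a minimal standalone sketch. The function name find_keywords and the two default predicates are assumptions made for this example: they stand in for the class's private start-character and word-character tests, whose actual definitions are not shown in this section.

```python
def find_keywords(s, is_start=str.isalpha,
                  is_word=lambda c: c.isalnum() or c == '_'):
    '''Generate (start, end) tuples bracketing the keywords in s.

    is_start and is_word are hypothetical stand-ins for the
    class's own character tests.
    '''
    start = None                      # None means "not inside a keyword"
    for i, c in enumerate(s):
        if start is None:
            if is_start(c):
                start = i             # non-keyword -> keyword transition
        elif not is_word(c):
            yield (start, i)          # keyword -> non-keyword transition
            start = None
    if start is not None:
        yield (start, len(s))         # the line ended inside a keyword

line = "KWIC: key word in context"
words = [line[start:end] for start, end in find_keywords(line)]
# words == ['KWIC', 'key', 'word', 'in', 'context']
```

Each tuple can be used directly as slice bounds, which is why the end position is one past the last character of the keyword.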