51.11. scanCmdGroup: Process command group

Of the “command group,” the only part we care about is the command (typically GET or POST) and the URL.

In the vast majority of log lines, there are three parts separated by spaces: the command, the URL, and the protocol. However, two infrequent cases differ: the protocol part may be missing, and the URL may itself contain spaces.

Our approach does the right thing in both cases:

  1. First we break the command group up on space characters. This gives us a list of at least two elements.

  2. If the last element of this list starts with the string "HTTP", we discard it.

  3. The first element of the resulting list is the command. The remaining elements are reassembled with the intervening blanks to form the URL.
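
As a rough illustration of the three steps above, the sketch below traces them on a few command groups. The sample strings and the helper name splitCmdGroup are invented for this example and are not part of pageget.py; the actual function, developed in the rest of this section, also cleans and validates the URL.

```python
# Illustrative sketch only: splitCmdGroup is a hypothetical helper,
# not part of pageget.py.
def splitCmdGroup(cmdGroup):
    # Step 1: break the command group on single space characters.
    wordList = cmdGroup.split(' ')
    # Step 2: discard the trailing protocol element, if there is one.
    if wordList[-1].startswith("HTTP"):
        wordList.pop()
    # Step 3: the first word is the command; the remaining words,
    # rejoined with the blanks that separated them, form the URL.
    return (wordList[0], " ".join(wordList[1:]))

usual      = splitCmdGroup("GET /index.html HTTP/1.1")
noProtocol = splitCmdGroup("GET /index.html")
spacedURL  = splitCmdGroup("GET /my file.html HTTP/1.0")

print(usual)        # ('GET', '/index.html')
print(noProtocol)   # ('GET', '/index.html')
print(spacedURL)    # ('GET', '/my file.html')
```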

pageget.py
# - - -   s c a n C m d G r o u p   - - -

def scanCmdGroup ( cmdGroup ):
    """Extract the command and URL from the command group

      [ if cmdGroup is a string ->
          if cmdGroup is a valid command group ->
            return (command from cmdGroup, URL from cmdGroup)
          else ->
            raise ValueError ]
    """

First we use .split() to break on spaces.

pageget.py
    #-- 1 --
    # [ wordList  :=  cmdGroup broken up on space characters ]
    wordList  =  cmdGroup.split ( ' ' )
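
The explicit ' ' argument matters here. The short comparison below, which is not part of pageget.py, shows how .split(' ') differs from .split() with no argument when two spaces occur in a row; the variable names are chosen only for this illustration.

```python
# Illustration only; not part of pageget.py.
sample = "GET  HTTP/1.1"            # two spaces after GET, no URL

explicit = sample.split(' ')        # separator given: empty strings are kept
default  = sample.split()           # no separator: runs of whitespace collapse

print(explicit)    # ['GET', '', 'HTTP/1.1']
print(default)     # ['GET', 'HTTP/1.1']

# With the explicit separator, the words after the command still
# rejoin to an empty URL once the protocol is removed, so the
# missing URL remains detectable.
urlPart = " ".join(explicit[1:-1])
print(repr(urlPart))    # ''
```

Splitting on an explicit single space also means that rejoining with " " reproduces the original spacing inside a URL exactly.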

Next, we remove the protocol element from the end, if there is one.

pageget.py
    #-- 2 --
    # [ if wordList[-1] starts with "HTTP" ->
    #     wordList  :=  wordList without its last element
    #   else -> I ]
    if  wordList[-1].startswith ( "HTTP" ):
        wordList.pop()

Finally, we reassemble the URL part and return a tuple containing the command and the URL. We call cleanURL() to remove URL-encoding and do other cleanup tasks; see Section 51.12, “cleanURL(): Process the raw URL”.

Note

In January 2011, some log entries turned up that had null bytes in them. This crashed the script because the etbuilder.py module translates strings to Unicode, and null bytes cannot be Unicode-encoded. Hence the test below that discards such entries.

In June 2011, some log entries came through that had command group "GET  HTTP/1.1", with two spaces after GET: the URL was completely missing. Hence the test for that below as well.

pageget.py
    #-- 3 --
    # [ if url is empty ->
    #     raise ValueError
    #   else ->
    #     decodedURL  :=  url, with URL-encoding decoded, minus any
    #                     "?..." tail ]
    url  =  " ".join ( wordList[1:] )
    if len(url) == 0:
        raise ValueError("The URL is missing: '%s'" % cmdGroup)
    else:
        decodedURL = cleanURL(url)

    #-- 4 --
    # [ if decodedURL contains any null characters ->
    #     raise ValueError
    #   else ->
    #     return (first element of wordList, decodedURL) ]
    if '\x00' in decodedURL:
        raise ValueError("Nulls disallowed: %r" % url)
    else:
        return (wordList[0], decodedURL)
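
To see the pieces working together, the sketch below assembles the fragments of this section into one function and runs it on a few invented command groups. The cleanURL() shown here is only a hypothetical stand-in for the real function of Section 51.12: it assumes Python 3's urllib.parse.unquote for decoding and simply cuts off any "?..." tail. The tryScan() helper and the sample strings are likewise made up for this illustration.

```python
from urllib.parse import unquote

def cleanURL(url):
    """Hypothetical stand-in for the real cleanURL() of Section 51.12."""
    # Drop any "?..." tail, then decode URL-encoding.
    return unquote(url.split('?', 1)[0])

def scanCmdGroup(cmdGroup):
    """Extract the command and URL from the command group."""
    wordList = cmdGroup.split(' ')
    if wordList[-1].startswith("HTTP"):
        wordList.pop()
    url = " ".join(wordList[1:])
    if len(url) == 0:
        raise ValueError("The URL is missing: '%s'" % cmdGroup)
    decodedURL = cleanURL(url)
    if '\x00' in decodedURL:
        raise ValueError("Nulls disallowed: %r" % url)
    return (wordList[0], decodedURL)

def tryScan(cmdGroup):
    """Hypothetical helper: return the result, or None if rejected."""
    try:
        return scanCmdGroup(cmdGroup)
    except ValueError as detail:
        print("rejected: %s" % detail)
        return None

good       = tryScan("GET /a%20b.html?x=1 HTTP/1.1")
missingURL = tryScan("GET  HTTP/1.1")         # two spaces, no URL
nullByte   = tryScan("GET /x%00y HTTP/1.1")   # decodes to a null character

print(good)    # ('GET', '/a b.html')
```

The null test operates on the decoded URL, so in this sketch a null byte that arrives URL-encoded as %00 is rejected as well.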