Of the “command group,” the only parts we care about are the command (typically GET or POST) and the URL.
In the vast majority of log lines, there are three parts separated by spaces: the command, the URL, and the protocol. However, there are two infrequent cases that differ:
Sometimes the protocol group may be missing.
The URL may have embedded, unescaped spaces in it. In that case, we need to preserve those spaces. There may be multiple URLs that differ only past the first space.
Hence, our approach is designed to do the right thing in both of those cases:
First we break the command group up on space characters. For a valid command group, this gives us a list of at least two elements.
If the last element of this list starts with the string "HTTP", we discard it.
The first element of the resulting list is the command. The remaining elements are reassembled with the intervening blanks to form the URL.
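The three steps above can be sketched in isolation, before any URL decoding is applied. This is only an illustration: the helper name `split_command_group` and the sample command groups are made up for this sketch, and it assumes Python 3 and a command group that contains at least a command.

```python
def split_command_group(cmd_group):
    """Return (command, raw URL) from a command group, with no URL decoding.

    Hypothetical helper for illustration; assumes cmd_group starts
    with a command word.
    """
    # Step 1: break on single space characters.
    words = cmd_group.split(' ')
    # Step 2: drop the protocol element, if there is one.
    if words[-1].startswith("HTTP"):
        words.pop()
    # Step 3: the first word is the command; rejoin the rest as the URL.
    return (words[0], " ".join(words[1:]))


# The usual three-part case.
print(split_command_group("GET /index.html HTTP/1.1"))
# The protocol is missing.
print(split_command_group("GET /index.html"))
# The URL contains an embedded, unescaped space.
print(split_command_group("GET /my file.html HTTP/1.0"))
```

All three calls return the command `GET`; the third keeps the embedded space, giving the raw URL `/my file.html`.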
# - - - s c a n C m d G r o u p - - -
def scanCmdGroup ( cmdGroup ):
    """Extract the command and URL from the command group

      [ if cmdGroup is a string ->
          if cmdGroup is a valid command group ->
            return (command from cmdGroup, URL from cmdGroup)
          else ->
            raise ValueError ]
    """
First we use .split() to break the command group on spaces.
    #-- 1 --
    # [ wordList := cmdGroup broken up on space characters ]
    wordList = cmdGroup.split ( ' ' )
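The explicit `' '` argument matters here. With no argument, Python's `str.split()` treats any run of whitespace as a single separator, so consecutive spaces inside a URL could not be restored when the pieces are rejoined. The following sketch compares the two behaviors; the sample line is made up for illustration.

```python
line = "GET /a  b.html HTTP/1.1"   # two spaces inside the URL

explicit = line.split(' ')   # one split per space; keeps empty strings
default = line.split()       # runs of whitespace collapse to one separator

# Rejoin everything between the command and the protocol.
url_explicit = " ".join(explicit[1:-1])
url_default = " ".join(default[1:-1])

print(explicit)            # ['GET', '/a', '', 'b.html', 'HTTP/1.1']
print(default)             # ['GET', '/a', 'b.html', 'HTTP/1.1']
print(repr(url_explicit))  # '/a  b.html'  (both spaces preserved)
print(repr(url_default))   # '/a b.html'   (one space lost)
```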
Next, we remove the protocol element from the end, if there is one.
    #-- 2 --
    # [ if wordList[-1] starts with "HTTP" ->
    #     wordList := wordList without its last element
    #   else -> I ]
    if wordList[-1].startswith ( "HTTP" ):
        wordList.pop()
Finally, we reassemble the URL part and return a tuple containing the command and the URL. We use cleanURL() (see Section 51.12, “cleanURL(): Process the raw URL”) to remove URL-encoding and do other cleanup tasks.
In January 2011, some log entries turned up that had null bytes in them. This crashed the script because the etbuilder.py module translates strings to Unicode, and null bytes cannot be Unicode-encoded. Hence the test below that discards such entries.
In June 2011, some log entries came through that had command group "GET  HTTP/1.1", with two spaces after GET; the URL was completely missing. Hence the test for that below as well.
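To see why the June 2011 entries end up with an empty URL, it may help to trace the split by hand. The sketch below replays the first two steps on that command group and then shows the kind of membership test that detects a null byte; the null-containing string is an invented sample, not a real log entry.

```python
# Two spaces after GET: splitting on ' ' leaves an empty string
# where the URL should have been.
words = "GET  HTTP/1.1".split(' ')
print(words)                 # ['GET', '', 'HTTP/1.1']

# Discard the protocol element, as in step 2.
if words[-1].startswith("HTTP"):
    words.pop()

# Rejoining the single empty string gives an empty URL.
url = " ".join(words[1:])
print(repr(url), len(url))   # '' 0

# A null byte is found with an ordinary substring test.
has_null = '\x00' in "/index\x00.html"
print(has_null)              # True
```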
    #-- 3 --
    # [ if url is empty ->
    #     raise ValueError
    #   else ->
    #     decodedURL := url, with URL-encoding decoded, minus any
    #                   "?..." tail ]
    url = " ".join ( wordList[1:] )
    if len(url) == 0:
        raise ValueError("The URL is missing: '%s'" % cmdGroup)
    else:
        decodedURL = cleanURL(url)
    #-- 4 --
    # [ if decodedURL contains any null characters ->
    #     raise ValueError
    #   else ->
    #     return (first element of wordList, decodedURL) ]
    if '\x00' in decodedURL:
        raise ValueError("Nulls disallowed: %r" % url)
    else:
        return (wordList[0], decodedURL)
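For experimenting with the function outside the full script, the sketch below repeats its logic next to a stand-in for cleanURL(). The stand-in is an assumption, not the routine from Section 51.12: it only drops any "?..." tail and decodes URL-encoding with the standard library's `urllib.parse.unquote` (Python 3), and the real function may do more or work in a different order. The sample command groups are invented.

```python
from urllib.parse import unquote


def cleanURL(rawURL):
    """Hypothetical stand-in for the real cleanURL() of Section 51.12."""
    # Assumption: drop any "?..." tail, then decode %XX escapes.
    return unquote(rawURL.split('?', 1)[0])


def scanCmdGroup(cmdGroup):
    """Return (command, decoded URL), or raise ValueError."""
    wordList = cmdGroup.split(' ')
    if wordList[-1].startswith("HTTP"):
        wordList.pop()
    url = " ".join(wordList[1:])
    if len(url) == 0:
        raise ValueError("The URL is missing: '%s'" % cmdGroup)
    decodedURL = cleanURL(url)
    if '\x00' in decodedURL:
        raise ValueError("Nulls disallowed: %r" % url)
    return (wordList[0], decodedURL)


samples = [
    "GET /a%20b.html?x=1 HTTP/1.1",   # encoded space plus a query tail
    "POST /form handler",             # embedded space, no protocol
    "GET  HTTP/1.1",                  # URL missing entirely
    "GET /a%00b HTTP/1.1",            # decodes to a null byte
]
for sample in samples:
    try:
        print(scanCmdGroup(sample))
    except ValueError as detail:
        print("rejected:", detail)
```

The first two samples yield a (command, URL) tuple; the last two are rejected with ValueError, which is the behavior a caller would need to catch when scanning a whole log.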