As described in the previous section, this function breaks the access log line into its four major parts using a mixed-strategy parse.
The regular expression that breaks off the first two groups
is declared globally as a compiled regular expression
object, FRONT_RE. Python's policy
of merging syntactically adjacent string constants is
convenient here, because it lets us document each piece of the
expression on its own line. Python raw strings (r'...') let us write
backslashes without having to escape them. The constants whose
names end with _GROUP are the symbolic group names
used in the regular expression; the % operator substitutes them
into the merged pattern string.
# - - - s c a n G r o u p s - - -
ACCESSOR_GROUP = "a"            # Group ID for accessor group
DATE_GROUP = "d"                # Group ID for date group
FRONT_RE = re.compile (         # Matches first two groups
    r'^'                        # Start-of-line anchor
    r'(?P<%s>'                  # Start ACCESSOR_GROUP
    r'[^\[]+'                   # Everything up to the next '['
    r')'                        # End ACCESSOR_GROUP
    r'\['                       # Open bracket for the date group
    r'(?P<%s>'                  # Start DATE_GROUP
    r'[^\]]+'                   # Everything up to the next ']'
    r')'                        # End DATE_GROUP
    r'\] ' %                    # Trailing bracket and space
    (ACCESSOR_GROUP, DATE_GROUP) )
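To make the interaction between literal merging and the % operator concrete, here is a minimal sketch. It assumes only the two group-name constants shown above; the variable name pattern is introduced here for illustration and is not part of the original program. Because adjacent literals are merged when the source is compiled, the single % at the end formats the whole pattern, not just the last fragment.

```python
import re

ACCESSOR_GROUP = "a"
DATE_GROUP = "d"

# The adjacent raw-string literals are merged into one string first;
# only then is % applied, filling in both %s placeholders.
pattern = (
    r'^'
    r'(?P<%s>[^\[]+)'
    r'\['
    r'(?P<%s>[^\]]+)'
    r'\] ' % (ACCESSOR_GROUP, DATE_GROUP))

FRONT_RE = re.compile(pattern)

print(pattern)                    # the fully substituted pattern
print(dict(FRONT_RE.groupindex))  # group names mapped to group numbers
```

Printing the pattern is a quick way to check that both group names landed where intended before the expression is used on real data.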
Next, we have the actual function.
def scanGroups ( rawLine ):
    """Break an access_log line down into its major groups.

      [ if rawLine is a valid access_log line at the group level ->
          return (accessor group, date group, cmd group, tail group)
          as a sequence of strings
        else ->
          raise ValueError ]
    """
We start by applying the FRONT_RE
regular expression to the entire line, to break off the
first two groups. In the match
object m, method .group(N)
returns the text that matched the group with name N,
and method .end() returns the
position of the character after the match.
    #-- 1 --
    # [ if rawLine starts with a pattern that matches FRONT_RE ->
    #     accessorGroup  :=  group ACCESSOR_GROUP from the match
    #     dateGroup  :=  group DATE_GROUP from the match
    #     rest  :=  rawLine beyond the match
    #   else -> raise ValueError ]
    m = FRONT_RE.match ( rawLine )
    if m is None:
        raise ValueError ( "access_log group syntax" )
    else:
        accessorGroup = m.group ( ACCESSOR_GROUP )
        dateGroup = m.group ( DATE_GROUP )
        rest = rawLine[m.end():]
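As a rough illustration of step 1, the sketch below applies the same pattern to a sample line. The sample line is an assumption (a typical Common Log Format entry, not taken from the original text), and the pattern is written on one line only to keep the example self-contained.

```python
import re

# Same pattern as FRONT_RE, written on one line for brevity.
FRONT_RE = re.compile(r'^(?P<a>[^\[]+)\[(?P<d>[^\]]+)\] ')

# Illustrative access_log line; an assumption, not from the original text.
sampleLine = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
              '"GET /apache_pb.gif HTTP/1.0" 200 2326')

m = FRONT_RE.match(sampleLine)
accessorGroup = m.group("a")    # everything before '[', trailing space included
dateGroup = m.group("d")        # the text between '[' and ']'
rest = sampleLine[m.end():]     # everything after the '] ' that ends the match

# A line with no bracketed date does not match at all.
noMatch = FRONT_RE.match("a line with no bracketed date")

print(repr(accessorGroup))
print(repr(dateGroup))
print(repr(rest))
print(noMatch)
```

Note that the accessor group keeps its trailing space, since the character class consumes everything up to the opening bracket.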
Next, the command group is removed using scanQuoted, a special function that reads a quoted string, allowing for escaped quotes.
    #-- 2 --
    # [ if rest starts with a double-quoted string, possibly
    #   including escaped double-quote characters ->
    #     commandGroup  :=  contents of that string (with escaped
    #                       quotes unescaped)
    #     tailGroup  :=  rest, past that string
    #   else -> raise ValueError ]
    commandGroup, tailGroup = scanQuoted ( rest )
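The scanQuoted function itself is not shown in this section. The sketch below is only a guess at one way to satisfy the contract stated in the comment above; the name scanQuotedSketch is hypothetical, and it assumes that an escaped quote is written as a backslash followed by a double quote.

```python
def scanQuotedSketch(text):
    """Hypothetical stand-in for scanQuoted, under the assumptions above.

    Returns (contents, tail), where contents is the quoted string with
    backslash-escaped double quotes unescaped, and tail is whatever
    follows the closing quote. Raises ValueError on malformed input.
    """
    if not text.startswith('"'):
        raise ValueError("expected an opening double quote")
    pieces = []
    pos = 1                         # skip the opening quote
    while pos < len(text):
        ch = text[pos]
        if ch == '\\' and text[pos + 1:pos + 2] == '"':
            pieces.append('"')      # escaped quote: keep the quote only
            pos += 2
        elif ch == '"':
            # Unescaped quote: end of the string.
            return "".join(pieces), text[pos + 1:]
        else:
            pieces.append(ch)
            pos += 1
    raise ValueError("unterminated quoted string")


print(scanQuotedSketch('"GET /x HTTP/1.0" 200 12'))
print(scanQuotedSketch(r'"say \"hi\"" tail'))
```

A character-by-character scan is used here instead of a regular expression because it makes the handling of escapes explicit; the real function may well be written differently.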
All that remains is to return the sequence of the four group contents.
    #-- 3 --
    return (accessorGroup, dateGroup, commandGroup, tailGroup)