
51.6. scanGroups(): Top-level disassembly of the log line

As described in the previous section, this function breaks the access log line into its four major parts using a mixed-strategy parse.

The regular expression that breaks off the first two groups is compiled once, at module level, as the regular expression object FRONT_RE. Python merges adjacent string literals into a single string at compile time, which is convenient here: each piece of the expression can sit on its own line with a comment documenting it. Because the merge happens before the % operator is applied, the group names are substituted into the whole pattern. Python raw strings (r'...') allow us to use backslashes without having to escape them. The constants whose names end with _GROUP are symbolic group names used in the regular expression.

pageget.py
# - - -   s c a n G r o u p s   - - -

ACCESSOR_GROUP  =  "a"              # Group ID for accessor group
DATE_GROUP      =  "d"              # Group ID for date group
FRONT_RE  =  re.compile (           # Matches first two groups
    r'^'                  # Start-of-line anchor
    r'(?P<%s>'            # Start ACCESSOR_GROUP
      r'[^\[]+'           #   Everything up to the next '['
    r')'                  # End ACCESSOR_GROUP
    r'\['                 # Open bracket for the date group
    r'(?P<%s>'            # Start DATE_GROUP
      r'[^\]]+'           #   Everything up to the next ']'
    r')'                  # End DATE_GROUP
    r'\] ' %              # Trailing bracket and space
    (ACCESSOR_GROUP, DATE_GROUP) )
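
As a quick check of how the pieces combine, the following sketch builds the same pattern as a plain string, with the per-line comments collapsed, and prints it. The name pattern is introduced here only for illustration and is not part of pageget.py.

```python
import re

ACCESSOR_GROUP = "a"
DATE_GROUP = "d"

# Adjacent literals are merged at compile time, so the % operator
# formats the whole concatenated pattern, not just the last piece.
pattern = (
    r'^'
    r'(?P<%s>[^\[]+)'
    r'\['
    r'(?P<%s>[^\]]+)'
    r'\] ' % (ACCESSOR_GROUP, DATE_GROUP))

print(pattern)

# The merged string compiles to a pattern with two named groups.
compiled = re.compile(pattern)
print(sorted(compiled.groupindex))
```

The printed pattern contains (?P<a>...) and (?P<d>...), showing that both %s placeholders were filled even though they sit in different literals.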

Next, we have the actual function.

pageget.py
def scanGroups ( rawLine ):
    """Break an access_log line down into its major groups.

      [ if rawLine is a valid access_log line at the group level ->
          return (accessor group, date group, cmd group, tail group)
          as a sequence of strings
        else ->
          raise ValueError ]
    """

We start by applying the FRONT_RE regular expression to the start of the line, to break off the first two groups. In the match object m, method .group(N) returns the text matched by the group named N, and method .end() returns the index of the first character after the match.

pageget.py
    #-- 1 --
    # [ if rawLine starts with a pattern that matches FRONT_RE ->
    #     accessorGroup  :=  group ACCESSOR_GROUP from the match
    #     dateGroup      :=  group DATE_GROUP from the match
    #     rest           :=  rawLine beyond the match
    #   else -> raise ValueError ]
    m  =  FRONT_RE.match ( rawLine )
    if  m is None:
        raise ValueError ( "access_log group syntax" )
    else:
        accessorGroup  =  m.group ( ACCESSOR_GROUP )
        dateGroup      =  m.group ( DATE_GROUP )
        rest           =  rawLine[m.end():]
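
To see what this step produces, here is a small sketch that applies an equivalent pattern to a made-up log line. The sample line and the variable names are assumptions for illustration; real access_log lines may differ in detail.

```python
import re

# Same pattern as FRONT_RE, written out with the group names filled in.
FRONT_RE = re.compile(r'^(?P<a>[^\[]+)\[(?P<d>[^\]]+)\] ')

# Made-up sample line in the usual access_log layout.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326')

m = FRONT_RE.match(sample)
accessor = m.group("a")     # Text before the '[', trailing space included
date = m.group("d")         # Text between the brackets
rest = sample[m.end():]     # Everything after '] '

print(repr(accessor))
print(repr(date))
print(repr(rest))
```

Note that the accessor group keeps its trailing space, since the pattern takes everything up to the opening bracket.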

Next, the command group is removed using a special function that reads a quoted string, allowing for escaped quotes.

pageget.py
    #-- 2 --
    # [ if rest starts with a double-quoted string, possibly
    #   including escaped double-quote characters ->
    #     commandGroup  :=  contents of that string (with escaped
    #                       quotes unescaped)
    #     tailGroup     :=  rest, past that string
    #   else -> raise ValueError ]
    commandGroup, tailGroup  =  scanQuoted ( rest )

All that remains is to return the sequence of the four group contents.

pageget.py
    #-- 3 --
    return (accessorGroup, dateGroup, commandGroup, tailGroup)
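
The whole flow can be tried end to end with the sketch below. It is not the pageget.py code: scan_quoted_sketch is a hypothetical stand-in for scanQuoted, written on the assumption that an embedded quote is escaped with a backslash, and scan_groups_sketch simply mirrors the three steps above.

```python
import re

FRONT_RE = re.compile(r'^(?P<a>[^\[]+)\[(?P<d>[^\]]+)\] ')

def scan_quoted_sketch(text):
    """Hypothetical stand-in for scanQuoted.

    Returns (contents, remainder); assumes quotes are escaped as \\".
    """
    if not text.startswith('"'):
        raise ValueError("expected opening double quote")
    chars = []
    pos = 1
    while pos < len(text):
        c = text[pos]
        if c == '\\' and text[pos + 1:pos + 2] == '"':
            chars.append('"')           # Unescape \" to "
            pos += 2
        elif c == '"':
            # Closing quote: return contents and what follows it
            return ''.join(chars), text[pos + 1:]
        else:
            chars.append(c)
            pos += 1
    raise ValueError("unterminated quoted string")

def scan_groups_sketch(raw_line):
    """Mirror of steps 1-3: regex for the front, scanner for the rest."""
    m = FRONT_RE.match(raw_line)
    if m is None:
        raise ValueError("access_log group syntax")
    command, tail = scan_quoted_sketch(raw_line[m.end():])
    return (m.group("a"), m.group("d"), command, tail)

# Made-up line whose request contains an escaped quote.
line = '10.0.0.1 - - [01/Jan/2001:00:00:00 +0000] "GET /a\\"b HTTP/1.0" 200 17'
groups = scan_groups_sketch(line)
print(groups)
```

The tail group starts with the space that follows the closing quote, so later stages would need to strip or skip it.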