51.4. scanAccessLine(): Process one access log line

This function attempts to process one access log line. Because one line may represent multiple accesses, for a valid line it returns a list of PageGet objects, one per access. The list is empty when the line is valid but does not describe a page fetch.

If the line isn't valid, the function writes an error message to stderr and raises SyntaxError.

pageget.py
# - - -   s c a n A c c e s s L i n e   - - -

def scanAccessLine ( rawLine ):
    """Process one access_log line; returns zero or more PageGets.

      [ if rawLine is a valid Apache access log line ->
          return a list of zero or more PageGet instances
          representing the accesses described by rawLine
        else ->
          sys.stderr  +:=  error message
          raise SyntaxError ]
    """
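
The contract in the docstring above suggests how a caller might drive this function over a whole log. The following is a minimal sketch, not part of pageget.py: the names scanAllLines and fakeScan are hypothetical, and the scanner is passed in as a parameter so the loop can be exercised without the real scanAccessLine().

```python
def scanAllLines(lines, scanLine):
    """Collect the PageGets from every valid line; count the rejects.

    scanLine is any callable with scanAccessLine()'s contract:
    it returns a list for a valid line and raises SyntaxError
    for an invalid one.
    """
    pageGets = []
    badCount = 0
    for rawLine in lines:
        try:
            pageGets.extend(scanLine(rawLine))
        except SyntaxError:
            # The scanner has already reported the problem to stderr.
            badCount += 1
    return pageGets, badCount


def fakeScan(rawLine):
    """Hypothetical stand-in scanner, used only to exercise the loop."""
    if not rawLine.strip():
        raise SyntaxError
    return rawLine.split()
```

With the stand-in scanner, scanAllLines(["a b", "", "c"], fakeScan) collects three items and counts one rejected line.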

The basic disassembly of the log line into its major parts is done by the scanGroups() function.

pageget.py
    #-- 1 --
    # [ if rawLine looks like a valid access_log line at the group level ->
    #     accessGroup  :=  contents of accessor group as a string
    #     dateGroup    :=  contents of date group as a string
    #     cmdGroup     :=  contents of command group as a string
    #     tailGroup    :=  contents of tail group as a string
    #   else ->
    #     sys.stderr  +:=  error message
    #     raise SyntaxError ]
    try:
        (accessGroup, dateGroup, cmdGroup, tailGroup)  =  scanGroups (
            rawLine )
    except ValueError as detail:
        error ( "Group syntax error, %s: %s\n" % (detail, rawLine) )
        raise SyntaxError
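
The scanGroups() function itself is defined elsewhere. As a rough illustration of what group-level disassembly can look like, here is a sketch that assumes the usual Apache layout of accessor fields, a bracketed date, a quoted command, and a tail. The name splitGroups and the regular expression are assumptions made for this example, not the author's implementation.

```python
import re

# Assumed layout: accessor fields, [date], "command", then the rest.
GROUP_PAT = re.compile(
    r'^(.*?)\s+'         # accessor group: everything before the date
    r'\[([^\]]*)\]\s+'   # date group: contents of the square brackets
    r'"([^"]*)"\s+'      # command group: contents of the first quoted string
    r'(.*)$'             # tail group: whatever remains
)


def splitGroups(rawLine):
    """Return (accessGroup, dateGroup, cmdGroup, tailGroup) as strings.

    Raises ValueError if the line does not have the assumed shape,
    matching the exception that the caller above catches.
    """
    m = GROUP_PAT.match(rawLine.strip())
    if m is None:
        raise ValueError("line does not have four groups")
    return m.groups()
```

Raising ValueError here, rather than printing, leaves the choice of error message to the caller, which is the division of labor the step above relies on.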

Then we dispatch each of the four major pieces to its own function, which checks and processes that piece. For the definition of “effective host list,” see the specification.

pageget.py
    #-- 2 --
    # [ if accessGroup is a valid host-group ->
    #     accessorList := effective host list from accessGroup
    #     username := username from accessGroup, or "-" if none
    #   else ->
    #     sys.stderr +:= error message
    #     raise SyntaxError ]
    try:
        accessorList, username = scanAccessGroup ( accessGroup )
    except ValueError as result:
        error ( "Host group error, %s: %s" % (result, rawLine) )
        raise SyntaxError

    #-- 3 --
    # [ if dateGroup is a valid date/time ->
    #     when  :=  that date as a datetime.datetime instance
    #   else ->
    #     sys.stderr  +:=  error message
    #     raise SyntaxError ]
    try:
        when  =  scanDateGroup ( dateGroup )
    except ValueError as result:
        error ( "Date group error, %s: %s" % (result, rawLine) )
        raise SyntaxError
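
For the date group, the standard library can do most of the work. The sketch below is an illustration only: parseDateGroup is a hypothetical name, the format string assumes Apache's usual day/month/year:hour:minute:second layout, and the time zone offset is simply ignored, which may differ from what scanDateGroup() does.

```python
from datetime import datetime


def parseDateGroup(dateGroup):
    """Convert text like '10/Oct/2000:13:55:36 -0700' to a datetime.

    Assumption: the zone offset after the space is discarded, so the
    result is a naive datetime in the server's local time.
    Raises ValueError on malformed input.
    """
    parts = dateGroup.split()
    if not parts:
        raise ValueError("empty date group")
    # strptime() itself raises ValueError if the stamp does not match.
    return datetime.strptime(parts[0], "%d/%b/%Y:%H:%M:%S")
```

Because strptime() already raises ValueError for bad input, a function of this shape fits the try/except pattern used in step 3 without extra checking code.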

This next step handles two special cases. First, the command may be something other than GET or POST, such as OPTIONS, HEAD, or PROPFIND. Those are not page fetches, so we ignore them and return an empty list. Second, a URL that begins with “//” has its first slash removed, so an access to “//” is treated as an access to “/”.

pageget.py
    #-- 4 --
    # [ if cmdGroup is not a valid command group ->
    #     sys.stderr  +:=  error message
    #     raise SyntaxError
    #   else if the command in cmdGroup is not "GET" or "POST" ->
    #     return an empty list
    #   else ->
    #     command  :=  command from cmdGroup
    #     url      :=  URL from cmdGroup, less one leading "/" if
    #                  it starts with "//" ]
    try:
        command, url  =  scanCmdGroup ( cmdGroup )
        if command not in ("GET", "POST"):
            return []
        if url.startswith ('//'):
            url = url[1:]
    except ValueError as result:
        error ( "Command group error, %s: %s" %
                (result, rawLine) )
        raise SyntaxError

    #-- 5 --
    # [ if tailGroup is a valid tail group ->
    #     status    :=  status from tailGroup
    #   else ->
    #     sys.stderr  +:=  error message
    #     raise SyntaxError ]
    try:
        status  =  scanTailGroup ( tailGroup )
    except ValueError as result:
        error ( "Tail group error, %s: %s" % ( result, rawLine ) )
        raise SyntaxError
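
The two special cases in step 4 are easy to check in isolation. The helpers below are hypothetical and are not part of pageget.py. splitCommand is a guess at the simplest possible command-group split, while isPageFetch and normalizeUrl restate the two tests from the code above.

```python
def splitCommand(cmdGroup):
    """Split 'METHOD URL [PROTOCOL]' into (command, url).

    Assumption: fields are separated by whitespace. Raises ValueError
    if there are fewer than two fields.
    """
    parts = cmdGroup.split()
    if len(parts) < 2:
        raise ValueError("expected 'METHOD URL [PROTOCOL]'")
    return parts[0], parts[1]


def isPageFetch(command):
    """True only for the commands that step 4 treats as page fetches."""
    return command in ("GET", "POST")


def normalizeUrl(url):
    """Drop one leading slash from a URL that starts with '//'."""
    if url.startswith("//"):
        return url[1:]
    return url
```

Note that only one slash is removed: a URL such as “///x” becomes “//x”, not “/x”.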

At this point we have parsed the whole record, and accessorList contains a list of the accessors. We return a list of PageGet objects, one for each element of accessorList, using a Python list comprehension.

pageget.py
    #-- 6 --
    # [ return a list of PageGet objects, one for each accessor in
    #   accessorList, using (username, when, command, url, status)
    #   for the other values ]
    return [ PageGet ( a, username, when, command, url, status)
             for a in accessorList ]
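
To see the fan-out from one log line to several objects, here is a small sketch that uses a namedtuple as a stand-in for the real PageGet class. The stand-in's field names, the host names, and the integer status are all assumptions made for this example.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for PageGet; the real class is defined elsewhere.
FakePageGet = namedtuple(
    "FakePageGet", "accessor username when command url status")

# Values as they might stand after steps 1-5 for a line with two accessors.
accessorList = ["proxy.example.com", "client.example.com"]
username = "-"
when = datetime(2000, 10, 10, 13, 55, 36)
command, url, status = "GET", "/index.html", 200

# Same shape as step 6: one object per accessor, other fields shared.
gets = [FakePageGet(a, username, when, command, url, status)
        for a in accessorList]
```

Each element of gets differs only in its accessor field. An empty accessorList would give an empty result with no special-case code.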