
51.3. scanAccessLog(): Scan an access log file

This function processes an access log file, generating a PageGet instance for every access.

pageget.py
# - - -   s c a n A c c e s s L o g   - - -

def scanAccessLog ( logFile ):
    """Read an Apache access_log, generating a stream of PageGet objects.

      [ if logFile is a readable file handle ->
          logFile  :=  logFile advanced to end of file
          sys.stderr  +:=  messages about lines from logFile that
              aren't valid access_log lines, if any
          generate a sequence of PageGet objects representing the
              valid access_log lines from logFile ]
    """

There shouldn't be any invalid lines in the access log, but there often are. We'll count them, and if the count is nonzero, we'll send a message to stderr.

pageget.py
    #-- 1 --
    errCount  =  0

Loop over the lines in the file, generating PageGet objects for accesses from valid lines.

pageget.py
    #-- 2 --
    for rawLine in logFile:
        #-- 2 body --
        # [ if rawLine is a valid access_log line ->
        #     yield a sequence of PageGet objects representing that line
        #   else ->
        #     errCount    +:=  1 ]

Because one line in the access log may represent multiple accesses to a page, the scanAccessLine() function returns a list of PageGet objects, not just a single object.

If the line isn't valid, the error is logged directly to stderr and scanAccessLine() raises SyntaxError. We catch that exception, count the line as an error, and substitute an empty list so that nothing is generated for it.

pageget.py
        #-- 2.1 --
        # [ if rawLine is a valid access_log line ->
        #     getList  :=  a list of one or more PageGet objects
        #                  representing that line
        #   else ->
        #     getList  :=  an empty list
        #     sys.stderr  +:=  error message ]
        try:
            getList  =  scanAccessLine ( rawLine )
        except SyntaxError:
            errCount  +=  1
            getList  =  []

        #-- 2.2 --
        # [ generate the elements of getList ]
        for get in getList:
            yield get
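
The shape of this loop can be tried in isolation with stand-ins. In the sketch below, FakeGet and fakeScanLine() are hypothetical placeholders for PageGet and scanAccessLine(), not the real definitions from pageget.py. The stand-in assumes a line is valid only if it has exactly two whitespace-separated fields, which is far simpler than a real access_log line.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for PageGet; the real class is defined in pageget.py.
FakeGet = namedtuple("FakeGet", ["host", "path"])

def fakeScanLine(rawLine):
    """Hypothetical stand-in for scanAccessLine().

    Returns a one-element list for a valid line and raises
    SyntaxError for anything else.
    """
    fields = rawLine.split()
    if len(fields) != 2:
        raise SyntaxError("not a valid line: %r" % rawLine)
    return [FakeGet(fields[0], fields[1])]

def scanLines(logFile):
    """Same loop shape as steps 2.1 and 2.2, minus the error count."""
    for rawLine in logFile:
        try:
            getList = fakeScanLine(rawLine)
        except SyntaxError:
            # An invalid line contributes nothing to the output.
            getList = []
        for get in getList:
            yield get

# io.StringIO behaves like a readable file handle for this purpose.
sample = io.StringIO("10.0.0.1 /index.html\ngarbage\n10.0.0.2 /about.html\n")
results = list(scanLines(sample))
```

The invalid middle line is skipped silently here; in the real function it also increments errCount.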

Finally, check for errors and write a summary of the count if there were any.

pageget.py
    #-- 3 --
    # [ if errCount > 0 ->
    #     sys.stderr  +:=  message about (errCount) errors
    #   else -> I ]
    if  errCount > 0:
        error ( "Count of unrecognizable access_log lines: %d" %
                errCount )
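
Because scanAccessLog() is a generator, step 3 runs only after the caller has consumed the whole sequence; a caller that stops early never sees the summary. The sketch below illustrates this with a hypothetical miniature, countingScan(), which treats blank lines as errors and appends its summary to a list instead of calling error(). Neither the name nor the blank-line rule comes from pageget.py.

```python
import io

def countingScan(logFile, report):
    """Hypothetical miniature of scanAccessLog().

    Yields each non-blank line, counts blank lines as errors, and
    appends a summary to `report` once the input is exhausted.
    """
    errCount = 0
    for rawLine in logFile:
        line = rawLine.strip()
        if line:
            yield line
        else:
            errCount += 1
    # Reached only when the loop above has run to completion.
    if errCount > 0:
        report.append("bad lines: %d" % errCount)

text = "a\n\nb\n"

# Full consumption: the summary is produced.
fullReport = []
fullItems = list(countingScan(io.StringIO(text), fullReport))

# Early exit: the generator is closed after its first item,
# so the code after the loop never runs.
partialReport = []
gen = countingScan(io.StringIO(text), partialReport)
firstItem = next(gen)
gen.close()
```

A caller that needs the error summary should therefore iterate until the generator is exhausted.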