This function processes an access log file, generating a PageGet
instance for every access.
# - - - s c a n A c c e s s L o g - - -
def scanAccessLog ( logFile ):
    """Read an Apache access_log, generating a stream of PageGet objects.

      [ if logFile is a readable file handle ->
          logFile := logFile advanced to end of file
          sys.stderr +:= messages about lines from logFile that
              aren't valid access_log lines, if any
          generate a sequence of PageGet objects representing the
              valid access_log lines from logFile ]
    """
There shouldn't be any invalid lines in the access log, but
there often are. We'll count them, and if the count is
nonzero, we'll send a message to stderr.
    #-- 1 --
    errCount = 0
Loop over the lines in the file, generating PageGet objects for accesses from valid lines.
    #-- 2 --
    for rawLine in logFile:
        #-- 2 body --
        # [ if rawLine is a valid access_log line ->
        #     yield a sequence of PageGet objects representing that line
        #   else ->
        #     errCount +:= 1 ]
Because one line in the access log may represent multiple
accesses to a page, the scanAccessLine() function returns a list of
PageGet objects, not just a single object.
If the line isn't valid, scanAccessLine() logs the error directly to
stderr and raises SyntaxError. We catch that exception, count the
error, and continue with an empty list so that nothing is generated
for that line.
        #-- 2.1 --
        # [ if rawLine is a valid access_log line ->
        #     getList := a list of one or more PageGet objects
        #                representing that line
        #   else ->
        #     getList := an empty list
        #     sys.stderr +:= error message ]
        try:
            getList = scanAccessLine ( rawLine )
        except SyntaxError:
            errCount += 1
            getList = []

        #-- 2.2 --
        # [ generate the elements of getList ]
        for get in getList:
            yield get
Finally, check for errors and write a summary of the count if there were any.
    #-- 3 --
    # [ if errCount > 0 ->
    #     sys.stderr +:= message about (errCount) errors
    #   else -> I ]
    if errCount > 0:
        error ( "Count of unrecognizable access_log lines: %d" %
                errCount )
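
The control flow of the whole function can be tried in isolation with a small, self-contained sketch. Everything other than the body of scanAccessLog() is an assumption made for illustration: the PageGet namedtuple, the simplified "host path [path ...]" line format, and the stand-in scanAccessLine() are hypothetical and are not the real class, the real Apache log format, or the real parser. The call to error() is also replaced by a direct write to sys.stderr.

```python
import io
import sys
from collections import namedtuple

# Hypothetical stand-in for the real PageGet class.
PageGet = namedtuple("PageGet", ["host", "path"])

def scanAccessLine(rawLine):
    """Hypothetical parser for a simplified 'host path [path ...]' line.

    Returns a list of one or more PageGet objects. On a malformed
    line, writes a message to stderr and raises SyntaxError.
    """
    fields = rawLine.split()
    if len(fields) < 2:
        sys.stderr.write("Invalid line: %r\n" % rawLine)
        raise SyntaxError("invalid access_log line")
    host = fields[0]
    return [PageGet(host, path) for path in fields[1:]]

def scanAccessLog(logFile):
    """Generate PageGet objects for the valid lines of logFile."""
    errCount = 0
    for rawLine in logFile:
        try:
            getList = scanAccessLine(rawLine)
        except SyntaxError:
            # Count the bad line and generate nothing for it.
            errCount += 1
            getList = []
        for get in getList:
            yield get
    # Reached only when the caller has consumed every line.
    if errCount > 0:
        sys.stderr.write(
            "Count of unrecognizable access_log lines: %d\n" % errCount)

# Two valid lines (three accesses in total) and one invalid line.
sample = io.StringIO(
    "10.0.0.1 /index.html /about.html\n"
    "garbage\n"
    "10.0.0.2 /faq.html\n")
gets = list(scanAccessLog(sample))
```

Because scanAccessLog() is a generator, the summary message is written only after the caller has exhausted it. A caller that stops iterating early never reaches step 3 and so never sees the count.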