All the machinery that deals with the Apache access log
file (access_log) lives in module pageget.py.
This module contains two items:
An instance of class PageGet
represents one page access. Note that a single
access log entry may describe more than one page access.
Function scanAccessLog() is a
generator that reads an access log file and generates a
stream of PageGet objects.
Here are the opening declarations to the pageget.py module.
"""pageget.py: Functions for the Apache access_log recording page fetches. """
The standard sys library gives us access
to the standard I/O streams; see
http://docs.python.org/library/sys.html.
#================================================================ # IMPORTS #---------------------------------------------------------------- import sys
We'll need the os module to remove redundant
“/” and
“..” elements from
URL paths.
import os
We use the Python regular expression package re to break down the log lines into
their component parts.
import re
The urllib module provides
functions to handle URL encoding and decoding.
import urllib
The datetime module provides
clock and calendar functions, so we can include only
records from a given interval.
import datetime