The script is written in the Python language. For more information about Python, see Python 2.2 quick reference.
These source files comprise the webstats.py application:
webstats.py is the
main script.
The pageget.py
module encapsulates the logic for scanning Apache's
access log files. See Section 51, “The pageget.py module: Apache
log file functions”.
The pageget.py module defines a
class PageGet, each instance of which
represents one line in the Apache server's access_log file.
What we need to know from the access logs boils down to
three pieces of data for each page accessed in the time
period of interest. We'll encapsulate these three items as
instances of class HitCount:
.nTotal: The total number of hits on
the page.
.nFar: The number of hits from
off-campus, so that we can compute the percentage of
off-campus hits.
.url: The page's URL, minus the
“http://infohost.nmt.edu” prefix.
The HitCount class will include a .__cmp__() method that orders instances in
hit-parade order: descending order by their .nTotal attribute, with their .url attribute in ascending order as a secondary key.
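The ordering just described can be sketched as follows. This is an illustration, not the script's actual code: the original targets Python 2 and relies on .__cmp__(), which Python 3 no longer consults, so the sketch assumes Python 3 and expresses the same ordering through __lt__ and __eq__ with functools.total_ordering. The constructor signature and the _sort_key() helper are hypothetical.

```python
from functools import total_ordering


@total_ordering
class HitCount:
    """Hit statistics for one URL (illustrative sketch only)."""

    def __init__(self, url, nTotal=0, nFar=0):
        self.url = url        # URL minus the site prefix
        self.nTotal = nTotal  # total hits on the page
        self.nFar = nFar      # hits from off-campus

    def _sort_key(self):
        # Hypothetical helper: negating nTotal yields descending order
        # by hit count; url breaks ties in ascending order.
        return (-self.nTotal, self.url)

    def __eq__(self, other):
        return self._sort_key() == other._sort_key()

    def __lt__(self, other):
        return self._sort_key() < other._sort_key()


hits = [
    HitCount("/b.html", 10, 2),
    HitCount("/", 50, 20),
    HitCount("/a.html", 10, 5),
]
parade = [h.url for h in sorted(hits)]
# parade == ["/", "/a.html", "/b.html"]
```

The two pages with ten hits each fall back to the secondary key, so "/a.html" precedes "/b.html".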
We'll need a container class to hold the HitCount instances for each URL accessed in the
report period. The structure and function of this
container is driven by the needs of the script:
In order to generate the overall summary block on the root page, we'll need to remember the range of timestamps included in the report, and the total number of on- and off-campus hits.
For the access report on the NMT homepage, we'll need
a HitCount instance for URL
“/”.
For the hit parade, we'll need to visit the HitCount instances for every URL accessed in
the report period, sorted as defined by the HitCount.__cmp__() method: in descending
order by hit count, with URL as a secondary key.
To generate the tables of personal and official pages by their first character, we'll need to interrogate the first character values that occur in both those categories.
To generate the pages showing all the personal and official directories for a given first character, we'll need to interrogate the list of all personal and official URLs that start with a given character.
To generate the access report pages for a single
person or official directory, we'll need to
interrogate the list of all URLs that start with a
given first component (either “/~login/”
or “directory/”, where login and directory stand for the actual names).
Finally, to produce the detail lines on the
individual access report pages, we'll need to be able
to retrieve the HitCount instance for
any specified URL.
We'll encapsulate all these data and methods as a single
instance of class AccessSummary. This
instance will accept a stream of PageGet
instances, filter out the ones that don't matter (such as
image files and CSS stylesheets), and store the access
data as a set of HitCount instances in
such a way that they can be retrieved by all the access
methods described above.
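A minimal sketch of such a container follows, under several assumptions. The Access record is a hypothetical stand-in for a PageGet instance, whose real attributes are not shown here. The list of ignored suffixes and all method names are invented for illustration. Plain two-element lists stand in for HitCount instances to keep the sketch short, and the timestamp range and overall totals are omitted.

```python
from collections import namedtuple

# Hypothetical stand-in for a PageGet instance.
Access = namedtuple("Access", "url is_off_campus")

# Assumed filter: suffixes of files that should not be counted.
IGNORED_SUFFIXES = (".css", ".gif", ".jpg", ".png")


class AccessSummary:
    """Accumulates per-URL hit counts from a stream of access records."""

    def __init__(self):
        self._hits = {}  # url -> [nTotal, nFar]

    def add(self, access):
        """Count one access, skipping images and stylesheets."""
        if access.url.lower().endswith(IGNORED_SUFFIXES):
            return
        counts = self._hits.setdefault(access.url, [0, 0])
        counts[0] += 1
        if access.is_off_campus:
            counts[1] += 1

    def lookup(self, url):
        """Return (nTotal, nFar) for url, or None if it was never seen."""
        counts = self._hits.get(url)
        return tuple(counts) if counts else None

    def hit_parade(self):
        """URLs by descending hit count, with URL ascending as tiebreak."""
        return sorted(self._hits, key=lambda u: (-self._hits[u][0], u))

    def urls_with_prefix(self, prefix):
        """All counted URLs that start with the given first component."""
        return sorted(u for u in self._hits if u.startswith(prefix))


summary = AccessSummary()
for rec in [
    Access("/", False),
    Access("/", True),
    Access("/logo.png", True),
    Access("/~jdoe/index.html", True),
]:
    summary.add(rec)
```

With this sample input, the image request is dropped, "/" ends up with two hits (one of them off-campus), and the per-directory lookup for “/~jdoe/” returns the single page under it.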
An earlier version of this program used external files and an ancient sort-merge algorithm that dates from the 1960s. Starting with this version, however, everything we need can probably fit in memory. Preliminary testing in January 2009 showed that the number of distinct URLs to be managed is on the order of 100,000. Even with an average URL length of 50 characters or so, that's a minimum memory footprint on the order of 5 MB, hardly a strain on today's multi-gigabyte main memories.
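The estimate is simple arithmetic, reproduced here as a rough check. The sketch assumes one byte per URL character. The second half uses sys.getsizeof() to show that each string object also carries interpreter overhead, so the figure is a lower bound rather than a prediction. The variable names are illustrative.

```python
import sys

N_URLS = 100000      # order of magnitude from the January 2009 test
AVG_URL_CHARS = 50   # assumed average URL length, one byte per character

raw_bytes = N_URLS * AVG_URL_CHARS
raw_mb = raw_bytes / 1e6   # 5.0, in decimal megabytes

# Per-object overhead of one string of that length in this interpreter;
# the exact value depends on the Python version and build.
sample = "x" * AVG_URL_CHARS
overhead = sys.getsizeof(sample) - AVG_URL_CHARS
```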
Python's __slots__ feature, available in new-style classes,
allows us to reduce the memory requirements for each of the
HitCount instances. The named slots will be .nTotal, .nFar, and .url.
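As a sketch of what the slot declaration does (the constructor here is an assumption, not the script's code): a class that declares __slots__ gets no per-instance __dict__, which is where the memory saving comes from, and it rejects any attribute not named in the slots.

```python
class HitCount(object):
    """Illustrative sketch: a class restricted to three named slots."""
    __slots__ = ("nTotal", "nFar", "url")

    def __init__(self, url):
        self.url = url
        self.nTotal = 0
        self.nFar = 0


h = HitCount("/index.html")
h.nTotal += 1

# Slotted instances carry no per-instance dictionary.
has_dict = hasattr(h, "__dict__")

# Assigning an attribute outside the declared slots raises AttributeError.
try:
    h.referrer = "elsewhere"
    rejected = False
except AttributeError:
    rejected = True
```

Deriving from object makes the class new-style under Python 2; under Python 3 every class is new-style and the explicit base is harmless.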