
2. Overview

The script is written in the Python language. For more information about Python, see the Python 2.2 quick reference.

These source files comprise the webstats.py application:

2.1. Data structures

The pageget.py module defines a class PageGet, each instance of which represents one line in the Apache server's access_log file.

What we need to know from the access logs boils down to three items of data for each page accessed in the time period of interest. We'll encapsulate these three items as instances of class HitCount:

  • .nTotal: The total number of hits on the page.

  • .nFar: The number of hits from off-campus, so that we can compute the percentage of off-campus hits.

  • .url: The page's URL, minus the “http://infohost.nmt.edu” prefix.

The HitCount class will include a .__cmp__() method that compares instances so that sorting them yields hit-parade order: descending order by their .nTotal attribute, with their .url attribute in ascending order as a secondary key.
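Here is a minimal sketch of such a class, under some assumptions. Python 3 no longer calls .__cmp__(), so the sketch expresses the same ordering through .__lt__() and a hypothetical .sortKey() helper; under Python 2 a .__cmp__() method would compare the same keys. The .pctFar() helper and the sample URLs are also illustrative and not taken from the actual script.

```python
class HitCount(object):
    """Hit statistics for one URL (illustrative sketch only)."""

    def __init__(self, url, nTotal=0, nFar=0):
        self.url = url          # URL without the host prefix
        self.nTotal = nTotal    # total hits on the page
        self.nFar = nFar        # hits from off-campus

    def sortKey(self):
        # Hypothetical helper: negating nTotal makes larger counts sort
        # first; the URL then breaks ties in ascending order.
        return (-self.nTotal, self.url)

    def __lt__(self, other):
        # sorted() and list.sort() only need __lt__ in Python 3.
        return self.sortKey() < other.sortKey()

    def pctFar(self):
        # Hypothetical helper: percentage of off-campus hits.
        if self.nTotal == 0:
            return 0.0
        return 100.0 * self.nFar / self.nTotal


# Made-up sample data.
hits = [
    HitCount("/tcc/", 40, 10),
    HitCount("/", 100, 25),
    HitCount("/about/", 40, 30),
]
parade = [h.url for h in sorted(hits)]
```

The two pages with 40 hits each end up after the homepage, in URL order.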

We'll need a container class to hold the HitCount instances for each URL accessed in the report period. The structure and function of this container are driven by the needs of the script:

  • In order to generate the overall summary block on the root page, we'll need to remember the range of timestamps included in the report, and the total number of on- and off-campus hits.

  • For the access report on the NMT homepage, we'll need a HitCount instance for URL “/”.

  • For the hit parade, we'll need to visit the HitCount instances for every URL accessed in the report period, sorted as defined by the HitCount.__cmp__() method: in descending order by hit count, with URL as a secondary key.

  • To generate the tables of personal and official pages by their first character, we'll need to interrogate the first character values that occur in both those categories.

  • To generate the pages showing all the personal and official directories for a given first character, we'll need to interrogate the list of all personal and official URLs that start with a given character.

  • To generate the access report pages for a single personal or official directory, we'll need to interrogate the list of all URLs that start with a given first component (either “/~login” or “/directory”).

  • Finally, to produce the detail lines on the individual access report pages, we'll need to be able to retrieve the HitCount instance for any specified URL.
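The first-character and first-component lookups in the list above could be served by one nested index. The sketch below is only one possible arrangement: the function names, the "personal"/"official" labels, the lowercasing of the first character, and the sample URLs are all assumptions. It also assumes that every URL starts with “/” and that a personal page's first character is the first character of the login name after the “~”.

```python
from collections import defaultdict


def firstComponent(url):
    """Return the first path component: '/~login/x.html' -> '/~login'."""
    # url starts with "/", so element 0 of the split is the empty string.
    return "/" + url.split("/")[1]


def classify(component):
    """Return (kind, firstChar) for a first component, or None for '/'."""
    if component.startswith("/~"):
        name, kind = component[2:], "personal"
    else:
        name, kind = component[1:], "official"
    if not name:
        return None          # the root page, or a bare "/~"
    return (kind, name[0].lower())


def buildIndex(urls):
    """Build {kind: {firstChar: {component: [urls]}}} from URLs."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for url in sorted(urls):
        component = firstComponent(url)
        key = classify(component)
        if key is not None:
            kind, ch = key
            index[kind][ch][component].append(url)
    return index


# Made-up sample URLs.
index = buildIndex([
    "/", "/~alice/", "/~alice/soft/", "/~adams/", "/~bob/",
    "/tcc/", "/tcc/help/", "/admin/",
])
personalChars = sorted(index["personal"])
aDirs = sorted(index["personal"]["a"])
tccUrls = index["official"]["t"]["/tcc"]
```

With this layout, the keys of index["personal"] answer the first-character question, the keys one level down list the directories for a character, and the innermost lists give the URLs for one directory.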

We'll encapsulate all these data and methods as a single instance of class AccessSummary. This instance will accept a stream of PageGet instances, filter out the ones that don't matter (such as image files and CSS stylesheets), and store the access data as a set of HitCount instances in such a way that they can be retrieved by all the access methods described above.
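As a rough sketch of how such a container might accumulate its data, consider the code below. The PageGet interface is not described here, so the sketch uses a stand-in named FakePageGet whose fields (timestamp, url, isFar) are assumptions; the method names addGet(), getHit(), and hitParade(), and the list of ignored suffixes, are likewise hypothetical.

```python
from collections import namedtuple

# Stand-in for PageGet; these field names are assumptions.
FakePageGet = namedtuple("FakePageGet", "timestamp url isFar")

# Assumed suffixes for requests that don't matter; the real filter may differ.
SKIP_SUFFIXES = (".gif", ".jpg", ".png", ".css")


class HitCount(object):
    def __init__(self, url):
        self.url = url
        self.nTotal = 0
        self.nFar = 0


class AccessSummary(object):
    def __init__(self):
        self.firstTime = None   # earliest timestamp counted
        self.lastTime = None    # latest timestamp counted
        self.nTotal = 0         # all counted hits
        self.nFar = 0           # counted off-campus hits
        self.urlMap = {}        # url -> HitCount

    def addGet(self, pageGet):
        """Count one access, unless it is filtered out."""
        if pageGet.url.lower().endswith(SKIP_SUFFIXES):
            return
        if self.firstTime is None or pageGet.timestamp < self.firstTime:
            self.firstTime = pageGet.timestamp
        if self.lastTime is None or pageGet.timestamp > self.lastTime:
            self.lastTime = pageGet.timestamp
        hit = self.urlMap.get(pageGet.url)
        if hit is None:
            hit = self.urlMap[pageGet.url] = HitCount(pageGet.url)
        hit.nTotal += 1
        self.nTotal += 1
        if pageGet.isFar:
            hit.nFar += 1
            self.nFar += 1

    def getHit(self, url):
        """Return the HitCount for url, or None if it was never hit."""
        return self.urlMap.get(url)

    def hitParade(self):
        """All HitCounts, descending by nTotal, then ascending by URL."""
        return sorted(self.urlMap.values(),
                      key=lambda h: (-h.nTotal, h.url))


# Made-up sample accesses; timestamps are plain integers here.
summary = AccessSummary()
for get in [
    FakePageGet(100, "/", True),
    FakePageGet(101, "/~login/a.html", False),
    FakePageGet(102, "/logo.png", True),
    FakePageGet(103, "/", False),
    FakePageGet(104, "/~login/", True),
]:
    summary.addGet(get)
paradeUrls = [h.url for h in summary.hitParade()]
```

A plain dictionary keyed by URL covers the retrieval-by-URL requirement directly; the other access methods can be layered on top of it.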

An earlier version of this program used external files and an ancient sort-merge algorithm that dates from the 1960s. Starting with this version, however, everything we need will probably fit in memory. Preliminary testing in January 2009 showed that the number of distinct URLs to be managed is on the order of 100,000. Even with an average URL length of 50 characters or so, that's a minimal memory footprint on the order of 5 MB, hardly a strain on today's multi-gigabyte main memories.
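The estimate is simple arithmetic, spelled out below with hypothetical variable names. It counts only the URL characters at one byte each; Python's per-object overhead would add to the total, which is why the figure is a minimum.

```python
N_URLS = 100000        # distinct URLs, order of magnitude from testing
AVG_URL_LEN = 50       # average URL length in characters (rough figure)

urlBytes = N_URLS * AVG_URL_LEN    # bytes for the URL text alone
urlMegabytes = urlBytes / 1e6      # same figure in megabytes
```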

Python's __slots__ feature, available in new-style classes, lets us reduce the memory required for each HitCount instance. The named slots will be .nTotal, .nFar, and .url.
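A small sketch of the effect, with a made-up URL and counts: declaring __slots__ means instances carry no per-instance __dict__, and any attribute outside the three named slots is rejected. The constructor signature is an assumption.

```python
class HitCount(object):
    # Only these three attributes may exist on an instance.
    __slots__ = ("nTotal", "nFar", "url")

    def __init__(self, url, nTotal=0, nFar=0):
        self.url = url
        self.nTotal = nTotal
        self.nFar = nFar


hit = HitCount("/tcc/", 12, 3)

# Slotted instances have no per-instance attribute dictionary.
hasDict = hasattr(hit, "__dict__")

# Assigning an attribute that is not a named slot raises AttributeError.
try:
    hit.nNear = 9
    rejected = False
except AttributeError:
    rejected = True
```

The saving comes from dropping the attribute dictionary, which matters when there are on the order of 100,000 instances.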