Abstract

Describes a system that generates reports summarizing web page access counts on the New Mexico Tech Computer Center web server, http://www.nmt.edu/.

This publication is available in Web form and also as a PDF document. Please forward any comments to tcc-doc@nmt.edu.

Table of Contents

1. Introduction
2. Overview
2.1. Data structures
2.2. Overall program flow
2.3. Navigational considerations
3. The webstats.py module: Main program
4. The webstats.py file: Prologue
5. Imported modules
6. Manifest constants
6.1. EXPIRE_DAYS
6.2. Input file paths
6.3. Web paths
6.4. Constants for HTML generation
6.5. Report label text
7. Main program
8. inputPhase(): Read the access logs
9. readLogFile(): Process one access log
10. buildAllPages(): Generate all output pages
11. addSummaryTable(): Generate the table summarizing all accesses
12. buildHitParade(): Build the hit parade
13. accessReport(): Start a new access report table
14. accessRow(): Add one row to an access report table
15. instituteHomepage(): Access report for “/”
16. buildCategoryTable(): Build categories table and all personal and official reports
17. buildPersonalSide(): Letter and personal pages
18. buildLetter(): Build one letter page and related personal pages
19. addPersonalReport(): Generate links and access report for one personal account
20. buildReportPage(): Build one access report page
21. buildOfficialSide(): Build access reports for official directories
22. fatal(): Write a message and stop
23. class AccessSummary: Principal data structure
24. Manifest constants for class AccessSummary
24.1. AccessSummary.EXPIRE_DAYS: Duration of the report interval
24.2. AccessSummary.SYM_DOMAIN: Local domain name, symbolic form
24.3. AccessSummary.IP_DOMAIN: Local domain in dotted form
24.4. AccessSummary.BAD_STATUS_THRESHOLD: Upper limit for status codes
24.5. AccessSummary.IGNORED_EXTENSIONS: File extensions to be ignored
24.6. AccessSummary.SPIDER_STRINGS: Spider detection strings
25. AccessSummary.__init__(): Constructor
26. AccessSummary.addPageGet(): Process one access record
27. AccessSummary.__isRelevant(): Filter out irrelevant access records
28. AccessSummary.__statusFilter(): Filter by status code
29. AccessSummary.__extFilter(): Ignore certain files by extension
30. AccessSummary.__spiderFilter(): Filter out search engine spider accesses
31. AccessSummary.__pwdFilter(): Filter out password-protected pages
32. AccessSummary.__timeFilter(): Filter out expired records
33. AccessSummary.__specialFilter(): Special case filter
34. AccessSummary.FILTER_FUNCTIONS: Collection of filter functions
35. AccessSummary.__addHit(): Register one access
36. AccessSummary.__addUrl(): Register one access in self.__urlMap
37. AccessSummary.__addCategory(): Add URL to appropriate category
38. AccessSummary.getUrl(): Retrieve hit counts for a given URL
39. AccessSummary.genByHits(): Generate the hit parade
40. AccessSummary.genPersonalLetters(): First letters of personal accounts
41. AccessSummary.genPersonals(): Generate accounts with the same first letter
42. AccessSummary.genOfficials(): Generate names of official directories
43. AccessSummary.genPersonUrls(): All URLs for a given person
44. AccessSummary.genOfficialUrls(): All URLs for an official directory
45. class HitCount: Hit counts for one URL
46. HitCount.__init__(): Constructor
47. HitCount.addHit(): Tally one access
48. HitCount.__cmp__(): Comparator method
49. Epilogue
50. The pageget.py module: Apache log file functions
50.1. Prologue to pageget.py
50.2. class FixedTimeZone
50.3. scanAccessLog(): Scan an access log file
50.4. scanAccessLine(): Process one access log line
50.5. Breaking down the log line: a mixed-strategy parse
50.6. scanGroups(): Top-level disassembly of the log line
50.7. scanQuoted(): Process double-quoted string with escapes
50.8. scanAccessGroup(): Process accessors
50.9. findHostList(): Derive the effective host list
50.10. scanDateGroup(): Process date
50.11. scanCmdGroup(): Process command group
50.12. cleanURL(): Process the raw URL
50.13. asciifyString(): Encode non-ASCII characters
50.14. asciifyChar(): Escape a non-ASCII character
50.15. scanTailGroup(): Process remaining fields
50.16. class PageGet: Describes one page access
50.17. PageGet.__init__(): Constructor
50.18. PageGet.isFar(): Is this an off-campus accessor?
50.19. PageGet.__str__(): Debug display
50.20. error(): Write a message to stderr
50.21. message(): Send a message to standard error and log file
50.22. A small test driver for PageGet

1. Introduction

This document describes the webstats.py script for reducing and displaying statistics on page accesses from the Tech Computer Center's web server. See the specification for the externals of this script.

The code is documented in the “literate programming” style: the source file contains both the documentation and the script's source code. For more information on literate programming and the tool used to extract the source code, see A source extractor for lightweight literate programming.
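
As a rough picture of what “reducing” an access log means, the sketch below tallies successful GET requests per URL from a few lines in the Apache common log format. It is only an illustration under stated assumptions, not the webstats.py implementation: the names LOG_PATTERN, count_hits and SAMPLE are hypothetical, the sample lines are invented, and the cutoff of 400 for rejected status codes is an assumed value. The actual script parses and filters log lines in much more detail, as the later sections describe.

```python
import re
from collections import Counter

# Hypothetical pattern for the Apache "common" log format. This is an
# assumption for illustration; it is not the parser used by pageget.py.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def count_hits(lines, bad_status_threshold=400):
    """Tally GET requests per URL, skipping unparseable or failed ones.

    The default threshold of 400 is an assumed value, chosen so that
    client and server errors are not counted as page accesses.
    """
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # line does not look like a common-format record
        if match.group("method") != "GET":
            continue  # only page fetches are of interest here
        if int(match.group("status")) >= bad_status_threshold:
            continue  # ignore error responses
        hits[match.group("url")] += 1
    return hits


# Invented sample records, using documentation-only IP addresses.
SAMPLE = [
    '192.0.2.1 - - [10/Oct/2005:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.7 - - [10/Oct/2005:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.7 - - [10/Oct/2005:13:56:09 -0700] "GET /missing.html HTTP/1.0" 404 209',
]

# Print URLs from most to least accessed.
for url, n in count_hits(SAMPLE).most_common():
    print(n, url)
```

With the sample records above, only /index.html is counted, because the request for /missing.html returned an error status.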
