This method maintains the invariants on three dictionaries:
.__personalLetterMap, .__personalMap, and .__officialMap.
It ignores accesses to the institute homepage “/”.
# - - - A c c e s s S u m m a r y . _ _ a d d C a t e g o r y
def __addCategory ( self, url ):
'''Add this URL to the personal or official dictionaries.
[ url is an URL as a string ->
if url is "/" or otherwise length 1 ->
I
else if url is for a personal page ->
self.__personalLetterMap +:= entry for the third
character of url
self.__personalMap +:= entry for url
else ->
self.__officialLetterMap +:= entry for the second
character of url
self.__officialMap +:= entry for url ]
'''
The values in all three of these dictionaries are Python
sets. They use the defaultdict class from the standard Python collections module so that we don't have to
worry about creating the set on the first access: we can
just use the set type's .add() method. For more information on defaultdict, see Section 5, “Imported modules”.
The first task is to classify the URL: is it the
institute homepage ("/"), a personal page
starting with "/~", or an official page
starting with "/" not followed by "~"? See the note at the end of this method
regarding a defect in the original version; any URL
of length one is ignored here.
#-- 1 --
if len(url) < 2:
return
else:
maybeTilde = url[1]
If the character following the "/" is a
tilde, it's a personal page; otherwise it is an official
page.
#-- 2 --
# [ if maybeTilde != '~' ->
# self.__officialMap[directory name from url] +:= url
# return
# else -> I ]
if maybeTilde != '~':
#-- 2.1 --
# [ dirName := portion of url[1:] up to the next
# "/" if there is one, or to the end otherwise ]
dirName = url[1:].split('/')[0]
#-- 2.2 --
# [ self._officialMap[dirName] +:= url
# return ]
try:
self.__officialMap[dirName].add(url)
except KeyError:
self.__officialMap[dirName] = set ( [url] )
return
For a personal page, we must add the URL to two sets: one set by first letter, and one set by personal account name.
#-- 3 --
# [ dirName := portion of url[2:] up to the next "/" if
# there is one, or to the end otherwise
# first := url[2] ]
dirName = url[2:].split('/')[0]
first = url[2]
#-- 4 --
# [ self.__personalLetterMap[first] +:= dirName
# self.__personalMap[dirName] +:= url ]
try:
self.__personalLetterMap[first].add ( dirName )
except KeyError:
self.__personalLetterMap[first] = set ( [dirName] )
try:
self.__personalMap[dirName].add ( url )
except KeyError:
self.__personalMap[dirName] = set ( [url] )
In June 2011, a bogus access log record crashed the script. Here is the original code of step 1, above. Here are the events leading to this failure.
The command part of the access log entry looked like this:
"GET HTTP/1.1"
There were two spaces between GET and
HTTP: the URL should have been between
those two spaces.
In Section 51.11, “scanCmdGroup: Process
command group”, the command was
broken on spaces using cmdGroup.split('
'), and the URL was set to the second element
of the result list, which in this case was an empty
string.
In Section 51.12, “cleanURL(): Process the raw URL”, the path part of the
URL was normalized by using the standard library's
os.path.normpath() function. This has
the effect of removing “..” elements
from the path, a necessity for security reasons.
However, when this function is passed an empty string,
it returns '.'.
Here in this method, I made the assumption that
either the URL was just '/' (as the
constant INSTITUTE_HOMEPAGE, defined
in Section 6.3, “Web paths”), or it had a second
character. Here is the original, flawed code block:
#-- 1 --
if url==INSTITUTE_HOMEPAGE:
return
else:
maybeTilde = url[1]
Because the value of url at this point was
'.', the else clause
failed with an IndexError.
The solution is to ignore here any URL that has a length of one.