
37. AccessSummary.__addCategory(): Add URL to appropriate category

This method maintains the invariants on three dictionaries: .__personalLetterMap, .__personalMap, and .__officialMap. It ignores accesses to the institute homepage “/”.

webstats.py
# - - -   A c c e s s S u m m a r y . _ _ a d d C a t e g o r y

    def __addCategory ( self, url ):
        '''Add this URL to the personal or official dictionaries.

          [ url is a URL as a string ->
              if url is "/" or otherwise shorter than two characters ->
                I
              else if url is for a personal page ->
                self.__personalLetterMap  +:=  entry for the third
                    character of url
                self.__personalMap  +:=  entry for url
              else ->
                self.__officialMap  +:=  entry for url ]
        '''

The values in all three of these dictionaries are Python sets. They use the defaultdict class from the standard Python collections module so that we don't have to worry about creating the set on the first access: we can just use the set type's .add() method. With defaultdict in place, the except KeyError branches in the code below are only a safeguard: they would run only if a map were a plain dict. For more information on defaultdict, see Section 5, “Imported modules”.
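
As a minimal sketch of the defaultdict behavior described above: the name officialMap and the sample URLs here are illustrative stand-ins, not the actual attributes or data of AccessSummary.

```python
from collections import defaultdict

# Hypothetical stand-in for one of the maps: any missing key
# starts out as an empty set, so .add() works on first access.
officialMap = defaultdict(set)

officialMap['admissions'].add('/admissions/index.html')
officialMap['admissions'].add('/admissions/apply.html')
# Adding the same URL again leaves the set unchanged.
officialMap['admissions'].add('/admissions/index.html')
```

No KeyError is raised on the first .add() call, because defaultdict calls set() to build the missing value.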

The first task is to classify the URL: is it the institute homepage ("/"), a personal page starting with "/~", or an official page starting with "/" not followed by "~"? See the note at the end of this method regarding a defect in the original version; any URL shorter than two characters is ignored here.

webstats.py
        #-- 1 --
        if len(url) < 2:
            return
        else:
            maybeTilde  =  url[1]

If the character following the "/" is a tilde, it's a personal page; otherwise it is an official page.
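
The three-way classification can be sketched as a standalone function. The helper name classify and the sample paths are assumptions made for illustration; webstats.py performs these tests inline rather than through such a function.

```python
def classify(url):
    """Return 'ignored', 'personal', or 'official' for a URL path.

    Hypothetical helper, not part of webstats.py.
    """
    # Anything shorter than two characters has no second
    # character to inspect; this covers "/" and ".".
    if len(url) < 2:
        return 'ignored'
    # A tilde right after the leading slash marks a personal page.
    if url[1] == '~':
        return 'personal'
    return 'official'
```

For example, classify('/~jdoe/index.html') yields 'personal', while classify('/tcc/help/') yields 'official'.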

webstats.py
        #-- 2 --
        # [ if maybeTilde != '~' ->
        #     self.__officialMap[directory name from url]  +:= url
        #     return
        #   else -> I ]
        if maybeTilde != '~':
            #-- 2.1 --
            # [ dirName  :=  portion of url[1:] up to the next
            #       "/" if there is one, or to the end otherwise ]
            dirName  =  url[1:].split('/')[0]

            #-- 2.2 --
            # [ self.__officialMap[dirName]  +:=  url
            #   return ]
            try:
                self.__officialMap[dirName].add(url)
            except KeyError:
                self.__officialMap[dirName] = set ( [url] )
            return

For a personal page, we must add the URL to two sets: one set by first letter, and one set by personal account name.

webstats.py
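        #-- 3 --
        # [ dirName  :=  portion of url[2:] up to the next "/" if
        #       there is one, or to the end otherwise
        #   first  :=  url[2] ]
        dirName  =  url[2:].split('/')[0]
        # A bare "/~" has no account name and no third character;
        # ignore it rather than fail with an IndexError on url[2].
        if not dirName:
            return
        first  =  url[2]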

        #-- 4 --
        # [ self.__personalLetterMap[first]  +:=  dirName
        #   self.__personalMap[dirName]  +:=  url ]
        try:
            self.__personalLetterMap[first].add ( dirName )
        except KeyError:
            self.__personalLetterMap[first]  =  set ( [dirName] )

        try:
            self.__personalMap[dirName].add ( url )
        except KeyError:
            self.__personalMap[dirName]  =  set ( [url] )
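
To illustrate the two-level index that steps 3 and 4 build, here is a small sketch using module-level stand-ins for the two personal maps. The account names and URLs are invented for the example.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two personal-page maps.
personalLetterMap = defaultdict(set)   # first letter -> account names
personalMap = defaultdict(set)         # account name -> URLs

for url in ['/~jdoe/', '/~jdoe/cv.html', '/~jsmith/index.html']:
    # Skip the leading "/~", then keep everything up to the next "/".
    dirName = url[2:].split('/')[0]
    first = url[2]
    personalLetterMap[first].add(dirName)
    personalMap[dirName].add(url)
```

After the loop, personalLetterMap['j'] holds both account names, and personalMap['jdoe'] holds the two URLs under that account.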

Note

In June 2011, a bogus access log record crashed the script. The events leading to this failure were as follows; the original code of step 1, above, is shown in item 4.

  1. The command part of the access log entry looked like this:

    "GET  HTTP/1.1"
    

    There were two spaces between GET and HTTP: the URL should have been between those two spaces.

  2. In Section 51.11, “scanCmdGroup: Process command group”, the command was broken on spaces using cmdGroup.split(' '), and the URL was set to the second element of the result list, which in this case was an empty string.

  3. In Section 51.12, “cleanURL(): Process the raw URL”, the path part of the URL was normalized by using the standard library's os.path.normpath() function. This has the effect of removing “..” elements from the path, a necessity for security reasons. However, when this function is passed an empty string, it returns '.'.

  4. Here in this method, I assumed that the URL either was just '/' (the constant INSTITUTE_HOMEPAGE, defined in Section 6.3, “Web paths”) or had a second character. Here is the original, flawed code block:

            #-- 1 --
            if url==INSTITUTE_HOMEPAGE:
                return
            else:
                maybeTilde  =  url[1]
    

    Because the value of url at this point was '.', the else clause failed with an IndexError.

The solution is to ignore here any URL shorter than two characters.
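
The chain of events in this note can be reproduced in a few lines. This is only a sketch: the functions originalStep1 and fixedStep1 are hypothetical stand-ins for the two versions of step 1, written as free functions so they can be run in isolation.

```python
import os.path

# The malformed command group: two spaces and no URL between them.
cmdGroup = "GET  HTTP/1.1"

# Splitting on a single space leaves an empty string in the URL slot.
parts = cmdGroup.split(' ')
rawUrl = parts[1]

# normpath() turns the empty string into '.'.
url = os.path.normpath(rawUrl)

def originalStep1(url, homepage='/'):
    """Hypothetical stand-in for the original, flawed step 1."""
    if url == homepage:
        return None
    return url[1]          # IndexError when url is '.'

def fixedStep1(url):
    """Hypothetical stand-in for the corrected step 1."""
    if len(url) < 2:
        return None
    return url[1]

try:
    originalStep1(url)
    crashed = False
except IndexError:
    crashed = True
```

Here crashed ends up true for the original test, while fixedStep1(url) returns None and the record is skipped.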