Next / Previous / Contents / TCC Help System / NM Tech homepage

51.12. cleanURL(): Process the raw URL

This function takes the URL from the log, reverses URL encoding, and removes any CGI arguments.

Although Python's urllib module has a perfectly reasonable function to decode URL encoding, in December 2005, an earlier version of this function led to the script crashing. Here is the story of that crash and the resulting fix.

A certain user, no doubt in Windows, created a URL containing the string “Tönen”. The “ö” character is represented as "\xf6". That character was placed into an XHTML page, built using the Document Object Model (DOM).

Unfortunately, the DOM expects all strings to be encoded using UTF-8. Character code "\xf6" is not valid in UTF-8 strings, so the DOM serializer failed when it tried to convert the string to Unicode using the Python function “unicode(text, "utf-8")”. (For more information about Unicode, see The Unicode Standard, Version 4.0.)

We could fix this by leaving the URL encoded, but that escapes a number of quite reasonable characters, including “~”, so that all URLs of user pages would start “%7Eusername/...”.

A better fix is to go through the characters of the URL and apply URL-encoding only to characters whose codes are not ASCII, that is, whose codes are 0x80 or greater. One advantage of this is that users can paste URLs containing such characters into the browser, and they'll get the correct URL.

pageget.py
# - - -   c l e a n U R L   - - -

def cleanURL ( rawURL ):
    """Remove from an URL any URL-encoding and "?..." tail

      [ if rawURL is a string ->
          return rawURL with URL encoding decoded and minus
          any CGI arguments ]
    """

First we remove the CGI arguments.

pageget.py
    #-- 1 --
    # [ if rawURL contains any "?" characters ->
    #     head  :=  rawURL up to the first "?" character
    #   else ->
    #     head  :=  rawURL ]
    L     =  rawURL.split ( "?" )   # Discard up to first "?"
    head  =  L[0]

Reversing URL decoding is handled by the Python urllib module's unquote() function.

pageget.py
    #-- 2 --
    # [ unquoted  :=  head with URL-encoded characters unquoted ]
    unquoted  =  urllib.unquote ( head )

The next line removes redundant "/" and ".." elements from the URL, then requotes all non-ASCII characters. See Section 51.13, “asciifyString: Encode non-ASCII characters”.

pageget.py
    #-- 3 --
    # [ return unquoted with redundant "/" and ".." groups
    #   removed and all characters with codes >= 0x80
    #   replaced by their URL-encoded forms ]
    return  os.path.normpath ( asciifyString ( unquoted ) )