This function takes the URL from the log, reverses URL encoding, and removes any CGI arguments.
Although Python's urllib module
has a perfectly reasonable function to decode URL
encoding, in December 2005, an earlier version of this
function led to the script crashing. Here is the story
of that crash and the resulting fix.
A certain user, no doubt in Windows, created a URL
containing the string “Tönen”. In a single-byte encoding
such as Latin-1, the “ö” character is represented as
the byte "\xf6". That byte was
placed into an XHTML page, built using the Document
Object Model (DOM).
Unfortunately, the DOM expects all strings to be encoded
using UTF-8. A lone "\xf6" byte is not a valid UTF-8 sequence, so
the DOM serializer failed when it tried to convert the
string to Unicode using the Python function
“unicode(text, "utf-8")”. (For more information about Unicode, see
The Unicode Standard, Version 4.0.)
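The failure can be reproduced in a few lines. The sketch below assumes Python 3, where the counterpart of Python 2's unicode(text, "utf-8") is bytes.decode("utf-8"); the helper name try_utf8 is invented for this illustration and is not part of the script.

```python
# "Tönen" as it arrived in the log: one byte per character (Latin-1).
raw = b"T\xf6nen"

def try_utf8(data):
    """Return data decoded as UTF-8, or None if it is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

print(try_utf8(raw))                # the lone 0xF6 byte cannot be decoded
print(raw.decode("latin-1"))        # decodes cleanly as Latin-1
print("Tönen".encode("utf-8"))      # UTF-8 needs two bytes for the same letter
```

In UTF-8 the same letter occupies two bytes, "\xc3\xb6", which is why the single byte is rejected.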
We could fix this by leaving the URL encoded, but that
escapes a number of quite reasonable characters,
including “~”, so
that all URLs of user pages would start with
“%7Eusername/...”.
A better fix is to go through the characters of the URL
and apply URL-encoding only to
characters whose codes are not ASCII, that is, whose
codes are 0x80 or greater.
One advantage of this is that users can paste URLs
containing such characters into the browser, and they'll
get the correct URL.
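A minimal sketch of this idea, assuming Python 3 and a URL held as a bytes object. The name requote_non_ascii is hypothetical; the script's own implementation is asciifyString, described in Section 51.13, and may differ in detail.

```python
def requote_non_ascii(raw):
    """Return raw (bytes) as a str, with each byte >= 0x80 written as %XX."""
    pieces = []
    for byte in raw:                       # iterating over bytes yields ints
        if byte >= 0x80:
            pieces.append("%%%02X" % byte) # e.g. 0xF6 becomes "%F6"
        else:
            pieces.append(chr(byte))       # ASCII passes through unchanged
    return "".join(pieces)

print(requote_non_ascii(b"/~user/T\xf6nen.html"))
```

Here the “~” stays readable while the “ö” byte becomes “%F6”, so the result contains only ASCII characters.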
# - - - c l e a n U R L - - -
def cleanURL ( rawURL ):
    """Remove from a URL any URL-encoding and "?..." tail

      [ if rawURL is a string ->
          return rawURL with URL encoding decoded and minus
          any CGI arguments ]
    """
First we remove the CGI arguments.
    #-- 1 --
    # [ if rawURL contains any "?" characters ->
    #     head := rawURL up to the first "?" character
    #   else ->
    #     head := rawURL ]
    L = rawURL.split ( "?" )    # Keep only the part before the first "?"
    head = L[0]
Reversing URL encoding is handled by the Python
urllib module's
unquote() function.
    #-- 2 --
    # [ unquoted := head with URL-encoded characters unquoted ]
    unquoted = urllib.unquote ( head )
The next line requotes all non-ASCII characters (see
Section 51.13, “asciifyString: Encode
non-ASCII characters”), then removes redundant "/" and
".." elements from the URL.
    #-- 3 --
    # [ return unquoted with redundant "/" and ".." groups
    #   removed and all characters with codes >= 0x80
    #   replaced by their URL-encoded forms ]
    return os.path.normpath ( asciifyString ( unquoted ) )
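For readers on Python 3, where urllib.unquote no longer exists under that name, the same three steps might be sketched as follows. This is an illustration under stated assumptions, not the script's code: the names clean_url and ASCII_SAFE are invented here, quote_from_bytes with an all-ASCII safe set stands in for asciifyString, and posixpath.normpath replaces os.path.normpath so that the result uses "/" on every platform.

```python
import posixpath
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Treat every ASCII byte as "safe", so only bytes >= 0x80 are re-quoted.
ASCII_SAFE = bytes(range(128))

def clean_url(raw_url):
    """Sketch of cleanURL: drop CGI arguments, decode, re-quote non-ASCII."""
    # Step 1: keep only the part before the first "?".
    head = raw_url.split("?")[0]
    # Step 2: undo URL encoding, working in bytes so that a byte such as
    # 0xF6 is never interpreted as text.
    unquoted = unquote_to_bytes(head)
    # Step 3: re-quote non-ASCII bytes, then remove redundant "/" and "..".
    requoted = quote_from_bytes(unquoted, safe=ASCII_SAFE)
    return posixpath.normpath(requoted)

print(clean_url("/%7Euser//docs/../T%F6nen.html?lang=de"))
```

With this input the “%7E” is decoded to “~”, the “%F6” survives in encoded form, and the "//docs/.." detour and the query string are removed.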