See the specification for complete details of the format of the access log file, and where these files live.
In general the access log line can be divided into four groups:
The accessor group describes who is asking for the page.
The date group tells when the request came in.
The command group includes the
GET or POST command word, the URL being
accessed, and the protocol being used.
The tail group includes the result code, the length of the page fetched (if any), and the referring URL (if known).
A previous version of this program used one large regular
expression to break the line into these four groups. The
problem with that approach is that the command group is
enclosed in double-quotes "...",
but occasionally there may be an escaped double quote
(\") inside that string.
It may be possible to handle this complication with regular expressions, but the author considers that a little too much like rocket science. Instead, the log line is disassembled using “mixed strategy parsing”: some of the disassembly uses regular expressions, and some uses more ad-hoc methods. Here's the general flow:
The accessor information consists of four or more blank-delimited fields. If there are multiple accessors, they are separated by one comma and one space.
The date group is enclosed within square brackets
([...]). We can use a single
regular expression to describe the accessor group as
everything up to the "[", and
the date group as everything from there up to the
"]". This gives us the first
two major groups, and what remains is passed on to the
next step.
What remains after the removal of the date group is the double-quoted command string, followed by one space and the tail group. The quoted string is removed by a small routine that scans for an unescaped closing quote. Any escaped quotes found in the string are returned as part of the content.
The part remaining after the previous step is the tail group.
These four steps are handled in the next function,
scanGroups().
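As an illustration that is independent of scanGroups() itself, the flow described above might be sketched in Python roughly as follows. This is a minimal sketch and not the author's code: the names HEAD_RE, scan_quoted() and split_log_line() are hypothetical, and it assumes that the accessor group contains no "[" and that a backslash inside the quoted string escapes whatever character follows it.

```python
import re

# Step 1 (regular expression): the accessor group is everything up to
# the "[", and the date group is everything from there up to the "]".
# Whatever follows is captured as "rest" for the later steps.
HEAD_RE = re.compile(
    r'(?P<accessor>[^\[]*)\[(?P<date>[^\]]*)\]\s*(?P<rest>.*)')


def scan_quoted(text):
    """Step 2 (ad-hoc scan): split off a leading double-quoted string.

    Returns (content, remainder). Escaped quotes are kept, backslash
    included, as part of the content.
    """
    if not text.startswith('"'):
        raise ValueError("expected an opening double quote")
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == '\\' and i + 1 < len(text):
            i += 2      # skip the backslash and the escaped character
        elif ch == '"':
            return text[1:i], text[i + 1:]
        else:
            i += 1
    raise ValueError("no unescaped closing quote found")


def split_log_line(line):
    """Return (accessor, date, command, tail) for one access log line."""
    m = HEAD_RE.match(line)
    if m is None:
        raise ValueError("no [date] group found")
    command, tail = scan_quoted(m.group('rest'))
    # Step 3: what is left after the quoted string is the tail group.
    return m.group('accessor').strip(), m.group('date'), command, tail.strip()


if __name__ == '__main__':
    # A made-up log line with an escaped quote inside the command group.
    sample = (r'10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] '
              r'"GET /a\"b.html HTTP/1.0" 200 2326 "http://example.com/"')
    for group in split_log_line(sample):
        print(group)
```

The point of the split is that the regular expression only has to recognize the two bracket characters, while the character-by-character scan deals with the escaped quotes that make a single large pattern awkward.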