
51.5. Breaking down the log line: a mixed-strategy parse

See the specification for complete details of the format of the access log file, and where these files live.

In general the access log line can be divided into four groups: the accessor group, the date group, the command group, and the tail group.

A previous version of this program used one large regular expression to break the line into these four groups. The problem with that approach is that the command group is enclosed in double quotes ("..."), but occasionally there may be an escaped double quote (\") inside that string.
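To illustrate the problem, here is a minimal sketch of how a simple quote-matching pattern goes wrong. The log fragment and the pattern are hypothetical, invented for this example; they are not taken from the earlier version of the program.

```python
import re

# Hypothetical fragment: a command group containing escaped quotes,
# followed by a tail group.
fragment = r'"GET /search?q=\"hello\" HTTP/1.0" 200 512'

# Naive pattern (an assumption for illustration): the command group is
# whatever lies between the first double quote and the next one.
NAIVE_PAT = re.compile(r'"([^"]*)"')

# The match stops at the first escaped quote, so the command is truncated.
naive = NAIVE_PAT.match(fragment).group(1)
print(naive)
```

The pattern treats the quote in the first \" as the closing quote, so everything after it is wrongly assigned to the tail.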

It may be possible to use regular expressions to handle this complication, but the author considers that a little too much like rocket science. Instead, the disassembly of the log line proceeds using “mixed-strategy parsing”: some of the disassembly uses regular expressions, and some uses more ad hoc methods. Here's the general flow:

  1. The accessor group consists of four or more blank-delimited fields. If there are multiple accessors, they are separated by one comma and one space.

  2. The date group is enclosed within square brackets ([...]). We can use a single regular expression to describe the accessor group as everything up to the "[", and the date group as everything from there up to the "]". This gives us the first two major groups, and what remains is passed on to the next step.

  3. What remains after the removal of the date group is the double-quoted string followed by one space and the tail group. This portion is removed by a small routine that scans for an unescaped closing quote. Any escaped quotes found in the string are returned as part of the content.

  4. The part remaining after the previous step is the tail group.

These four steps are handled in the next function, scanGroups().
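As a rough illustration of the flow described above, here is a minimal sketch of a mixed-strategy parse. It is not the scanGroups() function itself: the names HEAD_PAT, scan_quoted(), and split_groups(), the exact regular expression, and the sample log line are all assumptions made for this example.

```python
import re

# Assumed pattern: the accessor group is everything before "[", and the
# date group is everything between "[" and "]".
HEAD_PAT = re.compile(r'([^\[]*)\[([^\]]*)\]\s*')

def scan_quoted(s):
    """Split s, which must start with a double quote, into
    (content, rest) at the first unescaped closing quote.
    Escaped quotes are kept, backslash included, in the content."""
    if not s.startswith('"'):
        raise ValueError("Expected an opening double quote")
    i = 1
    while i < len(s):
        if s[i] == '\\':
            i += 2          # Skip the backslash and the character it escapes
        elif s[i] == '"':
            return s[1:i], s[i + 1:]
        else:
            i += 1
    raise ValueError("No closing double quote")

def split_groups(line):
    """Return (accessor, date, command, tail) for one log line."""
    # Steps 1 and 2: a regular expression removes the first two groups.
    m = HEAD_PAT.match(line)
    if m is None:
        raise ValueError("No accessor and date groups found")
    accessor, date = m.group(1).rstrip(), m.group(2)
    # Step 3: an ad hoc scan removes the double-quoted command group.
    command, rest = scan_quoted(line[m.end():])
    # Step 4: whatever remains is the tail group.
    return accessor, date, command, rest.strip()

# Hypothetical sample line; the field values are invented.
sample = (r'host.example.com 192.0.2.1 - - '
          r'[10/Oct/2000:13:55:36 -0700] "GET /a\"b HTTP/1.0" 200 2326')
accessor, date, command, tail = split_groups(sample)
print(command)
```

The regular expression does the easy part, and the hand-written scan handles the one place where escaping makes a pattern awkward.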