
51.10. scanDateGroup(): Process date

This function processes the date group of the log line. Its argument is the date group (without its enclosing square brackets); the return value is the date as a datetime.datetime instance.

Note

We ignore the zone correction given in the record, because time.mktime() expects a local time. This may actually cause a record to be assigned to the wrong day during daylight time transitions. If that is a problem, extract the zone correction along with the other times, convert it to GMT, and then subtract time.timezone to re-correct it to local civil time.

We will use a fairly large regular expression to syntax-check the input and break it into its component parts. Here's an example of an actual log file timestamp:

08/Feb/2009:04:02:54 -0700

Here is the regular expression, annotated:

pageget.py
# - - -   s c a n D a t e G r o u p   - - -

DOM_CODE     =  "D"          # Day of month field
MON_CODE     =  "M"          # Month field (e.g., "Jan")
YYYY_CODE    =  "Y"          # Year field
HOUR_CODE    =  "h"          # Hour field
MIN_CODE     =  "m"          # Minutes field
SEC_CODE     =  "s"          # Seconds field
TZSIGN_CODE  =  "c"          # Sign of zone correction
TZHH_CODE    =  "Z"          # Hours part of zone correction
TZMM_CODE    =  "z"          # Minutes part of zone correction

datePat  =  re.compile (
    r'(?P<%s>'              # Start DOM_CODE group
      r'\d{2}'              # Day of month
    r')'
    r'/'                    # Slash separator
    r'(?P<%s>'              # Start MON_CODE group
      r'[a-zA-Z]{3}'        # Three-letter month code
    r')'
    r'/'
    r'(?P<%s>'              # Start YYYY_CODE group
      r'\d{4}'              # Four-digit year
    r')'
    r':'                    # Colon separator
    r'(?P<%s>'              # Start HOUR_CODE group
      r'\d{2}'              # Hour
    r')'
    r':'                    # Colon separator
    r'(?P<%s>'              # Start MIN_CODE group
      r'\d{2}'              # Minute
    r')'
    r':'                    # Colon separator
    r'(?P<%s>'              # Start SEC_CODE group
      r'\d{2}'              # Second
    r')'
    r' '                    # Matches one space
    r'(?P<%s>'              # Start TZSIGN_CODE group
      r'[\-+]'              # Matches '+' or '-'
    r')'
    r'(?P<%s>'              # Start TZHH_CODE group
      r'\d{2}'              # Two digits
    r')'
    r'(?P<%s>'              # Start TZMM_CODE group
      r'\d{2}'              # Two digits
    r')'
    % ( DOM_CODE, MON_CODE, YYYY_CODE, HOUR_CODE, MIN_CODE,
        SEC_CODE, TZSIGN_CODE, TZHH_CODE, TZMM_CODE ) )
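
To see what the named groups capture, the following standalone sketch applies a condensed equivalent of the pattern to the sample timestamp above. It is written for Python 3; the names DEMO_PAT and demo_fields are ours for illustration and are not part of pageget.py.

```python
import re

# Hypothetical condensed form of datePat, using the same one-letter
# group codes (D, M, Y, h, m, s, c, Z, z) as the constants above.
DEMO_PAT = re.compile(
    r'(?P<D>\d{2})/(?P<M>[a-zA-Z]{3})/(?P<Y>\d{4})'
    r':(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})'
    r' (?P<c>[\-+])(?P<Z>\d{2})(?P<z>\d{2})')

def demo_fields(text):
    """Return the named groups as a dict, or None if text does not match."""
    match = DEMO_PAT.match(text)
    if match is None:
        return None
    return match.groupdict()

fields = demo_fields("08/Feb/2009:04:02:54 -0700")
print(fields)
```

Note that the group names are case-sensitive, so "M" (month) and "m" (minutes) are distinct groups, as are "Z" and "z".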

We'll also need a dictionary to translate month names into month numbers:

pageget.py
monthDict  =  { "jan":  1,  "feb":  2,  "mar":  3,  "apr":  4,
                "may":  5,  "jun":  6,  "jul":  7,  "aug":  8,
                "sep":  9,  "oct": 10,  "nov": 11,  "dec": 12 }

The function header and intended function:

pageget.py
def scanDateGroup ( dateGroup ):
    """Extract the access timestamp from the raw dateGroup

      [ dateGroup is a string ->
          if dateGroup is a valid date/time ->
            return that date as a datetime.datetime instance
          else ->
            raise ValueError ]
    """

First we apply the regular expression to the argument, and then extract the subfields into nine local variables. If the match fails, we consider the line badly formed.

pageget.py
    #-- 1 --
    # [ if dateGroup matches datePat ->
    #     m  :=  a Match object for that match
    #   else ->
    #     raise ValueError ]
    m  =  datePat.match ( dateGroup )
    if  m is None:
        raise ValueError ( "Bad date format: '%s'" % dateGroup )

    #-- 2 --
    yyyy  =  int ( m.group ( YYYY_CODE ) )
    mon   =  m.group ( MON_CODE )
    dom   =  int ( m.group ( DOM_CODE ) )
    hh    =  int ( m.group ( HOUR_CODE ) )
    mm    =  int ( m.group ( MIN_CODE ) )
    ss    =  int ( m.group ( SEC_CODE ) )
    rawSign  =  m.group ( TZSIGN_CODE )
    tzhh  =  int ( m.group ( TZHH_CODE ) )
    tzmm  =  int ( m.group ( TZMM_CODE ) )

Next, translate the month name into a month number using the monthDict mapping.

pageget.py
    #-- 3 --
    # [ if (mon is a valid three-character month code) ->
    #     monthNo  :=  the corresponding month number
    #   else -> raise ValueError ]
    try:
        monthNo    =  monthDict [ mon.lower() ]
    except KeyError:
        raise ValueError ( "Unknown month code '%s'" % mon.lower() )
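
The month lookup matters because the regular expression accepts any three letters, so a string such as "Foo" passes the pattern and is rejected only at this step. Here is a minimal standalone sketch of the same idea, written for Python 3; the names MONTHS and month_number are hypothetical and do not appear in pageget.py.

```python
# Same mapping as monthDict, keyed by lowercase three-letter codes.
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4,
          "may": 5, "jun": 6, "jul": 7, "aug": 8,
          "sep": 9, "oct": 10, "nov": 11, "dec": 12}

def month_number(code):
    """Translate a month code such as 'Feb' to 1..12, ignoring case."""
    try:
        return MONTHS[code.lower()]
    except KeyError:
        # Only a missing key means a bad month code; other errors
        # are allowed to propagate unchanged.
        raise ValueError("Unknown month code '%s'" % code.lower())

print(month_number("Feb"))
```

Catching only KeyError, rather than every exception, keeps unrelated failures from being reported as a bad month code.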

The zone correction for our FixedTimeZone class (see Section 51.2, “class FixedTimeZone”) is expressed in minutes east of UTC. Combine the sign (rawSign), hours (tzhh), and minutes (tzmm) into a single signed value. Since this application will probably never run anywhere but Mountain Time, you may think that this kind of generality is overkill, but who will be laughing next time the earth's poles shift?

pageget.py
    #-- 4 --
    # [ if rawSign is "-" ->
    #     zone  :=  a FixedTimeZone instance representing
    #               (tzhh) hours and (tzmm) minutes west of UTC
    #   else ->
    #     zone  :=  a FixedTimeZone instance representing
    #               (tzhh) hours and (tzmm) minutes east of UTC ]
    zoneName  =  "%s%02d%02d" % (rawSign, tzhh, tzmm)
    if rawSign == "-":
        mmEast  =  - ( tzhh * 60 + tzmm )
    else:
        mmEast  =  tzhh * 60 + tzmm
    zone  =  FixedTimeZone ( mmEast, zoneName )
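
The sign arithmetic can be checked in isolation. Because FixedTimeZone is defined in another section, this sketch substitutes Python 3's standard datetime.timezone class as a stand-in; that substitution, and the helper name minutes_east, are assumptions made for illustration only.

```python
import datetime

def minutes_east(sign, hours, minutes):
    """Combine sign, hours, and minutes into signed minutes east of UTC."""
    magnitude = hours * 60 + minutes
    if sign == "-":
        return -magnitude      # West of UTC
    return magnitude           # East of UTC

# The sample timestamp's correction, "-0700", is 420 minutes west.
offset = minutes_east("-", 7, 0)

# Stand-in for FixedTimeZone ( mmEast, zoneName ).
zone = datetime.timezone(datetime.timedelta(minutes=offset), "-0700")
print(offset)
```

A correction of "+0530" would give 330 minutes east by the same rule.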

Now we have all the pieces we need to assemble a zone-aware datetime.datetime instance. The seventh argument is microseconds; the eighth is the zone correction.

pageget.py
    #-- 5 --
    return datetime.datetime ( yyyy, monthNo, dom, hh, mm, ss, 0, zone )
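
As a rough check of the final step, the sketch below builds the zone-aware value that the sample timestamp 08/Feb/2009:04:02:54 -0700 should produce and converts it to UTC. It again assumes Python 3 and uses the standard datetime.timezone class in place of FixedTimeZone, so it illustrates the constructor call rather than the actual pageget.py classes.

```python
import datetime

# Stand-in for the FixedTimeZone instance: 420 minutes west of UTC.
zone = datetime.timezone(datetime.timedelta(minutes=-420), "-0700")

# Same argument order as step 5: year, month, day, hour, minute,
# second, microsecond, tzinfo.
stamp = datetime.datetime(2009, 2, 8, 4, 2, 54, 0, zone)

# A zone-aware datetime can be converted to any other zone.
utc = stamp.astimezone(datetime.timezone.utc)

print(stamp.isoformat())   # 2009-02-08T04:02:54-07:00
print(utc.isoformat())     # 2009-02-08T11:02:54+00:00
```

Carrying the zone inside the datetime means two records logged in different zones can still be compared directly.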