Next / Previous / Contents / Shipman's homepage

5.11. Forward: The parser placeholder

pp.Forward()

This one is going to a little complicated to explain.

There are certain parsing situations where you can't really write correct BNF that describes what is correctly structured and what is not. For example, consider the pattern “two copies of the same digit, one after the other.” For example, the strings '33' and '99' match that description.

Writing a pattern that matches one digit is easy: “pp.Word(pp.nums, exact=1)” works perfectly well. And two of those in succession would be “pp.Word(pp.nums, exact=1) + pp.Word(pp.nums, exact=1)”.

However, although that pattern matches '33', it also matches '37'. So how would your script specify that both the pieces matched the same digit?

In pyparsing, we do this with an instance Forward class. Such an instance is basically an empty placeholder where a pattern will be added later, during execution of the parsing.

Let's look at a complete script that demonstrates use of the Forward pattern. For this example, we will reach back to ancient computing history for a feature of the early versions of the FORTRAN programming language: Hollerith string constants.

A Hollerith constant is a way to represent a string of characters. It consists of a count, followed by the letter 'H', followed by the number of characters specified by the count. Here are two examples, with their Python equivalents:

1HX'X'
10H0123456789'0123456789'

We'll write our pattern so that the 'H' can be either uppercase or lowercase.

Here's the complete script. We start with the usual preliminaries: imports, some test strings, the main, and a function to run each test.

hollerith
#!/usr/bin/env python
#================================================================
# hollerith:  Demonstrate Forward class
#----------------------------------------------------------------
import sys
import pyparsing as pp

# - - - - -   M a n i f e s t   c o n s t a n t s

TEST_STRINGS = [ '1HX', '2h$#', '10H0123456789', '999Hoops']

# - - - - -   m a i n

def main():
    holler = hollerith()
    for text in TEST_STRINGS:
        test(holler, text)

# - - -   t e s t

def test(pat, text):
    '''Test to see if text matches parser (pat).
    '''
    print "--- Test for '{0}'".format(text)
    try:
        result = pat.parseString(text)
        print "  Matches: '{0}'".format(result[0])
    except pp.ParseException as x:
        print "  No match: '{0}'".format(str(x))

Next we'll define the function hollerith() that returns a parse for a Hollerith string.

hollerith
# - - -   h o l l e r i t h

def hollerith():
    '''Returns a parser for a FORTRAN Hollerith character constant.
    '''

First we define a parser intExpr that matches the character count. It has a parse action that converts the number from character form to a Python int. The lambda expression defines a nameless function that takes a list of tokens and converts the first token to an int.

hollerith
    #--
    # Define a recognizer for the character count.
    #--
    intExpr = pp.Word(pp.nums).setParseAction(lambda t: int(t[0]))

Next we create an empty Forward parser as a placeholder for the logic that matches the 'H' and the following characters.

hollerith
    #--
    # Allocate a placeholder for the rest of the parsing logic.
    #--
    stringExpr = pp.Forward()

Next we define a closure that will be added to intExpr as a second parse action. Notice that we are defining a function within a function. The countedParseAction function will retain access to an external name (stringExpr, which is defined in the outer function's scope) after the function is defined.

hollerith
    #--
    # Define a closure that transfers the character count from
    # the intExpr to the stringExpr.
    #--
    def countedParseAction(toks):
        '''Closure to define the content of stringExpr.
        '''

The argument is the list of tokens that was recognized by intExpr; because of its parse action, this list contains the count as a single int.

hollerith
        n = toks[0]

The contents parser will match exactly n characters. We'll use Section 5.5, “CharsNotIn: Match characters not in a given set” to do this match, specifying the excluded characters as an empty string so that any character will be included. Incidentally, this does not for n==0, but '0H' is not a valid Hollerith literal. A more robust implementation would raise a pp.ParseException in this case.

hollerith
        #--
        # Create a parser for any (n) characters.
        #--
        contents = pp.CharsNotIn('', exact=n)

This next line inserts the final pattern into the placeholder parser: an 'H' in either case followed by the contents pattern. The '<<' operator is overloaded in the Forward class to perform this operation: for any Forward recognizer F and any parser p, the expression “F << p” modifies F so that it matches pattern p.

hollerith
        #--
        # Store a recognizer for 'H' + contents into stringExpr.
        #--
        stringExpr << (pp.Suppress(pp.CaselessLiteral('H')) + contents)

Parse actions may elect to modify the recognized tokens, but we don't need to do that, so we return None to signify that the tokens remain unchanged.

hollerith
        return None

That is the end of the countedParseAction closure. We are now back in the scope of hollerith(). The next line adds the closure as the second parse action for the intExpr parser.

hollerith
    #--
    # Add the above closure as a parse action for intExpr.
    #--
    intExpr.addParseAction(countedParseAction)

Now we are ready to return the completed hollerith parser: intExpr recognizes the count and stringExpr recognizes the 'H' and string contents. When we return it, it is still just an empty Forward, but it will be filled in before it asked to parse.

hollerith
    #--
    # Return the completed pattern.
    #--
    return (pp.Suppress(intExpr) + stringExpr)

# - - - - -   E p i l o g u e

if __name__ == "__main__":
    main()

Here is the output of the script. Note that the last test fails because the '999H' is not followed by 999 more characters.

--- Test for '1HX'
  Matches: 'X'
--- Test for '2h$#'
  Matches: '$#'
--- Test for '10H0123456789'
  Matches: '0123456789'
--- Test for '999Hoops'
  No match: 'Expected !W:() (at char 8), (line:1, col:9)'