
Abstract

A quick reference guide for pyparsing, a recursive descent parser framework for the Python programming language.

This publication is available in Web form and also as a PDF document. Please forward any comments to tcc-doc@nmt.edu.

Table of Contents

1. pyparsing: A tool for extracting information from text
2. Structuring your application
3. A small, complete example
4. How to structure the returned ParseResults
4.1. Use pp.Group() to divide and conquer
4.2. Structuring with results names
5. Classes
5.1. ParserElement: The basic parser building block
5.2. And: Sequence
5.3. CaselessKeyword: Case-insensitive keyword match
5.4. CaselessLiteral: Case-insensitive string match
5.5. CharsNotIn: Match characters not in a given set
5.6. Combine: Fuse components together
5.7. Dict: A scanner for tables
5.8. Each: Require components in any order
5.9. Empty: Match empty content
5.10. FollowedBy: Adding lookahead constraints
5.11. Forward: The parser placeholder
5.12. GoToColumn: Advance to a specified position in the line
5.13. Group: Group repeated items into a list
5.14. Keyword: Match a literal string not adjacent to specified context
5.15. LineEnd: Match end of line
5.16. LineStart: Match start of line
5.17. Literal: Match a specific string
5.18. MatchFirst: Try multiple matches in a given order
5.19. NoMatch: A parser that never matches
5.20. NotAny: General lookahead condition
5.21. OneOrMore: Repeat a pattern one or more times
5.22. Optional: Match an optional pattern
5.23. Or: Parse one of a set of alternatives
5.24. ParseException
5.25. ParseFatalException: Get me out of here!
5.26. ParseResults: Result returned from a match
5.27. QuotedString: Match a delimited string
5.28. Regex: Match a regular expression
5.29. SkipTo: Search ahead for a pattern
5.30. StringEnd: Match the end of the text
5.31. StringStart: Match the start of the text
5.32. Suppress: Omit matched text from the result
5.33. Upcase: Uppercase the result
5.34. White: Match whitespace
5.35. Word: Match characters from a specified set
5.36. WordEnd: Match only at the end of a word
5.37. WordStart: Match only at the start of a word
5.38. ZeroOrMore: Match any number of repetitions including none
6. Functions
6.1. col(): Convert a position to a column number
6.2. countedArray: Parse N followed by N things
6.3. delimitedList(): Create a parser for a delimited list
6.4. dictOf(): Build a dictionary from key/value pairs
6.5. downcaseTokens(): Lowercasing parse action
6.6. getTokensEndLoc(): Find the end of the tokens
6.7. line(): In what line does a location occur?
6.8. lineno(): Convert a position to a line number
6.9. matchOnlyAtCol(): Parse action to limit matches to a specific column
6.10. matchPreviousExpr(): Match the text that the preceding expression matched
6.11. matchPreviousLiteral(): Match the literal text that the preceding expression matched
6.12. nestedExpr(): Parser for nested lists
6.13. oneOf(): Check for multiple literals, longest first
6.14. srange(): Specify ranges of characters
6.15. removeQuotes(): Strip leading and trailing quotes
6.16. replaceWith(): Substitute a constant value for the matched text
6.17. traceParseAction(): Decorate a parse action with trace output
6.18. upcaseTokens(): Uppercasing parse action
7. Variables
7.1. alphanums: The alphanumeric characters
7.2. alphas: The letters
7.3. alphas8bit: Supplement Unicode letters
7.4. cStyleComment: Match a C-language comment
7.5. commaSeparatedList: Parser for a comma-separated list
7.6. cppStyleComment: Parser for C++ comments
7.7. dblQuotedString: String enclosed in "..."
7.8. dblSlashComment: Parser for a comment that starts with “//”
7.9. empty: Match empty content
7.10. hexnums: All hex digits
7.11. javaStyleComment: Comments in Java syntax
7.12. lineEnd: An instance of LineEnd
7.13. lineStart: An instance of LineStart
7.14. nums: The decimal digits
7.15. printables: All the printable non-whitespace characters
7.16. punc8bit: Some Unicode punctuation marks
7.17. pythonStyleComment: Comments in the style of the Python language
7.18. quotedString: Parser for a default quoted string
7.19. restOfLine: Match the rest of the current line
7.20. sglQuotedString: String enclosed in '...'
7.21. stringEnd: Matches the end of the string
7.22. unicodeString: Match a Python-style Unicode string

1. pyparsing: A tool for extracting information from text

The purpose of the pyparsing module is to give programmers using the Python programming language a tool for extracting information from structured textual data.

This module is more powerful than regular expressions, as embodied in the Python re module, but not as general as a full-blown compiler.
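One way to see the difference from regular expressions is arbitrarily nested structure, which Python's re module cannot describe directly. The following is a minimal sketch, assuming that nested parenthesized lists are a fair example of such structure; the names LPAR, RPAR, item, and nested are chosen here for illustration only.

```python
import pyparsing as pp

# Parentheses delimit a list but are not kept in the result.
LPAR = pp.Suppress("(")
RPAR = pp.Suppress(")")

# Forward() is a placeholder that lets a pattern refer to itself.
item = pp.Forward()
nested = pp.Group(LPAR + pp.ZeroOrMore(item) + RPAR)
item <<= pp.Word(pp.alphanums) | nested

tree = nested.parseString("(a (b c) (d (e)))").asList()
print(tree)   # [['a', ['b', 'c'], ['d', ['e']]]]
```

Each level of parentheses in the input becomes one level of nesting in the resulting Python list.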

In order to find information within structured text, we must be able to describe that structure. The pyparsing module builds on the fundamental syntax description technology embodied in Backus-Naur Form, or BNF. Some familiarity with the various syntax notations based on BNF will be most helpful to you in using this package.

The pyparsing module matches patterns in the input text using a recursive descent parser: we write BNF-like syntax productions, and pyparsing provides a machine that matches the input text against those productions.
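As a rough illustration of how BNF-like productions map onto pyparsing objects, here is a small sketch. The three-production grammar in the comments is hypothetical and was made up for this example.

```python
import pyparsing as pp

# Hypothetical grammar, written as BNF-like productions:
#   assignment ::= identifier "=" integer
#   identifier ::= letter (letter | digit | "_")*
#   integer    ::= digit+
identifier = pp.Word(pp.alphas, pp.alphanums + "_")
integer = pp.Word(pp.nums)
assignment = identifier + pp.Literal("=") + integer

# Whitespace between tokens is skipped by default.
result = assignment.parseString("count = 42")
print(result.asList())   # ['count', '=', '42']
```

Each production becomes one Python expression, and the + operator expresses "followed by". If the input does not match, parseString() raises a ParseException.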

The pyparsing module works best when you can describe the exact syntactic structure of the text you are analyzing. A common application of pyparsing is the analysis of log files. Log file entries generally have a predictable structure, with fields such as dates and IP addresses. Possible applications of the module to natural language work are not addressed here.
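To make the log file case concrete, here is a minimal sketch. The entry layout (date, IP address, request method, path) is an assumed format invented for this example, not one taken from any particular server.

```python
import pyparsing as pp

digits = pp.Word(pp.nums)

# Combine() fuses adjacent pieces into a single token.
date = pp.Combine(digits + "-" + digits + "-" + digits)
ip_address = pp.Combine(digits + "." + digits + "." + digits + "." + digits)
method = pp.Word(pp.alphas)
path = pp.Word(pp.printables)

# Results names let us fetch each field by name afterwards.
log_entry = (date.setResultsName("date")
             + ip_address.setResultsName("ip")
             + method.setResultsName("method")
             + path.setResultsName("path"))

entry = log_entry.parseString("2013-05-24 192.168.0.17 GET /index.html")
print(entry["date"])   # 2013-05-24
print(entry["ip"])     # 192.168.0.17
print(entry["path"])   # /index.html
```

This sketch checks only the shape of each field. It does not verify, for instance, that each part of the IP address is in the range 0 to 255.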

Useful online references include:

Note

Not every feature is covered here; this document is an attempt to cover the features most people will use most of the time. See the reference documentation for all the grisly details.

In particular, the author feels strongly that pyparsing is not the right tool for parsing XML and HTML, so numerous related features are not covered here. For a much better XML/HTML tool, see Python XML processing with lxml.