
20.5. Regular expression matching with the re module

The re module provides functions for matching strings against regular expressions. See the O'Reilly book Mastering Regular Expressions by Friedl and Oram for the whys and hows of regular expressions. We discuss only the commonest functions here. Refer to the Python Library Reference for the full feature set.

Note: The raw string notation r'...' is most useful for regular expressions, because backslashes in a raw string are passed to the re module unchanged; see raw strings, above.

These characters have special meanings in regular expressions:

.           Matches any character except a newline.
^           Matches the start of the string.
$           Matches the end of the string, or just before a newline at the end of the string.
r*          Matches zero or more repetitions of regular expression r.
r+          Matches one or more repetitions of r.
r?          Matches zero or one r.
r*?         Non-greedy form of r*; matches as few characters as possible. The normal * operator is greedy: it matches as much text as possible.
r+?         Non-greedy form of r+.
r??         Non-greedy form of r?.
r{m,n}      Matches from m to n repetitions of r. For example, r'x{3,5}' matches between three and five copies of letter 'x'; r'0{4}' matches the string '0000'.
r{m,n}?     Non-greedy version of the previous form.
[...]       Matches one character from a set of characters. You can put all the allowable characters inside the brackets, or use a-b to mean all characters from a to b inclusive. For example, regular expression r'[abc]' will match either 'a', 'b', or 'c'. Pattern r'[0-9a-zA-Z]' will match any single letter or digit.
[^...]      Matches any character not in the given set.
rs          Matches expression r followed by expression s.
r|s         Matches either r or s.
(r)         Matches r and forms it into a group that can be retrieved separately after a match; see MatchObject, below. Groups are numbered starting from 1.
(?:r)       Matches r but does not form a group for later retrieval.
(?P<n>r)    Matches r and forms it into a named group, with name n, for later retrieval.
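A few of the operators above are easiest to compare side by side. The following is only a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much text as it can, so the match runs to the last '>'.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.match(r"<.*?>", text).group()

# A bounded repetition takes at most five of the seven x's.
three_to_five = re.match(r"x{3,5}", "xxxxxxx").group()

# A character set with ranges: one or more hexadecimal digits.
hex_digits = re.match(r"[0-9a-fA-F]+", "1aF9zz").group()

print(greedy)         # <b>bold</b> and <i>italic</i>
print(lazy)           # <b>
print(three_to_five)  # xxxxx
print(hex_digits)     # 1aF9
```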

These special sequences are recognized:

\n          Matches the same text as a group that matched earlier, where n is the number of that group. For example, r'([a-zA-Z]+):\1' matches the string "foo:foo".
\A          Matches only at the start of the string.
\b          Matches the empty string, but only at the start or end of a word (where a word is a run of alphanumeric or underscore characters). For example, r'foo\b' would match "foo" but not "foot".
\B          Matches the empty string when not at the start or end of a word.
\d          Matches any digit.
\D          Matches any non-digit.
\s          Matches any whitespace character.
\S          Matches any non-whitespace character.
\w          Matches any alphanumeric character or the underscore.
\W          Matches any character that is neither alphanumeric nor an underscore.
\Z          Matches only at the end of the string.
\\          Matches a backslash (\) character.
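As a rough illustration of the special sequences, the sketch below tries a back-reference, a word boundary, and the digit and whitespace shorthands. The test strings are invented for the example.

```python
import re

# \1 must repeat exactly what group 1 matched.
same = re.match(r"([a-zA-Z]+):\1", "foo:foo")
different = re.match(r"([a-zA-Z]+):\1", "foo:bar")

# \b requires a word boundary right after "foo".
whole_word = re.search(r"foo\b", "a foo here")
inside_word = re.search(r"foo\b", "a foot here")

# \d and \s are shorthand for the digit and whitespace sets.
date = re.match(r"\d{4}-\d{2}-\d{2}\s", "2024-01-31 12:00")

print(same is not None)        # True
print(different is None)       # True
print(whole_word is not None)  # True
print(inside_word is None)     # True
print(repr(date.group()))      # '2024-01-31 '
```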

There are two ways to use a regular expression with the re module. Assuming you import the module with import re, you can test whether a regular expression r matches the start of a string s with the construct re.match(r,s).

However, if you will be matching the same regular expression many times, the performance will be better if you compile the regular expression using re.compile(r), which returns a compiled regular expression object. You can then check a string s for matching by using the .match(s) method on that object.
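The two styles can be sketched as follows. The pattern and the list of lines are hypothetical; note also that the re module keeps a cache of recently used patterns, so the gain from compiling may be modest in small programs.

```python
import re

lines = ["alpha=1", "beta=22", "not a setting", "gamma=333"]

# One-off test: pass the pattern and the string to re.match().
one_off = re.match(r"[a-z]+=\d+", lines[0]) is not None

# Repeated use: compile once, then call .match() on the compiled object.
setting_re = re.compile(r"[a-z]+=\d+")
matching = [line for line in lines if setting_re.match(line)]

print(one_off)   # True
print(matching)  # ['alpha=1', 'beta=22', 'gamma=333']
```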

Here are the functions in module re:

compile(r[,f])

Compile regular expression r. Returns a compiled r.e. object; see the table of methods on such objects below. To get case-insensitive matching, use re.I as the f argument. There are other flags that may be passed to the f argument; see the Python Library Reference.
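For instance, a minimal sketch of the effect of the re.I flag (the pattern and input string are arbitrary):

```python
import re

plain = re.compile(r"python")
ignore_case = re.compile(r"python", re.I)

# Without the flag, the capital 'P' prevents a match.
plain_result = plain.match("Python 3")

# With re.I, letters match regardless of case.
ignore_case_result = ignore_case.match("Python 3")

print(plain_result)                # None
print(ignore_case_result.group())  # Python
```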

match(r,s[,f])

If r matches the start of string s, return a MatchObject (see below), otherwise return None.

search(r,s[,f])

Like match(), but matches r anywhere in s, not just at the beginning. Returns a MatchObject for the first match found, or None if there is none.
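The difference between match() and search() shows up as soon as the text of interest is not at the start of the string. A short sketch with an invented sample string:

```python
import re

s = "order number 4521"

# match() only tries at the beginning of s, where there is no digit.
at_start = re.match(r"\d+", s)

# search() scans forward until the pattern matches somewhere.
anywhere = re.search(r"\d+", s)

print(at_start)          # None
print(anywhere.group())  # 4521
```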

split(r,s[,maxsplit=m])

Splits string s into pieces wherever pattern r occurs. If r does not contain groups, returns a list of the parts of s that lie between the matches of r, in order; the matched separators are discarded. If r contains groups, the text matched by each group is also included in the list, in its own element between the surrounding pieces. If the m argument is given, at most m splits are made, and the remainder of s is returned as the final element of the list.
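The three behaviors of split() can be compared on one input. This is only a sketch; the separator pattern is an arbitrary choice.

```python
import re

s = "one, two;three"

# No groups: the separators are dropped.
no_groups = re.split(r"[,;]\s*", s)

# With a group: the text the group matched is kept as its own element.
with_groups = re.split(r"([,;])\s*", s)

# maxsplit=1: split once, leave the rest of the string intact.
limited = re.split(r"[,;]\s*", s, maxsplit=1)

print(no_groups)    # ['one', 'two', 'three']
print(with_groups)  # ['one', ',', 'two', ';', 'three']
print(limited)      # ['one', 'two;three']
```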

sub(r,R,s[,count=c])

Replace the leftmost non-overlapping parts of s that match r using R; returns s if there is no match. The R argument can be a string or a function that takes one MatchObject argument and returns the string to be substituted. If the c argument is supplied (defaulting to 0), no more than c replacements are done, where a value of 0 means do them all.
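A brief sketch of sub() with a string replacement, a count limit, and a replacement function. The helper name double is hypothetical.

```python
import re

s = "3 apples and 12 pears"

# String replacement, applied to every match.
all_replaced = re.sub(r"\d+", "N", s)

# count=1 stops after the first replacement.
first_only = re.sub(r"\d+", "N", s, count=1)

# A function receives each MatchObject and returns the replacement text.
def double(m):
    return str(int(m.group()) * 2)

doubled = re.sub(r"\d+", double, s)

# No match: the string comes back unchanged.
unchanged = re.sub(r"\d+", "N", "no digits")

print(all_replaced)  # N apples and N pears
print(first_only)    # N apples and 12 pears
print(doubled)       # 6 apples and 24 pears
print(unchanged)     # no digits
```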

20.5.1. Compiled regular expression objects

Compiled regular expression objects returned by re.compile() have these methods:

.match(s[,ps[,pe]])

If the start of string s matches, return a MatchObject; if there is no match, return None. If ps is given, it specifies the index within s where matching is to start; this defaults to 0. If pe is given, it specifies the index within s where matching stops: only the characters from ps up to but not including pe are examined, as if s were pe characters long. It defaults to the length of s.

.search(s[,ps[,pe]])

Like match(), but matches anywhere in s.
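The optional start and end positions might be used like this; the string and the index values are chosen only to make the effect visible.

```python
import re

digits = re.compile(r"\d+")
s = "ab12345"

# At index 0 the string starts with letters, so .match() fails.
from_zero = digits.match(s)

# Starting at index 2, the digits match.
from_two = digits.match(s, 2)

# Stopping at index 4 hides everything from '3' onward.
two_to_four = digits.match(s, 2, 4)

# .search() finds the digits without being told where they start.
found = digits.search(s)

print(from_zero)            # None
print(from_two.group())     # 12345
print(two_to_four.group())  # 12
print(found.group())        # 12345
```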

.split(s[,maxsplit=m])

Like re.split().

.sub(R,s[,count=c])

Like re.sub().

.pattern

The string from which this object was compiled.

20.5.2. Methods on a MatchObject

A MatchObject is the object returned by a successful .match() or .search(). Such an object has these methods and attributes:

.group([n])

Retrieves the text that matched. If there are no arguments, returns the entire string that matched. To retrieve just the text that matched the nth group, pass in an integer n, where the groups are numbered starting at 1. For example, for a MatchObject m, m.group(2) would return the text that matched the second group, or None if the second group took no part in the match (for example, because it is optional). If the regular expression has no second group at all, an IndexError exception is raised.

If you have named the groups in your regular expression using a construct of the form (?P<name>...), the n argument can be the name as a string. For example, if you have a group (?P<year>[\d]{4}) (which matches four digits), you can retrieve that field using m.group("year").
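The sketch below retrieves groups by number and by name, and shows the difference between a group that did not take part in the match and a group that does not exist. The date-like pattern is invented for the example.

```python
import re

# Group 1 is named "year"; group 3 is optional and absent from the input.
m = re.match(r"(?P<year>\d{4})-(\d{2})(-\d{2})?", "2024-07")

whole = m.group()          # the entire matched text
by_number = m.group(1)     # named groups are numbered too
by_name = m.group("year")
month = m.group(2)
skipped = m.group(3)       # the group exists but matched nothing

try:
    m.group(4)             # the pattern has only three groups
    missing = "no error"
except IndexError:
    missing = "IndexError"

print(whole)      # 2024-07
print(by_number)  # 2024
print(by_name)    # 2024
print(month)      # 07
print(skipped)    # None
print(missing)    # IndexError
```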

.groups()

Return a tuple (s1,s2,...) containing all the matched strings, where si is the string that matched the ith group, or None if that group took no part in the match.

.start([n])

Returns the location where a match started. If no argument is given, returns the index within the string where the entire match started. If an argument n is given, returns the index of the start of the match for the nth group.

.end([n])

Returns the location where a match ended. If no argument is given, returns the index of the first character past the match. If n is given, returns the index of the first character past where the nth group matched.

.span([n])

Returns a 2-tuple (m.start(n),m.end(n)).
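To see how .groups(), .start(), .end() and .span() relate to the original string, consider this small sketch with a made-up input:

```python
import re

s = "id: ab-123"
m = re.search(r"([a-z]+)-(\d+)", s)

all_groups = m.groups()  # one string per group
whole_span = m.span()    # same as (m.start(), m.end())
first_span = m.span(1)   # where group 1 matched
second_start = m.start(2)
second_end = m.end(2)

# The indices work as slice bounds on the original string.
sliced = s[second_start:second_end]

print(all_groups)  # ('ab', '123')
print(whole_span)  # (4, 10)
print(first_span)  # (4, 6)
print(sliced)      # 123
```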

.pos

The effective ps value passed to .match() or .search().

.endpos

The effective pe value passed to .match() or .search().

.re

The regular expression object used to produce this MatchObject.

.string

The s argument passed to .match() or .search().
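The attributes above record how the match was requested. A short sketch, with arbitrary positions:

```python
import re

word = re.compile(r"[a-z]+")
s = "12 abc 34"

# Explicit start and end positions are stored on the MatchObject.
m = word.search(s, 2, 8)

# Without them, .pos is 0 and .endpos is the length of the string.
m_default = word.search(s)

print(m.group())                        # abc
print(m.pos, m.endpos)                  # 2 8
print(m_default.pos, m_default.endpos)  # 0 9
print(m.re.pattern)                     # [a-z]+
print(m.string)                         # 12 abc 34
```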