
28.5. re: Regular expression pattern-matching

The re module provides functions for matching strings against regular expressions. See the O'Reilly book Mastering Regular Expressions by Jeffrey Friedl for the whys and hows of regular expressions. We discuss only the most common functions here. Refer to the Python Library Reference for the full feature set.

28.5.1. Characters in regular expressions

Note: The raw string notation r'...' is most useful for regular expressions; see raw strings, above.

These characters have special meanings in regular expressions:

. Matches any character except a newline.
^ Matches the start of the string.
$ Matches the end of the string, or just before a newline at the end of the string.
r* Matches zero or more repetitions of regular expression r.
r+ Matches one or more repetitions of r.
r? Matches zero or one r.
r*? Non-greedy form of r*; matches as few characters as possible. The normal * operator is greedy: it matches as much text as possible.
r+? Non-greedy form of r+.
r?? Non-greedy form of r?.
r{m,n} Matches from m to n repetitions of r. For example, r'x{3,5}' matches between three and five copies of letter 'x'. The shorter form r{m} matches exactly m repetitions: r'(bl){4}' matches the string 'blblblbl'.
r{m,n}? Non-greedy version of the previous form.
[...] Matches one character from a set of characters. You can put all the allowable characters inside the brackets, or use a-b to mean all characters from a to b inclusive. For example, regular expression r'[abc]' will match either 'a', 'b', or 'c'. Pattern r'[0-9a-zA-Z]' will match any single letter or digit.
[^...] Matches any character not in the given set.
rs Matches expression r followed by expression s.
r|s Matches either r or s.
(r) Matches r and forms it into a group that can be retrieved separately after a match; see MatchObject, below. Groups are numbered starting from 1.
(?:r) Matches r but does not form a group for later retrieval.
(?P<n>r) Matches r and forms it into a named group, with name n, for later retrieval.
(?P=n) Matches whatever string matched an earlier (?P<n>r) group.
(?#...) Comment: the “...” portion is ignored and may contain a comment.
(?=...) The “...” portion must be matched, but is not consumed by the match. This is sometimes called a lookahead match. For example, r'a(?=bcd)' matches the string 'abcd' but not the string 'abcxyz'. Compared to using r'abcd' as the regular expression, the difference is that in this case the matched portion would be 'a' and not 'abcd'.
(?!...) This is similar to the (?=...): it specifies a regular expression that must not match, but does not consume any characters. For example, r'a(?!bcd)' would match 'axyz', and return 'a' as the matched portion; but it would not match 'abcdef'. You could call it a negative lookahead match.
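
The difference between greedy and non-greedy repetition, and between the two lookahead forms, is easiest to see side by side. The following is a small sketch; the sample text and variable names are made up for illustration, and the code is written to run under either Python 2.7 or Python 3.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .* takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r'<.*>', text).group()    # '<b>bold</b> and <i>italic</i>'

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.search(r'<.*?>', text).group()     # '<b>'

# Lookahead: 'bcd' must follow the 'a', but is not part of the match.
ahead = re.match(r'a(?=bcd)', 'abcd').group()    # 'a'

# Negative lookahead: the match fails because 'bcd' follows the 'a'.
blocked = re.match(r'a(?!bcd)', 'abcdef')        # None
```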

The special sequences in the table below are recognized. However, some of them behave differently when the LOCALE or UNICODE flag is in effect; see Section 28.5.3, “Flags for regular expression functions”, and Section 19.4, “What is the locale?”. For example, with the LOCALE flag, the r'\w' sequence matches characters that are considered alphanumeric in the current locale.

\n Matches the same text as a group that matched earlier, where n is the number of that group. For example, r'([a-zA-Z]+):\1' matches the string "foo:foo".
\A Matches only at the start of the string.
\b Matches the empty string but only at the start or end of a word (where a word is set off by whitespace or a non-alphanumeric character). For example, r'foo\b' would match "foo" but not "foot".
\B Matches the empty string when not at the start or end of a word.
\d Matches any digit.
\D Matches any non-digit.
\s Matches any whitespace character.
\S Matches any non-whitespace character.
\w Matches any alphanumeric character plus the underscore '_'.
\W Matches any character that \w does not match: anything other than an alphanumeric character or the underscore.
\Z Matches only at the end of the string.
\\ Matches a backslash (\) character.
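
A few of the sequences above in action. This is an illustrative sketch only; the sample strings are invented, and the code is written to run under either Python 2.7 or Python 3.

```python
import re

# \1 must repeat exactly the text that group 1 matched.
same = re.match(r'([a-zA-Z]+):\1', 'foo:foo')         # a MatchObject
different = re.match(r'([a-zA-Z]+):\1', 'foo:bar')    # None

# \b requires a word boundary right after 'foo'.
whole_word = re.search(r'foo\b', 'foo')       # a MatchObject
inside_word = re.search(r'foo\b', 'foot')     # None

# \d, \s and \w together: a number, some whitespace, then a word.
pair = re.search(r'(\d+)\s+(\w+)', 'buy 12 pears').groups()   # ('12', 'pears')
```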

28.5.2. Functions in the re module

There are two ways to match regular expressions with the re module. Assuming you import the module with import re, you can test whether a regular expression r matches the start of a string s with the construct:

re.match(r,s)

However, if you will be matching the same regular expression many times, the performance will be better if you compile the regular expression like this:

re.compile(r)

The re.compile() function returns a compiled regular expression object. You can then check a string s for matching by using the .match(s) method on that object.
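
Here is a sketch of the two styles side by side. The pattern (exactly five digits) and the list of candidate strings are hypothetical, chosen only to show that both styles give the same answers.

```python
import re

# Compile once, then reuse the compiled object for every string.
five_digits = re.compile(r'\d{5}$')      # hypothetical pattern

candidates = ['87801', '8780', '87801-4681', 'abcde']

# Style 1: the .match() method on the compiled object.
valid = [c for c in candidates if five_digits.match(c)]

# Style 2: the module-level function, given the pattern string each time.
valid_again = [c for c in candidates if re.match(r'\d{5}$', c)]

# Both lists are ['87801'].
```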

Here are the functions in module re:

compile(r[,flags])

Compile regular expression r. This function returns a compiled regular expression object; see Section 28.5.4, “Compiled regular expression objects”.

You can use the optional flags argument to change the behavior of the match. See Section 28.5.3, “Flags for regular expression functions”.

match(r,s[,flags])

If r matches the start of string s, return a MatchObject (see below), otherwise return None.

For the values of flags, see Section 28.5.3, “Flags for regular expression functions”.

search(r,s[,flags])

Like the match() function, but matches r anywhere in s, not just at the beginning.

For the values of flags, see Section 28.5.3, “Flags for regular expression functions”.
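
The difference between match() and search() can be sketched as follows. The sample string is made up for illustration.

```python
import re

s = 'say hello'

# match() succeeds only if the pattern matches at index 0 of s.
at_start = re.match(r'hello', s)       # None

# search() scans forward and finds the first place the pattern matches.
anywhere = re.search(r'hello', s)      # a MatchObject
where = anywhere.start()               # 4

# Anchoring the pattern with ^ makes search() behave like match() here.
anchored = re.search(r'^hello', s)     # None
```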

split(r,s[,maxsplit=m])

Splits string s into pieces where pattern r occurs. If r does not contain groups, returns a list of the parts of s that lie between the matches of r, in order; the matching text itself is discarded. If r contains groups, the text matched by each group is also included in the list, as separate elements between the non-matching parts. If the m argument is given, it specifies the maximum number of splits that will be made, and the remainder of s will be returned as a single string at the end of the list.
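
The three cases of split() (no groups, groups, and a split limit) might look like this. The input string is invented for the example.

```python
import re

s = 'a, b;c'

# No groups in the pattern: the separators are discarded.
plain = re.split(r'[,;]\s*', s)                 # ['a', 'b', 'c']

# A group around the separator character: it is kept in the result.
with_seps = re.split(r'([,;])\s*', s)           # ['a', ',', 'b', ';', 'c']

# At most one split; the rest of the string comes back as the last element.
limited = re.split(r'[,;]\s*', s, maxsplit=1)   # ['a', 'b;c']
```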

sub(r,R,s[,count=c])

Returns the string obtained by replacing the leftmost non-overlapping parts of s that match r with R; returns s unchanged if there is no match. The R argument can be a string or a function that takes one MatchObject argument and returns the string to be substituted. If the c argument is supplied (defaulting to 0), no more than c replacements are done, where a value of 0 means do them all.
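
A sketch of sub() with a string replacement, a function replacement, and a count limit. The helper function double() and the sample text are hypothetical.

```python
import re

s = 'cost: 3 apples, 12 pears'

# String replacement: every run of digits becomes '#'.
masked = re.sub(r'\d+', '#', s)          # 'cost: # apples, # pears'

# Function replacement: called once per match with the MatchObject.
def double(m):
    """Return the matched number, doubled, as a string."""
    return str(int(m.group()) * 2)

doubled = re.sub(r'\d+', double, s)      # 'cost: 6 apples, 24 pears'

# count=1 stops after the first replacement.
first_only = re.sub(r'\d+', '#', s, count=1)   # 'cost: # apples, 12 pears'
```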

28.5.3. Flags for regular expression functions

Several functions in the re module accept an optional flags argument that can change the behavior of those functions. This argument is formed by passing one of the flags below, or by using bitwise logical-or (“|”) to combine two or more of these flags.

I or IGNORECASE

Treat lowercase and uppercase letters the same.

L or LOCALE

Use the current locale to determine what characters match the special sequences \w, \W, \b, \B, \s, and \S.

M or MULTILINE

If you specify this flag, the “^” character matches the beginning of each line and the “$” character matches the end of each line. Without the flag, “^” matches only the beginning of the string and “$” matches only the end of the string.

S or DOTALL

If you specify this flag, the “.” character will match any character including a newline. Without it, “.” will not match a newline.

U or UNICODE

If you specify this flag, the meaning of the special sequences \w, \W, \b, \B, \s, and \S will depend on the Unicode character properties database.

Here's an example that shows how to do a case-insensitive match.

>>> import re
>>> print re.match(r'^[abc]+$', 'aabbbabc')
<_sre.SRE_Match object at 0x7f2ec8de8988>
>>> print re.match(r'^[abc]+$', 'aaBbbAbC')
None
>>> print re.match(r'^[abc]+$', 'aaBbbAbC', re.IGNORECASE)
<_sre.SRE_Match object at 0x7f2ec8de8988>
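
Flags can be combined with the “|” operator. The sketch below combines IGNORECASE with MULTILINE and also shows the effect of DOTALL; the three-line sample text is invented, and the code avoids print so that it runs under either Python 2.7 or Python 3.

```python
import re

text = 'first line\nSecond line\nthird line'

# Without MULTILINE, ^ matches only at the very start of the string.
single = re.search(r'^second', text, re.IGNORECASE)                 # None

# With both flags, ^ also matches after each newline, ignoring case.
multi = re.search(r'^second', text, re.IGNORECASE | re.MULTILINE)
found = multi.group()                                               # 'Second'

# Without DOTALL, '.' cannot cross the newlines between the lines.
no_dotall = re.match(r'first.*third', text)                         # None
dotall = re.match(r'first.*third', text, re.DOTALL)                 # a MatchObject
```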

28.5.4. Compiled regular expression objects

Compiled regular expression objects returned by re.compile() have these methods:

.match(s[,ps[,pe]])

If the start of string s matches, return a MatchObject; if there is no match, return None. If ps is given, it specifies the index within s where matching is to start; this defaults to 0. If pe is given, it specifies the index within s where matching must stop: matching behaves as if s were only pe characters long. It defaults to the length of s.

.pattern

The string from which this object was compiled.

.search(s[,ps[,pe]])

Like match(), but matches anywhere in s.
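
The ps and pe arguments can be sketched as follows, using a made-up string and a compiled pattern that matches a run of digits.

```python
import re

digits = re.compile(r'\d+')
s = 'abc123def456'

# By default, .match() tries only at index 0, where there is a letter.
at_zero = digits.match(s)                    # None

# With ps=3, matching starts at the '1'.
from_three = digits.match(s, 3).group()      # '123'

# .search() with ps=6 skips past the first number.
later = digits.search(s, 6).group()          # '456'

# With pe=5, only 'abc12' is visible to the search.
clipped = digits.search(s, 0, 5)
clipped_text = clipped.group()               # '12'
```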

.split(s[,maxsplit=m])

Like re.split().

.sub(R,s[,count=c])

Like re.sub().
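
For completeness, a short sketch of the .split() and .sub() methods and the .pattern attribute on one compiled object. The whitespace pattern and sample string are illustrative only.

```python
import re

ws = re.compile(r'\s+')
s = 'a  b \t c'

pieces = ws.split(s)                 # ['a', 'b', 'c']
squeezed = ws.sub(' ', s)            # 'a b c'
once = ws.sub(' ', s, count=1)       # 'a b \t c'
source = ws.pattern                  # the original pattern string, r'\s+'
```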

28.5.5. Methods on a MatchObject

A MatchObject is the object returned by a successful call to .match() or .search(). Such an object has these methods and attributes:

.end([n])

Returns the location where a match ended. If no argument is given, returns the index of the first character past the match. If n is given, returns the index of the first character past where the nth group matched, or -1 if that group did not take part in the match.

.endpos

The effective pe value passed to .match() or .search().

.group([n])

Retrieves the text that matched. If there are no arguments, or if n is zero, it returns the entire string that matched.

To retrieve just the text that matched the nth group, pass in an integer n, where the groups are numbered starting at 1. For example, for a MatchObject m, m.group(2) would return the text that matched the second group, or None if the second group did not take part in the match. If the regular expression has no second group, an IndexError exception is raised.

If you have named the groups in your regular expression using a construct of the form (?P<name>...), the n argument can be the name as a string. For example, if you have a group (?P<year>\d{4}) (which matches four digits), you can retrieve that field using m.group('year').
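
The ways of calling .group() might be sketched like this. The date pattern and its group names are hypothetical; the third group is deliberately left unnamed to show that numbered and named groups can be mixed.

```python
import re

date_pat = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})')
m = date_pat.match('2013-04-24')

whole = m.group()                 # '2013-04-24'
also_whole = m.group(0)           # same as m.group()
year_by_number = m.group(1)       # '2013'
year_by_name = m.group('year')    # '2013'
day = m.group(3)                  # '24'

# A group that exists but did not take part in the match gives None.
optional = re.match(r'(a)(b)?', 'a')
missing = optional.group(2)       # None
```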

.groups([default])

Return a tuple (s1,s2,...) containing all the matched strings, where si is the string that matched the ith group.

For groups that did not match, the corresponding value in the tuple will be None, or an optional default value that you specify in the call to this method.

.groupdict([default])

Return a dictionary whose keys are the named groups in the regular expression. Each corresponding value will be the text that matched the group. If a group did not match, the corresponding value will be None, or an alternate default value that you supply when you call the method.
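
A sketch of .groups() and .groupdict(), with and without a default value. The host-and-port pattern is invented for the example; the port group is optional so that it can fail to match.

```python
import re

# Hypothetical pattern: a host name, optionally followed by ':port'.
pat = re.compile(r'(?P<host>\w+)(?::(?P<port>\d+))?')

m = pat.match('localhost')

as_tuple = m.groups()             # ('localhost', None)
with_default = m.groups('80')     # ('localhost', '80')

as_dict = m.groupdict()           # {'host': 'localhost', 'port': None}
dict_default = m.groupdict('80')  # {'host': 'localhost', 'port': '80'}

# When the optional group does match, the default is not used.
full = pat.match('localhost:8080').groups('80')   # ('localhost', '8080')
```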

.lastgroup

Holds the name of the last named group (using the (?P<n>r) construct) that matched. It will be None if no named groups matched, or if the last group that matched was a numbered group and not a named group.

.lastindex

Holds the index of the last group that matched, or None if no groups matched.
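
One place where .lastgroup and .lastindex are handy is a pattern made of alternatives, where you want to know which alternative matched. The following is only a sketch; the token pattern and its group names are hypothetical.

```python
import re

# Two named alternatives and one unnamed (numbered) alternative.
token = re.compile(r'(?P<number>\d+)|(?P<word>[a-z]+)|(\W)')

m1 = token.match('42')
kind1 = (m1.lastgroup, m1.lastindex)     # ('number', 1)

m2 = token.match('abc')
kind2 = (m2.lastgroup, m2.lastindex)     # ('word', 2)

# The third group has no name, so .lastgroup is None.
m3 = token.match('+')
kind3 = (m3.lastgroup, m3.lastindex)     # (None, 3)

# A pattern with no groups at all: .lastindex is None.
no_groups = re.match(r'abc', 'abc').lastindex   # None
```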

.pos

The effective ps value passed to .match() or .search().

.re

The regular expression object used to produce this MatchObject.

.span([n])

Returns a 2-tuple (m.start(n),m.end(n)).

.start([n])

Returns the location where a match started. If no argument is given, returns the index within the string where the entire match started. If an argument n is given, returns the index of the start of the match for the nth group, or -1 if that group did not take part in the match.

.string

The s argument passed to .match() or .search().
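
The position-related methods and attributes fit together as in this sketch. The “key = value” string and the pattern are made up for illustration.

```python
import re

s = 'key = value'
m = re.search(r'(\w+)\s*=\s*(\w+)', s)

whole_span = m.span()            # (0, 11)
key_span = m.span(1)             # (0, 3)
value_start = m.start(2)         # 6
value_end = m.end(2)             # 11

# Slicing the original string with .start() and .end() gives the group text.
value = m.string[m.start(2):m.end(2)]    # 'value'

# .re is the compiled object that produced the match.
pattern_text = m.re.pattern              # the pattern string passed to search()
```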