Intermediate | Regular Expressions In Python
Escaping characters¶
Suppose you have a file of students and their grades on a recent exam.
Allison: C+
Bob: B-
Mary: C
Susan: A+
Ronald: F
Gerry: A+
If you want to find all students who received an A+, you might try using the regular expression A+, but this also matches Allison: C+
Allison: C+ <- match
Bob: B-
Mary: C
Susan: A+ <- match
Ronald: F
Gerry: A+ <- match
That's because + is a special regular expression metacharacter meaning "one or more repetitions". In other words, A+ means match one or more repetitions of A.
In order to match a plus sign (ignoring the default interpretation of +), you must escape it with a backslash.
For example, A\+ means match the string A+.
Similarly, \\ matches a single backslash, \. matches a period, \( matches an open parenthesis, and so on.
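As a rough illustration of the difference escaping makes, the sketch below filters the grade lines with the unescaped and the escaped pattern. The variable names are made up for this example, and the patterns use the r prefix (covered in the next section) so the backslash reaches the regex engine untouched. re.escape() is a standard-library helper that escapes metacharacters in a literal string for you.

```python
import re

# The grade lines from the example above, one "name: grade" entry per line.
grades = """Allison: C+
Bob: B-
Mary: C
Susan: A+
Ronald: F
Gerry: A+"""
lines = grades.splitlines()

# Unescaped: "A+" means "one or more A", so any line containing an "A" matches.
unescaped = [line for line in lines if re.search(r"A+", line)]

# Escaped: "A\+" means a literal "A" followed by a literal "+".
escaped = [line for line in lines if re.search(r"A\+", line)]

# re.escape() builds the escaped pattern from the literal text "A+".
auto_escaped = re.escape("A+")

print(unescaped)     # ['Allison: C+', 'Susan: A+', 'Gerry: A+']
print(escaped)       # ['Susan: A+', 'Gerry: A+']
print(auto_escaped)  # A\+
```

Only the escaped pattern singles out the two students who actually received an A+.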
Raw string literals ('r' prefix)¶
Python doesn't always interpret strings exactly as you write them. For example, in the string "hello\nworld", Python interprets \n as a newline character. This becomes evident when you print() the string.
>>> print("hello\nworld")
hello
world
If you want to avoid this special interpretation of \n, you can prefix the string with the letter r.
>>> print(r"hello\nworld")
hello\nworld
A string prefixed with the letter r like this is called a raw string literal.
Tip
To see how Python interprets a string, just print() it.
>>> print("hello world")
hello world
>>> print("\hello world")
\hello world
>>> print("\\hello world")
\hello world
>>> print("\\\\hello world")
\\hello world
>>> print("hello \"world\"")
hello "world"
>>> print("hello\nworld")
hello
world
Raw string literals vs escaping characters
Raw string literals are an alternative to escaping characters. For example, consider these equivalent strings.
r"hello\nworld" == "hello\\nworld"
# True
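The same equivalence matters when writing regex patterns, since the regex engine needs to receive the backslash itself. The sketch below, with illustrative variable names, compares a raw pattern with its hand-escaped twin and then uses the raw form to match a literal plus sign and a literal backslash.

```python
import re

raw_pattern = r"A\+"      # raw string: the backslash is kept as typed
escaped_pattern = "A\\+"  # regular string: the backslash is escaped by hand

# Both literals produce the same three characters: A, \, +
same = raw_pattern == escaped_pattern
length = len(raw_pattern)

# The pattern A\+ matches the literal text "A+".
grade = re.search(raw_pattern, "Susan: A+").group()

# The pattern \\ matches one literal backslash in the searched text.
backslash = re.search(r"\\", r"hello\world").group()

print(same, length, grade)  # True 3 A+
print(len(backslash))       # 1
```

Writing patterns as raw strings keeps them close to how the regular expression reads on paper, which is why it is a common habit in Python code.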
Capture Groups¶
Capture groups let you select parts of a regular expression match. For example, given the string,
I went to the market and bought 12 eggs, 6 carrots, and 2 hams.
The expression "\d+ \w+" matches these substrings:
12 eggs
6 carrots
2 hams
By wrapping \d+ and \w+ in parentheses, each match suddenly has two nested sub-matches. These are called "capture groups".
12 eggs: group 1 = 12, group 2 = eggs
6 carrots: group 1 = 6, group 2 = carrots
2 hams: group 1 = 2, group 2 = hams
In Python, if we identify the first match using re.search(),
import re
phrase = "I went to the market and bought 12 eggs, 6 carrots, and 2 hams."
first_match = re.search(pattern=r"(\d+) (\w+)", string=phrase)
print(first_match)
# <re.Match object; span=(32, 39), match='12 eggs'>
we can access its groups using the .groups() method,
first_match.groups()
# ('12', 'eggs')
or we can access each group individually using the .group() method
first_match.group(1) # '12'
first_match.group(2) # 'eggs'
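re.search() only returns the first match. To collect the groups from every match in the phrase, one option is re.findall() or re.finditer(), both part of the standard re module. The sketch below reuses the phrase and pattern from above; the names pairs and whole_matches are just for illustration.

```python
import re

phrase = "I went to the market and bought 12 eggs, 6 carrots, and 2 hams."
pattern = r"(\d+) (\w+)"

# With two capture groups, findall() returns one (group 1, group 2) tuple per match.
pairs = re.findall(pattern, phrase)
print(pairs)  # [('12', 'eggs'), ('6', 'carrots'), ('2', 'hams')]

# finditer() yields match objects, so the whole match (group 0) is available too.
whole_matches = []
for match in re.finditer(pattern, phrase):
    whole_matches.append(match.group(0))
    print(match.group(0), "->", match.group(1), match.group(2))
```

Group 0 always refers to the entire match, while groups 1 and up refer to the parenthesized parts in order.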
Nested and Non Capture Groups¶
Consider the following example, which matches Al or Bi followed by one or more word characters.
first_match = re.search(
    pattern=r"(Al|Bi)(\w+)",
    string="Amy loves Billy and hates Allen"
)
first_match.groups()
# ('Bi', 'lly')
There are two capture groups in this example: (Al|Bi) and (\w+). What if we wanted to make the entire pattern a single capture group? You might try
first_match = re.search(
    pattern=r"((Al|Bi)\w+)",
    string="Amy loves Billy and hates Allen"
)
first_match.groups()
# ('Billy', 'Bi')
but this returns two strings:
- one representing the outer capture group, ((Al|Bi)\w+), identified by the outermost parentheses
- another representing the inner capture group, (Al|Bi), identified by the innermost parentheses
This is known as a nested capture group.
The issue of selecting a single capture group remains! The problem stems from the Or operator (|), because we cannot drop the parentheses surrounding (Al|Bi) without changing the pattern's meaning. Without them, Al|Bi\w+ would match either Al on its own or Bi followed by word characters.
To get around this issue, we can change (Al|Bi) to (?:Al|Bi). The ?: prefix marks the group as a non-capturing group: it still groups the alternatives, but it is left out of .groups().
first_match = re.search(
    pattern=r"((?:Al|Bi)\w+)",
    string="Amy loves Billy and hates Allen"
)
first_match.groups()
# ('Billy',)
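The effect of capturing versus non-capturing groups is easy to see with re.findall(), which returns group contents when the pattern has a capture group and whole matches when it has none. The following is a small sketch on the same sentence, with variable names chosen for this example.

```python
import re

text = "Amy loves Billy and hates Allen"

# Capturing group: findall() returns only what the group captured.
capturing = re.findall(r"(Al|Bi)\w+", text)
print(capturing)  # ['Bi', 'Al']

# Non-capturing group: no capture groups, so findall() returns whole matches.
non_capturing = re.findall(r"(?:Al|Bi)\w+", text)
print(non_capturing)  # ['Billy', 'Allen']

# No parentheses at all: the pattern now means "Al" OR "Bi followed by word characters".
no_parens = re.findall(r"Al|Bi\w+", text)
print(no_parens)  # ['Billy', 'Al']
```

The last result shows why the parentheses cannot simply be dropped: Allen is cut down to Al because the \w+ applies only to the Bi alternative.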