Intermediate | Regular Expressions In Python

Escaping characters

Suppose you have a file of students and their grades on a recent exam.

Allison: C+
Bob: B-
Mary: C
Susan: A+
Ronald: F
Gerry: A+
Mark: C+

If you want to find all students who received an A+, you might try the regular expression A+, but this also matches Allison: C+, because the pattern matches the A in Allison.

Allison: C+  <- match
Bob: B-
Mary: C
Susan: A+    <- match
Ronald: F
Gerry: A+    <- match
Mark: C+

That's because + is a special regular expression metacharacter meaning "one or more repetitions". In other words, A+ means match one or more repetitions of A.
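A quick way to see the quantifier at work is to run the unescaped pattern against a small string. The sample text below is made up purely for illustration.

```python
import re

# Hypothetical sample text, chosen only to show how + behaves.
sample = "A AA AAA B"

# A+ matches each run of one or more consecutive 'A' characters.
runs = re.findall("A+", sample)
print(runs)  # ['A', 'AA', 'AAA']
```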

To match a literal plus sign (overriding the default interpretation of +), you must escape it with a backslash. For example, A\+ matches the literal string A+.

Similarly, \\ matches a single backslash, \. matches a period, \( matches an open parenthesis, and so on.
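As a minimal sketch of the difference, the snippet below filters the grade lines with the unescaped and the escaped pattern. The list name lines is an assumption for illustration, and it includes the Allison line from the failed attempt. re.escape() is a standard-library helper that escapes metacharacters for you.

```python
import re

# Grade lines from the example above, including Allison.
lines = ["Allison: C+", "Bob: B-", "Mary: C", "Susan: A+", "Ronald: F", "Gerry: A+"]

# Unescaped: + is a quantifier, so any line containing a capital A matches.
unescaped = [line for line in lines if re.search("A+", line)]

# Escaped: \+ is a literal plus sign. The backslash is doubled here because
# it sits inside an ordinary (non-raw) Python string.
escaped = [line for line in lines if re.search("A\\+", line)]

# re.escape() builds the escaped pattern from the literal text.
auto_escaped = re.escape("A+")

print(unescaped)  # ['Allison: C+', 'Susan: A+', 'Gerry: A+']
print(escaped)    # ['Susan: A+', 'Gerry: A+']
```

Using re.escape() is handy when the text to match comes from user input and may contain metacharacters you did not anticipate.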

Raw string literals ('r' prefix)

Python doesn't always interpret strings exactly as you write them. For example, in the string "hello\nworld", Python interprets \n as a newline character. This becomes evident when you print() the string.

>>> print("hello\nworld")
hello
world

If you want to avoid this special interpretation of \n, you can prefix the string with the letter r.

>>> print(r"hello\nworld")
hello\nworld

A string prefixed with the letter r like this is called a raw string literal.

Tip

To see how Python interprets a string, just print() it.

>>> print("hello world")
hello world

>>> print("\hello world")  # \h is not a recognized escape, so the backslash is kept (newer Python versions warn about this)
\hello world

>>> print("\\hello world")
\hello world

>>> print("\\\\hello world")
\\hello world

>>> print("hello \"world\"")
hello "world"

>>> print("hello\nworld")
hello
world

Raw string literals vs escaping characters

Raw string literals are an alternative to escaping characters. For example, consider these equivalent strings.

r"hello\nworld" == "hello\\nworld"
# True
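Raw strings matter for regular expressions because some regex escapes collide with Python's own string escapes. The sketch below uses \b, which is not covered above and is introduced here only as an illustration: Python reads "\b" as a backspace character, while the regex engine reads \b as a word boundary.

```python
import re

text = "cat concat cathedral"

# In an ordinary string, "\b" is the backspace character (one character),
# so this pattern looks for backspace + "cat" + backspace and finds nothing.
plain = re.findall("\bcat\b", text)

# In a raw string, \b reaches the regex engine intact and means "word boundary".
raw = re.findall(r"\bcat\b", text)

# Doubling the backslashes is the equivalent without the r prefix.
doubled = re.findall("\\bcat\\b", text)

print(plain)    # []
print(raw)      # ['cat']
print(doubled)  # ['cat']
```

For this reason, writing every regex pattern as a raw string is a common habit, even when the pattern happens to contain no backslashes.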

Capture Groups

Capture groups let you select parts of a regular expression match. For example, given the string,

I went to the market and bought 12 eggs, 6 carrots, and 2 hams.

The expression \d+ \w+ matches these substrings:

12 eggs
6 carrots
2 hams

By wrapping \d+ and \w+ in parentheses, as in (\d+) (\w+), each match gains two sub-matches: one for the number and one for the word. These are called "capture groups".

1  2
12 eggs

1 2
6 carrots

1 2
2 hams

In Python, if we identify the first match using re.search(),

import re

phrase = "I went to the market and bought 12 eggs, 6 carrots, and 2 hams."
first_match = re.search(pattern=r"(\d+) (\w+)", string=phrase)

print(first_match)
# <re.Match object; span=(32, 39), match='12 eggs'>

we can access its groups using the .groups() method,

first_match.groups()
# ('12', 'eggs')

or we can access each group individually using the .group() method

first_match.group(1)  # '12'
first_match.group(2)  # 'eggs'
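re.search() stops at the first match. To read the groups of every match, one option is re.finditer(), which yields a Match object per match. The snippet below is a sketch along those lines; the variable names purchases and first are illustrative.

```python
import re

phrase = "I went to the market and bought 12 eggs, 6 carrots, and 2 hams."

# re.finditer yields one Match object per non-overlapping match.
purchases = [
    (int(m.group(1)), m.group(2))
    for m in re.finditer(r"(\d+) (\w+)", phrase)
]
print(purchases)  # [(12, 'eggs'), (6, 'carrots'), (2, 'hams')]

# group(0) is the whole match rather than a capture group.
first = re.search(r"(\d+) (\w+)", phrase)
print(first.group(0))  # 12 eggs
```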

Nested and Non-Capturing Groups

Consider the following example, which matches Al or Bi followed by one or more word characters.

first_match = re.search(
    pattern=r"(Al|Bi)(\w+)",
    string="Amy loves Billy and hates Allen"
)
first_match.groups()
# ('Bi', 'lly')

There are two capture groups in this example: (Al|Bi) and (\w+). What if we wanted to make the entire pattern a single capture group? You might try

first_match = re.search(
    pattern=r"((Al|Bi)\w+)",
    string="Amy loves Billy and hates Allen"
)
first_match.groups()
# ('Billy', 'Bi')

but this returns two strings:

one representing the outer capture group, identified by the outermost parentheses: ((Al|Bi)\w+)
another representing the inner capture group, identified by the innermost parentheses: (Al|Bi)

This is known as a nested capture group.

The issue of selecting a single capture group remains! The problem stems from the alternation ("or") operator |, because we cannot drop the parentheses surrounding (Al|Bi) without changing the pattern's meaning.

To get around this issue, we can change (Al|Bi) to (?:Al|Bi). The ?: prefix marks the group as non-capturing: it still groups the alternatives, but its match is not stored as a separate group.

first_match = re.search(
    pattern=r"((?:Al|Bi)\w+)",
    string="Amy loves Billy and hates Allen"
)
first_match.groups()
# ('Billy',)
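The effect of ?: is also visible with re.findall(), which returns tuples when a pattern has several capture groups and plain strings when it has one or none. The following sketch reuses the sentence above; the variable names are illustrative.

```python
import re

text = "Amy loves Billy and hates Allen"

# Two capturing groups: findall returns a tuple per match.
nested = re.findall(r"((Al|Bi)\w+)", text)
print(nested)  # [('Billy', 'Bi'), ('Allen', 'Al')]

# Inner group made non-capturing: one group, so findall returns plain strings.
single = re.findall(r"((?:Al|Bi)\w+)", text)
print(single)  # ['Billy', 'Allen']

# With no capturing groups at all, findall returns the whole matches.
whole = re.findall(r"(?:Al|Bi)\w+", text)
print(whole)  # ['Billy', 'Allen']
```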