13.3. Regular Expressions

Regular expressions (or regex for short) are special patterns that we use to extract parts of strings. Think about the format of a Social Security Number (SSN) like 134-42-2012. To describe this format, we might say that SSNs consist of three digits, then a dash, two digits, another dash, then four digits. Regexes let us capture this pattern in code. For instance, a simple regex for SSNs could look like:

...-..-....

Although we haven’t yet introduced the syntax and special characters associated with regular expressions, or even how to “parse” them, you can probably make sense of the pattern above. It closely follows the written description of an SSN that we just gave. The syntax of regular expressions is fortunately quite simple to learn; we introduce nearly all of the syntax in this section alone.

As we introduce the concepts, we tackle some of the examples described in an earlier section, and show how to carry out the tasks with regular expressions. Almost all programming languages have a library to match patterns using regular expressions, making regular expressions useful regardless of the specific programming language. We use some of the common methods available in the Python built-in re module to accomplish the tasks from the examples. These methods are summarized in a table at the end of this section, where the basic usage and return value are briefly described. Since we only cover a few of the most commonly used methods, you may find it useful to consult the official documentation on the re module as well.

Regular expressions are based on searching a string one character (aka literal) at a time for a pattern. We call this notion concatenation of literals.

13.3.1. Concatenation of Literals

Concatenation is best explained with a basic example. Suppose we are looking for the pattern cat in the string Scatter!. Here’s how a computer matches literal patterns like cat:

  1. Begin with the first character in the string (S).

  2. Check whether it matches the first character in the pattern (c).

  3. If there isn’t a match, move onto the next character of the string.

  4. If there is a match, check the rest of the pattern (a, then t).

  5. If the entire pattern matches, report that the pattern was found.

Figure 13.1 contains a diagram of the idea behind this search through the string one character at a time. The pattern “cat” is found within the word Scatter! at positions 1-3 of the string (counting from 0). Once you get the hang of this process, you can move on to the richer set of patterns; they all follow from this basic paradigm.

../../_images/regex-literals.svg

Fig. 13.1 To match literal patterns, the computer moves along the string and checks whether the entire pattern is matched.
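
The search steps listed above can be sketched in a few lines of Python. This is only an illustration of the idea: find_literal is a hypothetical helper written for this example, and real regex engines are considerably more sophisticated.

```python
def find_literal(pattern, text):
    """Return the index where pattern first occurs in text, or -1 if absent."""
    # Try every starting position that leaves room for the whole pattern
    for start in range(len(text) - len(pattern) + 1):
        # Compare the pattern against the characters beginning at start
        if text[start:start + len(pattern)] == pattern:
            return start
    return -1

position = find_literal("cat", "Scatter!")
print(position)   # 1
```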

Note

In the example above we observe that regular expressions can match patterns that appear anywhere in the input string. In Python, this behavior differs depending on the method used to match the regex—some methods only return a match if the regex appears at the start of the string; some methods return a match anywhere in the string.
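
As a small illustration of this difference, the sketch below compares re.match, which only looks at the start of the string, with re.search, which scans the whole string. The variable names are our own.

```python
import re

text = "Scatter!"
at_start = re.match(r"cat", text)    # only tries to match at the start
anywhere = re.search(r"cat", text)   # scans through the whole string

print(at_start)          # None
print(anywhere.span())   # (1, 4)
```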

Character Classes

We can make patterns more flexible by using a character class (also known as a character set), which lets us specify a collection of equivalent characters to match. This allows us to create more relaxed matches. To create a character class, wrap the set of desired characters in brackets [ ]. For example, the pattern [0123456789] means “match any literal within the brackets”—in this case, any single digit. Then, the following regular expression matches three digits.

[0123456789][0123456789][0123456789]

This is such a commonly used character class that there is a shorthand notation for the range of digits, [0-9]. Character classes allow us to create a regex for SSNs:

[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]

Two other ranges that are commonly used in character classes are [a-z] for lowercase and [A-Z] for uppercase letters. We can combine ranges with other equivalent characters and use partial ranges. For example, [a-cX-Z27] is equivalent to the character class [abcXYZ27].

Let’s return to our original pattern cat and modify it to include two character classes:

c[oa][td]

This pattern matches cat, but it also matches cot, cad, and cod:

  Regex: c[oa][td]
   Text: The cat eats cod, cads, and cots, but not coats.
Matches:     ↑↑↑      ↑↑↑  ↑↑↑       ↑↑↑                 
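
The same matches can be reproduced with re.findall from Python's re module. The second part of this sketch checks, on a test string we made up, that a class written with ranges behaves like the class with every character listed.

```python
import re

text = "The cat eats cod, cads, and cots, but not coats."
matches = re.findall(r"c[oa][td]", text)
print(matches)   # ['cat', 'cod', 'cad', 'cot']

# A range is shorthand for listing every character it covers
sample = "abcdXYZ0127"
with_ranges = re.findall(r"[a-cX-Z27]", sample)
spelled_out = re.findall(r"[abcXYZ27]", sample)
print(with_ranges == spelled_out)   # True
```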

Negated Character Classes

A negated character class matches any character except those between the square brackets. To create a negated character class, place the caret symbol as the first character after the left square bracket. For example, [^0-9] matches any character except a digit.
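
For instance, applying [^0-9] to an SSN-like string picks out only the dashes. The example string is our own.

```python
import re

# Every character that is not a digit
non_digits = re.findall(r"[^0-9]", "382-34-3840")
print(non_digits)   # ['-', '-']
```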

Wildcard Character

When we really don’t care what the literal is, we can specify this with the period character (.). This matches any character except a newline.
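
A short sketch of the wildcard, using a made-up string that contains a newline between c and t in one place:

```python
import re

# "c\nt" has a newline in the middle, so the wildcard skips it
text = "cat cot c\nt c9t"
wild = re.findall(r"c.t", text)
print(wild)   # ['cat', 'cot', 'c9t']
```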

Escaping Meta Characters

We have now seen several special characters, called meta characters: [ and ] denote a character class, ^ switches to a negated character class, . represents any character, and - denotes a range. But sometimes we might want to create a pattern that matches one of these literals. When this happens, we must escape it with a backslash. For example, we can match the literal left bracket character using the regex \[.

  Regex: \[
   Text: Today is [2022/01/01]
Matches:          ↑           
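
In Python, the escaped bracket can be checked as follows. The sketch also uses re.escape, a function in the re module that adds the backslashes for us; it is not otherwise covered in this section.

```python
import re

text = "Today is [2022/01/01]"
bracket = re.search(r"\[", text)
print(bracket.start())   # 9

# re.escape() puts a backslash in front of the meta characters in a string,
# so the result can be used as a pattern that matches the string literally
literal = "[2022/01/01]"
found = re.findall(re.escape(literal), text)
print(found)   # ['[2022/01/01]']
```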

Next, we will show how quantifiers can help create a more compact and clear regular expression for SSNs.

13.3.2. Quantifiers

To create a regex to match SSNs, we wrote:

[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]

This matches 3 digits, a dash, 2 more digits, a dash, and 4 more digits.

Quantifiers allow us to match multiple consecutive appearances of a literal. We specify the number of repetitions by placing the number in curly braces { }.

# We add the `r` character before the quotes to make a raw string, which makes
# regexes easier to write in Python
ssn_re = r'[0-9]{3}-[0-9]{2}-[0-9]{4}'

# Python's builtin re module has methods for matching regexes
import re
re.findall(ssn_re, 'My SSN is 382-34-3840.')
['382-34-3840']
# The pattern shouldn't match phone numbers
re.findall(ssn_re, 'My phone is 382-123-3842.')
[]

A quantifier always modifies the character or character class to its immediate left. The following table shows the complete syntax for quantifiers.

Quantifier   Meaning
{m,n}        Match the preceding character m to n times.
{m}          Match the preceding character exactly m times.
{m,}         Match the preceding character at least m times.
{,n}         Match the preceding character at most n times.
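
To see each form of the quantifier in action, the sketch below uses re.fullmatch, which succeeds only when the entire string fits the pattern. The helper fits and the test strings are our own.

```python
import re

def fits(pattern, text):
    """True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(fits(r"a{2,3}", "aa"))     # True:  2 is within 2 to 3
print(fits(r"a{2,3}", "aaaa"))   # False: 4 is too many
print(fits(r"a{2}", "aaa"))      # False: exactly 2 required
print(fits(r"a{2,}", "aaaaa"))   # True:  at least 2
print(fits(r"a{,2}", ""))        # True:  "at most 2" includes 0
```

Note that Python expects no space inside the braces.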

Shorthand Quantifiers

Some commonly used quantifiers have a shorthand:

Symbol   Quantifier   Meaning
*        {0,}         Match the preceding character 0 or more times
+        {1,}         Match the preceding character 1 or more times
?        {0,1}        Match the preceding character 0 or 1 times
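
A brief sketch comparing the three shorthands on the same made-up word list, again using re.fullmatch so that the whole word must fit the pattern:

```python
import re

words = ["ark", "bark", "bbark"]

star = [w for w in words if re.fullmatch(r"b*ark", w)]
plus = [w for w in words if re.fullmatch(r"b+ark", w)]
optional = [w for w in words if re.fullmatch(r"b?ark", w)]

print(star)       # ['ark', 'bark', 'bbark']
print(plus)       # ['bark', 'bbark']
print(optional)   # ['ark', 'bark']
```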

We use the * character instead of {0,} in the following examples.

Quantifiers are greedy

Quantifiers will return the longest match possible. This sometimes results in surprising behavior. Since an SSN starts and ends with a digit, we might think the following shorter regex will be a simpler approach for finding SSNs. Can you figure out what went wrong in the matching?

ssn_re = r'[0-9].+[0-9]'
re.findall(ssn_re, 'My SSN is 382-34-3842 and hers is 382-34-3333.')
['382-34-3842 and hers is 382-34-3333']

In many cases, using a more specific character class prevents these false “over” matches:

ssn_re = r'[0-9\-]+[0-9]'
re.findall(ssn_re, 'My SSN is 382-34-3842 and hers is 382-34-3333.')
['382-34-3842', '382-34-3333']
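
Python's re module also offers non-greedy (lazy) quantifiers, written by adding ? after the quantifier, which return the shortest match possible. They are not covered elsewhere in this section, so treat the following as an optional sketch; the patterns are our own.

```python
import re

text = 'My SSN is 382-34-3842 and hers is 382-34-3333.'

# Greedy: .+ grabs as much as it can before the final dash and four digits
greedy = re.findall(r'[0-9]{3}-.+-[0-9]{4}', text)
# Lazy: .+? stops as soon as the rest of the pattern can match
lazy = re.findall(r'[0-9]{3}-.+?-[0-9]{4}', text)

print(greedy)   # ['382-34-3842 and hers is 382-34-3333']
print(lazy)     # ['382-34-3842', '382-34-3333']
```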

Literal concatenation and quantifiers are two of the core concepts in regular expressions. Next, we’ll introduce two more core concepts: alternation and grouping.

13.3.3. Alternation and Grouping to Create Features

Character classes let us match multiple options for a single literal. We can use alternation to match multiple options for a group of literals. For example, in the food safety example in Chapter 9 we marked violations related to body parts by seeing if the violation had the substring hand, nail, hair, or glove. We can use the | character in a regex to specify this alternation:

body_re = r"hand|nail|hair|glove"
re.findall(body_re, "unclean hands or improper use of gloves")
['hand', 'glove']
re.findall(body_re, "Unsanitary employee garments hair or nails")
['hair', 'nail']
# A web server log entry, used in the grouping examples that follow
log_entry = '169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses""Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"'

Grouping using parentheses

Every set of parentheses specifies a regex group, which allows us to extract parts of a pattern. For example, we can use groups to extract the day, month, year, and time from the web server log entry.

# This pattern matches the entire timestamp
time_re = r"\[[0-9]{2}/[a-zA-Z]{3}/[0-9]{4}:[0-9:\- ]*\]"
re.findall(time_re, log_entry)
['[26/Jan/2004:10:47:58 -0800]']
# Same regex, but we use parens to make regex groups...
time_re = r"\[([0-9]{2})/([a-zA-Z]{3})/([0-9]{4}):([0-9:\- ]*)\]"

# ...which tells findall() to split up the match into its groups
re.findall(time_re, log_entry)
[('26', 'Jan', '2004', '10:47:58 -0800')]

As we can see, re.findall returns a list of tuples containing the individual components of the date and time of the web log.
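
Groups are also available one at a time from the match object that re.search returns. The sketch below uses only the timestamp portion of the log entry, copied into a variable of our own, so that it runs on its own.

```python
import re

timestamp = "[26/Jan/2004:10:47:58 -0800]"
time_re = r"\[([0-9]{2})/([a-zA-Z]{3})/([0-9]{4}):([0-9:\- ]*)\]"

m = re.search(time_re, timestamp)
whole = m.group(0)    # group 0 is the entire match
month = m.group(2)    # groups are numbered from 1, left to right
parts = m.groups()    # all groups as a tuple

print(whole)   # [26/Jan/2004:10:47:58 -0800]
print(month)   # Jan
print(parts)   # ('26', 'Jan', '2004', '10:47:58 -0800')
```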

We have introduced a lot of terminology, so in the next section we’ll bring it all together into a set of tables for easy reference.

13.3.4. Reference Tables

We conclude this section with a few tables that summarize order of operation, meta characters, and shorthands for character classes. We also provide tables summarizing the handful of methods in the re Python library that we have used in this section.

The four basic operations for regular expressions (concatenation, quantifying, alternation, and grouping) have an order of precedence, which we make explicit in the table below.

Table 13.1 Order of Operations

Operation       Order   Example     Matches
concatenation   3       cat         cat
alternation     4       cat|mouse   cat and mouse
quantifying     2       cat?        ca and cat
grouping        1       c(at)?      c and cat
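
The precedence rules can be checked directly in Python. In this sketch, full_matches is a hypothetical helper of our own and the candidate strings are made up; the last pattern shows how grouping limits the reach of the alternation.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# The quantifier binds tighter than concatenation: ? applies to t only
quantified = full_matches(r"cat?", ["c", "ca", "cat"])
# Grouping comes first: ? now applies to the whole group (at)
grouped = full_matches(r"c(at)?", ["c", "ca", "cat"])
# Alternation comes last: the choice is between all of cat and all of mouse
alternated = full_matches(r"cat|mouse", ["cat", "mouse", "catmouse", "camouse"])
# Parentheses restrict the alternation to a single letter
restricted = full_matches(r"ca(t|m)ouse", ["cat", "mouse", "catouse", "camouse"])

print(quantified)   # ['ca', 'cat']
print(grouped)      # ['c', 'cat']
print(alternated)   # ['cat', 'mouse']
print(restricted)   # ['catouse', 'camouse']
```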

The following table provides a list of the meta characters introduced in this section, plus a few more. The column labeled “Doesn’t Match” gives examples of strings that the example regexes don’t match.

Table 13.2 Meta characters

Char   Description                                        Example     Matches        Doesn’t Match
.      Any character except \n                            ...         abc            ab
[ ]    Any character inside brackets                      [cb.]ar     car, .ar       jar
[^ ]   Any character not inside brackets                  [^b]ar      car, par       bar, ar
*      0 or more of previous symbol, shorthand for {0,}   [pb]*ark    bbark, ark     dark
+      1 or more of previous symbol, shorthand for {1,}   [pb]+ark    bbpark, bark   dark, ark
?      0 or 1 of previous symbol, shorthand for {0,1}     s?he        she, he        the
{n}    Exactly n of previous symbol                       hello{3}    hellooo        hello
|      Pattern before or after bar                        we|[ui]s    we, us, is     e, s
\      Escape next character                              \[hi\]      [hi]           hi
^      Beginning of line                                  ^ark        ark two        dark
$      End of line                                        ark$        noahs ark      noahs arks
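
The anchors ^ and $ appear only in this table, so here is a brief sketch using the table's own examples. One caveat for Python: by default ^ and $ anchor to the start and end of the whole string, and they anchor to each line only when the re.MULTILINE flag is passed. The two-line string is our own.

```python
import re

starts = [bool(re.search(r"^ark", s)) for s in ["ark two", "dark"]]
ends = [bool(re.search(r"ark$", s)) for s in ["noahs ark", "noahs arks"]]
print(starts)   # [True, False]
print(ends)     # [True, False]

# With re.MULTILINE, ^ also matches just after each newline
two_lines = "ark one\nark two"
default_hits = re.findall(r"^ark", two_lines)
multiline_hits = re.findall(r"^ark", two_lines, flags=re.MULTILINE)
print(default_hits)     # ['ark']
print(multiline_hits)   # ['ark', 'ark']
```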

Additionally, we provide a table of shorthands for some commonly used character sets. These shorthands don’t need [ ].

Table 13.3 Character Class Shorthands

Description                     Bracket Form        Shorthand
Alphanumeric character          [a-zA-Z0-9_]        \w
Not an alphanumeric character   [^a-zA-Z0-9_]       \W
Digit                           [0-9]               \d
Not a digit                     [^0-9]              \D
Whitespace                      [\t\n\f\r\p{Z}]     \s
Not whitespace                  [^\t\n\f\r\p{Z}]    \S
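
With these shorthands, the SSN pattern becomes more compact. The sketch below reuses the SSN sentence from earlier; the other two strings are made up. Keep in mind that for ordinary Python strings, \d and \w also match digits and letters from other scripts, so they are somewhat broader than the bracket forms in the table.

```python
import re

ssn_re = r"\d{3}-\d{2}-\d{4}"
ssns = re.findall(ssn_re, "My SSN is 382-34-3840.")
print(ssns)     # ['382-34-3840']

# \w includes the underscore, so hi_there stays in one piece
words = re.findall(r"\w+", "hi_there, you")
print(words)    # ['hi_there', 'you']

# \s+ treats any run of spaces or tabs as a single separator
pieces = re.split(r"\s+", "a  b\tc")
print(pieces)   # ['a', 'b', 'c']
```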

We used the following methods in re in this section. The names of the methods indicate what they do: search for or match a pattern in a string, find all occurrences of a pattern in a string, substitute all occurrences of a pattern with a replacement string, and split a string into pieces at the pattern. Each requires a pattern and a string, and some take extra arguments. The table below gives the usage of each method and describes its return value.

Table 13.4 Regular Expression Methods

Method                                 Return value
re.search(pattern, string)             truthy match object if the pattern is found anywhere in the string, otherwise None
re.match(pattern, string)              truthy match object if the pattern is found at the beginning of the string, otherwise None
re.findall(pattern, string)            list of all matches of pattern in string
re.sub(pattern, replacement, string)   string where all occurrences of pattern are replaced by replacement in the string
re.split(pattern, string)              list of the pieces of string around the occurrences of pattern
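
Here is one call to each of the five methods on a single made-up string, as a quick way to compare their return values:

```python
import re

text = "SSN 382-34-3840, phone 555-1234"

first_four = re.search(r"[0-9]{4}", text).group()   # first run of four digits
at_start = re.match(r"[0-9]", text)                 # text begins with a letter
numbers = re.findall(r"[0-9]+", text)
masked = re.sub(r"[0-9]", "X", text)
parts = re.split(r", ", text)

print(first_four)   # 3840
print(at_start)     # None
print(numbers)      # ['382', '34', '3840', '555', '1234']
print(masked)       # SSN XXX-XX-XXXX, phone XXX-XXXX
print(parts)        # ['SSN 382-34-3840', 'phone 555-1234']
```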

Regex and pandas

As seen in the previous section, pandas Series objects have a .str property that supports string manipulation using Python string methods. Conveniently, the .str property also supports some functions from the re module. The table below shows the analogous functionality from the above table of the re methods. Each requires a pattern. See the pandas docs for a complete list of string methods.

Table 13.5 Regular Expressions in Pandas

Method                              Return value
str.contains(pattern)               Series of booleans indicating whether the pattern is found
str.findall(pattern)                Series of lists of all matches of pattern
str.replace(pattern, replacement)   Series with all matching occurrences of pattern replaced by replacement
str.split(pattern)                  Series of lists of strings around given pattern
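
The sketch below applies these four methods to a small Series built from the violation strings used earlier, plus one made-up entry. One assumption to flag: in recent versions of pandas, str.replace treats the pattern as a literal string unless regex=True is passed, so we pass it explicitly.

```python
import pandas as pd

violations = pd.Series([
    "unclean hands or improper use of gloves",
    "floors not clean",
    "Unsanitary employee garments hair or nails",
])
body_re = r"hand|nail|hair|glove"

has_body = violations.str.contains(body_re).tolist()
found = violations.str.findall(body_re).tolist()
# regex=True asks pandas to interpret the pattern as a regular expression
joined = violations.str.replace(r"\s+", "_", regex=True)[1]
tokens = violations.str.split(r"\s+")[1]

print(has_body)   # [True, False, True]
print(found)      # [['hand', 'glove'], [], ['hair', 'nail']]
print(joined)     # floors_not_clean
print(tokens)     # ['floors', 'not', 'clean']
```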

Regular expressions are a powerful tool, but are somewhat notorious for being difficult to read and debug. We close with some advice for regexes.

  • Develop your regular expression on simple test strings to see what the pattern matches.

  • If a pattern matches nothing, try weakening it by dropping part of the pattern. Then tighten it incrementally to see how the matching evolves. (Online regex checking tools can be very helpful here.)

  • Use raw strings whenever possible for cleaner patterns, especially when a pattern includes a backslash.

  • When you have lots of long strings, consider using compiled patterns because they can be faster to match (see compile() in the re library).
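
As a minimal sketch of the last piece of advice, re.compile builds a pattern object once, and the object's methods can then be reused on many strings. The list of lines is made up for illustration.

```python
import re

# Compile once, then reuse the pattern object for every string
ssn_pattern = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

lines = [
    "My SSN is 382-34-3840.",
    "My phone is 382-123-3842.",
    "Hers is 382-34-3333.",
]
found = [ssn_pattern.findall(line) for line in lines]
print(found)   # [['382-34-3840'], [], ['382-34-3333']]
```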

While powerful, regular expressions are terrible at certain types of problems. Don’t use them to:

  • Parse hierarchical structures such as JSON or HTML; use a parser instead.

  • Search for complex properties, like palindromes and balanced parentheses.

  • Validate a complex feature, such as a valid email address.
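
To give a sense of the alternatives, the sketch below parses a small JSON string with Python's built-in json module and checks for balanced parentheses with a simple counter. Both the JSON string and the helper is_balanced are made up for this example.

```python
import json

def is_balanced(text):
    """Check that every ( has a matching ) that comes after it."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ) appeared before its (
                return False
    return depth == 0

# A parser handles the nesting that a regex cannot follow
record = json.loads('{"name": "cat", "tags": ["pet", {"legs": 4}]}')
legs = record["tags"][1]["legs"]

print(legs)                      # 4
print(is_balanced("(a(b)c)"))    # True
print(is_balanced("(a))("))      # False
```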

In the next section, we carry out an example text analysis. We’ll clean the data using regular expressions and string manipulation, convert the text into quantitative data, and analyze the text via these derived quantities.