13.3. Regular Expressions

Regular expressions are used to search for patterns in strings. (The pattern itself is called the regular expression, or regex for short.) To make this notion more concrete, think about the format of Social Security numbers (SSNs). If you were asked to describe it, you might say that an SSN consists of three digits, then a dash, two digits, another dash, then four digits. The set of SSNs is an example of a formal language, meaning a set of strings; here, the strings are exactly those described by a regular expression. In our example, any SSN can be described by the following regular expression:

[0-9]{3}-[0-9]{2}-[0-9]{4}

Although we haven’t yet introduced the syntax and special characters associated with regular expressions, or even how to “parse” them, you can probably make sense of the pattern above. It closely follows the written description of an SSN that we just gave. While cryptic at first glance, the syntax of regular expressions is fortunately quite simple to learn; we introduce nearly all of the syntax in this section alone.

As we introduce the concepts, we tackle some of the examples described in an earlier section, and show how to carry out the tasks with regular expressions. Almost all programming languages have a library to match patterns using regular expressions, making regular expressions useful regardless of the specific programming language. We use some of the common methods available in the Python built-in re module to accomplish the tasks from the examples. These methods are summarized in a table at the end of this section, where the basic usage and return value are briefly described. Since we only cover a few of the most commonly used methods, you may find it useful to consult the official documentation on the re module as well.

Core to the paradigm of a regular expression is the notion of searching in a string, one character (aka literal) at a time, for a pattern. We call this notion concatenation of literals.

13.3.1. Concatenation of Literals

Concatenation is best explained with a basic example. Suppose we are looking for the pattern “cat” in the string “The cad hid his coat. Scat!” It might not seem like it, but there is indeed one match of this pattern in the string. Here’s how to think about pattern matching when the pattern is a collection of literals, such as “cat”.

  • Begin with the first character in the string, and check whether it matches the first character in your pattern (that’s “c” in our simple example).

  • If there isn’t a match, then continue the search, moving left to right, one literal in the string at a time, until you find the first literal in your pattern (the “c”).

  • Once you find a match in the string of the pattern’s first literal, proceed to check the following literals. That is, check whether the “c” is followed by an “a”, and the “a” is followed by a “t”.

  • If the consecutive search literals in the pattern don’t match completely, then back up in the string to the first literal that follows your original single match, and start over, moving one literal at a time through the string in search of the first character in your pattern (the “c”).

Figure X contains a diagram of the idea behind this search through the string one character at a time. The pattern “cat” is found within the word “Scat” in positions 24-26 in the string. Once you get the hang of this process, you can move on to the richer set of patterns; they all follow from this basic paradigm.
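The steps above can be sketched in a few lines of Python. This is only an illustration of the left-to-right search, not how the re module is implemented; the function name find_literal is made up for this sketch. Note that Python counts positions from 0, so a start index of 23 corresponds to positions 24-26 when counting from 1.

```python
import re

def find_literal(pattern, text):
    """Return the 0-based index of the first match of a literal pattern, or -1."""
    for start in range(len(text) - len(pattern) + 1):
        # Compare each literal in the pattern with consecutive characters.
        for offset, literal in enumerate(pattern):
            if text[start + offset] != literal:
                break  # mismatch: back up and try the next starting position
        else:
            return start  # every literal in the pattern matched
    return -1

text = "The cad hid his coat. Scat!"
start = find_literal("cat", text)
print(start, text[start:start + 3])

# The re module reports the same starting position.
print(re.search("cat", text).start())
```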

Note

In the example above we observe that regular expressions can match patterns that appear anywhere in the input string. In Python, this behavior differs depending on the method used to match the regex—some methods only return a match if the regex appears at the start of the string; some methods return a match anywhere in the string.
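As a small illustration of this difference, the sketch below compares re.search, which looks anywhere in the string, with re.match, which only succeeds when the pattern matches at the start of the string.

```python
import re

text = "The cad hid his coat. Scat!"

# re.search scans the whole string for the first match.
anywhere = re.search("cat", text)

# re.match succeeds only if the pattern matches at the start of the string.
at_start = re.match("cat", text)
the_at_start = re.match("The", text)

print(anywhere)       # a match object
print(at_start)       # None, since the string does not begin with "cat"
print(the_at_start)   # a match object
```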

Character Classes
At times, we want to bring flexibility into a pattern: the literal could be any digit or any letter. The character class (also known as a character set) lets us specify a collection of equivalent characters to match. This allows us to create more relaxed matches. To create a character class, wrap the set of desired characters in brackets [ ]. For example, the following regular expression matches three digits.

"[0123456789][0123456789][0123456789]"

Despite the pattern consisting of 36 characters, it only matches 3 literals because the entire segment “[0123456789]” is treated as one literal that can be 0 or 1 or … or 9. In fact, this is such a commonly used character class that there is a shorthand notation for the range of digits, “[0-9]”. Character classes allow us to create a more specific regex for SSNs.

'[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'

Two other ranges that are commonly used in character classes are [a-z] for lowercase and [A-Z] for uppercase letters. We can combine ranges with other equivalent characters and use partial ranges. For example, [a-cX-Z27] is equivalent to the character class [abcXYZ27].
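A quick way to convince yourself of this equivalence is to run both character classes over the same string. The test string below is made up for illustration.

```python
import re

text = "abcdXYZ0279"

# A class written with partial ranges, and the same class spelled out.
ranged = re.findall("[a-cX-Z27]", text)
spelled_out = re.findall("[abcXYZ27]", text)

print(ranged)
print(ranged == spelled_out)
```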

Let’s return to our original pattern “cat” and modify it to include two character classes:

"c[oa][td]"

This pattern still matches three consecutive literals, but now these may be cat, cot, cad, or cod. The diagram in Figure X shows the idea behind the search through the same string, “The cad hid his coat. Scat!”
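We can check this search with re.findall. In this string, “coat” does not match because the “o” is followed by an “a”, which is not in the second character class.

```python
import re

text = "The cad hid his coat. Scat!"

# Each character class stands for exactly one literal in the match.
matches = re.findall("c[oa][td]", text)
print(matches)  # ['cad', 'cat']
```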

Negated Character Classes
A negated character class matches any character except those between the square brackets. To create a negated character class, place the caret symbol as the first character after the left square bracket. For example, [^0-9] matches any character except a digit.
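As a brief sketch, the negated class [^0-9] can be used to find, or to strip out, everything in a string that is not a digit.

```python
import re

ssn = "382-34-3840"

# Every character that is not a digit.
non_digits = re.findall("[^0-9]", ssn)

# Replace each non-digit with the empty string, keeping only the digits.
digits_only = re.sub("[^0-9]", "", ssn)

print(non_digits)
print(digits_only)
```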

Wildcard Character
When we really don’t care what the literal is, we can specify this with the period character (.). This matches any character except a newline.
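The following sketch, with a made-up test string, shows the wildcard standing in for a letter and a dash, but not for a newline.

```python
import re

# The last "word" has a newline between the c and the t.
text = "cat cot c-t c\nt"

matches = re.findall("c.t", text)
print(matches)  # ['cat', 'cot', 'c-t']
```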

Escaping Meta Characters
We have now seen several special characters, called meta characters: [ and ] denote a character class, ^ switches to a negated character class, . represents any character, and - denotes a range. But sometimes we want to create a pattern that contains one of these characters as a literal. When this happens, we must escape it with a backslash. Recall how we aimed to split the web log string at either a left square bracket, forward slash, or colon. Now we see that we used a character class to specify these, and since [ is a meta character, we used a backslash to escape it.

import re

pattern = r'[\[/:]' 
re.split(pattern, log_entry)[1:4]
['26', 'Jan', '2004']
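To see why escaping matters, compare an unescaped period with an escaped one. The string below is made up for illustration.

```python
import re

text = "pi is 3.14, not 3x14"

# Unescaped, the period is a wildcard and matches any character.
unescaped = re.findall("3.14", text)

# Escaped, the period matches only a literal period.
escaped = re.findall(r"3\.14", text)

print(unescaped)  # ['3.14', '3x14']
print(escaped)    # ['3.14']
```

The r prefix makes the pattern a raw string, so Python passes the backslash through to the re module unchanged.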

Next, we will show how quantifiers can help create a more compact and clear regular expression for SSNs.

13.3.2. Quantifiers

To create a regex to match SSNs, we wrote:

'[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'

This matches 3 digits, a dash, 2 more digits, a dash, and 4 more digits.

Quantifiers allow us to match multiple consecutive appearances of a literal. We specify the number of repetitions by placing the number in curly braces { }.

SSN_regex = '[0-9]{3}-[0-9]{2}-[0-9]{4}'
re.findall(SSN_regex, 'My SSN is 382-34-3840.')
['382-34-3840']
SSN_regex = '[0-9]{3}-[0-9]{2}-[0-9]{4}'
re.findall(SSN_regex, 'My SSN is 382-34-38420.')
['382-34-3842']

A quantifier always modifies the character or character class to its immediate left. The following table shows the complete syntax for quantifiers.

Quantifier   Meaning
{m,n}        Match the preceding character m to n times.
{m}          Match the preceding character exactly m times.
{m,}         Match the preceding character at least m times.
{,n}         Match the preceding character at most n times.
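The sketch below tries each form of quantifier on a few made-up strings of a’s. It uses re.fullmatch, which is not otherwise covered in this section; it succeeds only when the entire string matches the pattern. Note that the braces contain no spaces.

```python
import re

candidates = ["", "a", "aa", "aaa", "aaaa"]
patterns = ["a{2}", "a{2,3}", "a{2,}", "a{,2}"]

results = {}
for pattern in patterns:
    # Keep the candidates that the pattern matches in their entirety.
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])
```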

Shorthand Quantifiers
Some commonly used quantifiers have a shorthand:

Symbol   Quantifier   Meaning
*        {0,}         Match the preceding character 0 or more times
+        {1,}         Match the preceding character 1 or more times
?        {0,1}        Match the preceding character 0 or 1 times

We use these shorthands rather than the curly-brace forms in the following examples; for instance, we write + instead of {1,}.
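Here is a small sketch of the three shorthands applied to the same made-up list of words. As before, re.fullmatch requires the whole word to match.

```python
import re

words = ["color", "colour", "colouur"]

# ? allows zero or one u; + requires at least one u; * allows any number of u's.
zero_or_one = [w for w in words if re.fullmatch("colou?r", w)]
one_or_more = [w for w in words if re.fullmatch("colou+r", w)]
zero_or_more = [w for w in words if re.fullmatch("colou*r", w)]

print(zero_or_one)
print(one_or_more)
print(zero_or_more)
```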

Quantifiers are greedy
Quantifiers will return the longest match possible. This sometimes results in surprising behavior. Since an SSN starts and ends with a digit, we might think the following shorter regex will be a simpler approach for finding SSNs. Can you figure out what went wrong in the matching?

SSN_regex = '[0-9].+[0-9]'
re.findall(SSN_regex, 'My SSN is 382-34-38420 and hers is 382-34-3333.')
['382-34-38420 and hers is 382-34-3333']

In many cases, using a more specific character class prevents these false “over” matches:

SSN_regex = r'[0-9\-]+[0-9]'
re.findall(SSN_regex, 'My SSN is 382-34-38420 and hers is 382-34-3333.')
['382-34-38420', '382-34-3333']
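The same idea applies outside of SSNs. In the made-up example below, a greedy wildcard runs from the first quotation mark to the last, while a negated character class stops at the next quotation mark.

```python
import re

text = 'say "hi" and "bye"'

# Greedy wildcard: the match stretches to the last quotation mark.
greedy = re.findall('".+"', text)

# More specific class: any character except a quotation mark.
specific = re.findall('"[^"]+"', text)

print(greedy)
print(specific)
```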

Literal concatenation and quantifiers are two of the core concepts in regular expressions. Two more core concepts are alternation and grouping, which we introduce next.

13.3.3. Alternation and Grouping to Create Features

Character classes are useful when we want to treat any one of several single literals as an equivalent match. But at times we also want to match one group of literals or another, such as cad or coat. In this case, the two possibilities consist of a different number of literals. In the food safety example from the chapter on data wrangling, also described earlier in this chapter, we created several indicators for different sorts of violations. These were based on the presence of one of a set of words in the violation description. We demonstrate the search for human-related contamination in the description below.

re.search("hand|nail|hair|glove", 
          "unclean hands or improper use of gloves")
<re.Match object; span=(8, 12), match='hand'>
re.search("hand|nail|hair|glove", 
          "Unsanitary employee garments hair or nails")
<re.Match object; span=(29, 33), match='hair'>
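To turn such a search into a feature, we can record whether the pattern is found in each description. The sketch below uses a short list of descriptions; the third one is made up for illustration, and the variable names are our own.

```python
import re

# Example violation descriptions (the last is hypothetical).
descriptions = [
    "unclean hands or improper use of gloves",
    "Unsanitary employee garments hair or nails",
    "Inadequately cleaned or sanitized food contact surfaces",
]

pattern = "hand|nail|hair|glove"

# re.search returns None when no alternative is found.
is_human = [re.search(pattern, d) is not None for d in descriptions]
print(is_human)
```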

Parentheses can be used to control the order of operations in regular expressions. Below is a silly regex that demonstrates this concept. The regex searches for patterns that begin with m, end with n, and have an even number of o’s or u’s in between. We substitute each match with an X to show that it found the correct occurrences.

re.sub("m((uu)+|(oo)+)n", "X", 
       "the moon is not mon but is muuuun and not muuun nor muuoon" )
'the X is not mon but is X and not muuun nor muuoon'

Take a close look at this regex. The outermost parentheses enforce the constraint that the pattern begins with m and ends with n. Within these parentheses, we look for either the pattern to the left of the bar or the one to the right. The pattern on the left must consist of one or more double u’s, and the one on the right matches one or more double o’s. This is equivalent to the regex

re.sub("m(uu(uu)*|oo(oo)*)n", "X", 
       "the moon is not mon but is muuuun and not muuun nor muuoon" )
'the X is not mon but is X and not muuun nor muuoon'
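One way to check that two regexes are interchangeable is to try both on a list of test words. The words below are made up, and agreement on a handful of cases is evidence rather than proof of equivalence.

```python
import re

words = ["mn", "mon", "moon", "muun", "muuun", "muuuun", "muuoon", "moooon"]

# Keep the words that each regex matches in their entirety.
first = [w for w in words if re.fullmatch("m((uu)+|(oo)+)n", w)]
second = [w for w in words if re.fullmatch("m(uu(uu)*|oo(oo)*)n", w)]

print(first)
print(first == second)
```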

We have just seen how parentheses can be used to specify the order of operations. Parentheses have another meaning: every set of parentheses specifies a regex group, which allows us to identify subpatterns in a string. When a pattern contains regex groups, re.findall returns a list of tuples that contain the subpattern contents. For example, recall the task of extracting from a web log the day, month and year. The following regex creates three regex groups, one for each of these.

pattern = r"\[([0-9]{2})/([a-zA-Z]{3})/([0-9]{4})"
re.findall(pattern, log_entry)
[('26', 'Jan', '2004')]

We have wrapped each part of the date in parentheses. Below is a more compact regular expression that uses two shorthand names for character classes: \d for digits and \w for letters, digits, and the underscore.

pattern = r"\[(\d{2})/(\w{3})/(\d{4})"
re.findall(pattern, log_entry)
[('26', 'Jan', '2004')]

As promised, re.findall returns a list of tuples containing the individual components of the date in the web log.
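Groups are also available from the match object that re.search returns. The sketch below uses a short, made-up fragment that resembles the timestamp in the web log; the methods group and groups belong to Python match objects.

```python
import re

# Hypothetical fragment resembling the timestamp in a web log entry.
fragment = "[26/Jan/2004:10:47:58 -0800]"
pattern = r"\[(\d{2})/(\w{3})/(\d{4})"

match = re.search(pattern, fragment)

whole = match.group(0)               # the entire matched text
day, month, year = match.groups()    # one string per regex group

print(whole)
print(day, month, year)
```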

We have introduced a lot of terminology in this section; in the next subsection, we bring it all together into a set of tables for easy reference.

13.3.4. Reference Tables

We conclude this section by collecting the order of operations, meta characters, and shorthands for character classes into a few tables. Additionally, we provide tables summarizing the handful of methods in the Python re library that we have used in this section. These tables are not meant to be an exhaustive collection, but with the concepts of literals, character classes, quantifiers, alternation, and grouping, and the specifics in these tables, you should be well equipped to wrangle text into usable data.

The four basic operations for regular expressions (concatenation, quantifying, alternation, and grouping) have an order of precedence, which we make explicit in the table below. Operations with a lower order number are applied first.

Table 13.1 Order of Operations

Operation       Order   Example     Matches
concatenation   3       cat         cat
alternation     4       cat|mouse   cat and mouse
quantifying     2       cat?        ca and cat
grouping        1       c(at)?      c and cat
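The rows of this table can be checked with a short sketch. The helper accepted is a made-up name, and the last pattern is an extra example showing how parentheses limit the reach of the alternation bar.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["c", "ca", "cat", "mouse", "catouse"]

quantified = accepted("cat?", candidates)           # ? applies to t only
grouped = accepted("c(at)?", candidates)            # ? applies to the group at
alternated = accepted("cat|mouse", candidates)      # | splits the whole pattern
limited = accepted("ca(t|m)ouse", candidates)       # | confined to the group

print(quantified, grouped, alternated, limited)
```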

The following table provides a list of the meta characters introduced in this section, plus a few more. The column labeled “Doesn’t Match” is meant to provide insight into their usage.

Table 13.2 Meta characters

Char  Description                                       Example   Matches       Doesn’t Match
.     Any character except \n                           ...       abc           ab, abcd
[ ]   Any character inside brackets                     [cb.]ar   car, .ar      jar
[^ ]  Any character not inside brackets                 [^b]ar    car, par      bar, ar
*     0 or more of previous symbol, shorthand for {0,}  [pb]*ark  bbark, ark    dark
+     1 or more of previous symbol, shorthand for {1,}  [pb]+ark  bbpark, bark  dark, ark
?     0 or 1 of previous symbol, shorthand for {0,1}    s?he      she, he       the
{n}   Exactly n of previous symbol                      hello{3}  hellooo       hello
|     Pattern before or after bar                       we|[ui]s  we, us, is    e, s
\     Escape next character                             \[hi\]    [hi]          hi
^     Beginning of line                                 ^ark      ark two       dark
$     End of line                                       ark$      noahs ark     noahs arks
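The anchors ^ and $ were not used earlier in this section, so here is a brief sketch of the last two rows of the table using re.search.

```python
import re

# ^ anchors the pattern to the beginning of the line.
starts_with_ark = re.search("^ark", "ark two")
inside_dark = re.search("^ark", "dark")

# $ anchors the pattern to the end of the line.
ends_with_ark = re.search("ark$", "noahs ark")
ends_with_arks = re.search("ark$", "noahs arks")

print(starts_with_ark, inside_dark)
print(ends_with_ark, ends_with_arks)
```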

Additionally, we provide a table of shorthands for some commonly used character sets. Notice that these do not require the use of [ ] to specify them.

Table 13.3 Character Class Shorthands

Description                     Bracket Form      Shorthand
Alphanumeric character          [a-zA-Z0-9_]      \w
Not an alphanumeric character   [^a-zA-Z0-9_]     \W
Digit                           [0-9]             \d
Not a digit                     [^0-9]            \D
Whitespace                      [ \t\n\r\f\v]     \s
Not whitespace                  [^ \t\n\r\f\v]    \S
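The sketch below applies a few of these shorthands to a made-up string that contains a space, a tab, and an underscore.

```python
import re

text = "Room 12B\tfloor_3"

digits = re.findall(r"\d", text)      # individual digits
words = re.findall(r"\w+", text)      # runs of letters, digits, and underscores
pieces = re.split(r"\s", text)        # split at each whitespace character
non_space = re.findall(r"\S+", text)  # runs of non-whitespace characters

print(digits)
print(words)
print(pieces)
print(non_space)
```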

We used the following methods in re in this section. The names of the methods indicate what they do: search for a pattern in a string; find all cases of a pattern in a string; substitute all occurrences of a pattern with a substring; and split a string into pieces at the pattern. Each requires a pattern and a string (sub requires a replacement string as well), and some have additional arguments, which we do not document. The table below provides the format of the method usage and a description of the return value.

Table 13.4 Regular Expression Methods

Method                                 Return value
re.search(pattern, string)             truthy match object if the pattern is found, otherwise None
re.findall(pattern, string)            list of all matches of pattern in string
re.sub(pattern, replacement, string)   string where all occurrences of pattern are replaced by replacement in the string
re.split(pattern, string)              list of the pieces of string around the occurrences of pattern
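As a compact recap, the sketch below calls each of the four methods on one made-up string that resembles the timestamp from the web log.

```python
import re

text = "26/Jan/2004:10:47:58"

first_word = re.search("[A-Za-z]+", text).group()  # text of the first match
numbers = re.findall("[0-9]+", text)               # every run of digits
spaced = re.sub("[/:]", " ", text)                 # replace each / or : with a space
pieces = re.split("[/:]", text)                    # split at each / or :

print(first_word)
print(numbers)
print(spaced)
print(pieces)
```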

Regex and pandas
As seen in the previous section, pandas Series objects have a .str property that supports string manipulation using Python string methods. Conveniently, the .str property also supports some functions from the re module. The table below lists the pandas counterparts of the re methods in the previous table. Each requires a pattern; we do not document the additional arguments. For the complete documentation on pandas string methods, see https://pandas.pydata.org/pandas-docs/stable/text.html

Table 13.5 Regular Expressions in Pandas

Method                              Return value
str.contains(pattern)               Series of booleans indicating whether the pattern is found
str.findall(pattern)                Series of lists of all matches of pattern
str.replace(pattern, replacement)   Series with all matching occurrences of pattern replaced by replacement (in pandas 2.0 and later, pass regex=True to treat pattern as a regex)
str.split(pattern)                  Series of lists of strings around given pattern

In the next section, we carry out a text analysis. We first clean the data using regular expressions and string manipulation, then we convert the text into quantitative data, and analyze the text via these derived quantities.