13.3. Regular Expressions
Regular expressions are used to search for patterns in strings. (The pattern is called the regular expression, or regex for short.) To make this notion more concrete, think about the format of social security numbers (SSNs). If you were asked to describe it, you might say that SSNs consist of three digits, then a dash, two digits, another dash, then four digits. The set of SSNs is an example of a formal language whose strings can all be described by a regular expression. In our example, any SSN can be described by the following regular expression:
[0-9]{3}-[0-9]{2}-[0-9]{4}
Although we haven’t yet introduced the syntax and special characters associated with regular expressions, or even how to “parse” them, you can probably make sense of the pattern above. It closely follows the written description of an SSN that we just gave. While cryptic at first glance, the syntax of regular expressions is fortunately quite simple to learn; we introduce nearly all of the syntax in this section alone.
As we introduce the concepts, we tackle some of the examples described in an earlier section and show how to carry out the tasks with regular expressions. Almost all programming languages have a library to match patterns using regular expressions, making regular expressions useful regardless of the specific programming language. We use some of the common methods available in the Python built-in re module to accomplish the tasks from the examples. These methods are summarized in a table at the end of this section, where the basic usage and return value are briefly described. Since we only cover a few of the most commonly used methods, you may find it useful to consult the official documentation on the re module as well.
Core to the paradigm of a regular expression is the notion of searching in a string, one character (aka literal) at a time, for a pattern. We call this notion concatenation of literals.
13.3.1. Concatenation of Literals
Concatenation is best explained with a basic example. Suppose we are looking for the pattern “cat” in the string “The cad hid his coat. Scat!” It might not seem like it, but there is indeed one match of this pattern in the string. Here’s how to think about pattern matching when the pattern is a collection of literals, such as “cat”.
Begin with the first character in the string, and check whether it matches the first character in your pattern (that’s “c” in our simple example).
If there isn’t a match, then continue the search, moving left to right, one literal in the string at a time, until you find the first literal in your pattern (the “c”).
Once you find a match in the string of the pattern’s first literal, proceed to check the following literals. That is, check whether the “c” is followed by an “a”, and the “a” is followed by a “t”.
If the consecutive search literals in the pattern don’t match completely, then back up in the string to the first literal that follows your original single match, and start over, moving one literal at a time through the string in search of the first character in your pattern (the “c”).
Figure X contains a diagram of the idea behind this search through the string one character at a time. The pattern “cat” is found within the word “Scat” in positions 24-26 in the string. Once you get the hang of this process, you can move on to the richer set of patterns; they all follow from this basic paradigm.
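The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the re module is implemented; the helper name find_literal is hypothetical. Note that Python counts positions from 0, so the match reported at positions 24-26 (counting from 1) starts at index 23.

```python
import re

def find_literal(pattern, text):
    """Hypothetical helper: 0-based index of the first match of a
    purely literal pattern in text, or -1 if there is none."""
    for start in range(len(text) - len(pattern) + 1):
        matched = True
        # Check the pattern's literals one at a time, left to right.
        for offset, literal in enumerate(pattern):
            if text[start + offset] != literal:
                matched = False   # mismatch: back up and try the next start
                break
        if matched:
            return start
    return -1

text = "The cad hid his coat. Scat!"
start = find_literal("cat", text)
print(start, text[start:start + 3])    # 23 cat
print(re.search("cat", text).span())   # (23, 26)
```

The span from re.search is half-open: it includes index 23 and excludes index 26.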
Note
In the example above we observe that regular expressions can match patterns that appear anywhere in the input string. In Python, this behavior differs depending on the method used to match the regex—some methods only return a match if the regex appears at the start of the string; some methods return a match anywhere in the string.
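As a small illustration of this difference, the sketch below compares re.match, which only tries the start of the string, with re.search, which scans the whole string. The example string is the one used above.

```python
import re

text = "The cad hid his coat. Scat!"

at_start = re.match("cat", text)         # tries position 0 only
anywhere = re.search("cat", text)        # scans left to right
starts_with_the = re.match("The", text)  # succeeds: "The" is at position 0

print(at_start)                 # None
print(anywhere.group())         # cat
print(starts_with_the.span())   # (0, 3)
```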
Character Classes
At times, we want to bring flexibility into a pattern: the literal could be any digit or any letter. The character class (also known as a character set) lets us specify a collection of equivalent characters to match. This allows us to create more relaxed matches. To create a character class, wrap the set of desired characters in brackets [ ]. For example, the following regular expression matches three digits.
"[0123456789][0123456789][0123456789]"
Despite the pattern consisting of 36 characters, it only matches 3 literals because the entire segment “[0123456789]” is treated as one literal that can be 0 or 1 or … or 9. In fact, this is such a commonly used character class that there is a shorthand notation for the range of digits, “[0-9]”. Character classes allow us to create a more specific regex for SSNs.
'[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'
Two other ranges that are commonly used in character classes are [a-z] for lowercase and [A-Z] for uppercase letters. We can combine ranges with other equivalent characters and use partial ranges. For example, [a-cX-Z27] is equivalent to the character class [abcXYZ27].
Let’s return to our original pattern “cat” and modify it to include two character classes:
"c[oa][td]"
This pattern still matches three consecutive literals, but now these may be cat, cot, cad, or cod. The diagram in Figure X shows the idea behind the search through the same string, “The cad hid his coat. Scat!”
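To check this pattern against the example string, a short sketch with re.findall follows. It also confirms that the partial-range class and its spelled-out form behave the same on a made-up test string.

```python
import re

text = "The cad hid his coat. Scat!"

# "coat" does not match: after "co" the next literal is "a", not "t" or "d".
matches = re.findall("c[oa][td]", text)
print(matches)   # ['cad', 'cat']

# A class written with ranges and the same class written out in full.
sample = "abcdXYZW1279"   # made-up string for illustration
with_ranges = re.findall("[a-cX-Z27]", sample)
spelled_out = re.findall("[abcXYZ27]", sample)
print(with_ranges == spelled_out)   # True
```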
Negated Character Classes
A negated character class matches any character except those between the square brackets. To create a negated character class, place the caret symbol as the first character after the left square bracket. For example, [^0-9] matches any character except a digit.
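A brief sketch of a negated class in use, on a made-up string: it lists the non-digit characters, then removes them with re.sub.

```python
import re

text = "SSN: 382-34-3840"   # made-up input for illustration

non_digits = re.findall("[^0-9]", text)
digits_only = re.sub("[^0-9]", "", text)   # delete everything that is not a digit

print(non_digits)    # ['S', 'S', 'N', ':', ' ', '-', '-']
print(digits_only)   # 382343840
```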
Wildcard Character
When we really don’t care what the literal is, we can specify this with the period character (.). This matches any character except a newline.
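The following sketch shows the wildcard on a made-up string. The period stands for exactly one character, so “ct” does not match, and neither does a “c” and “t” separated by a newline.

```python
import re

text = "cat cot c-t ct c\nt"   # made-up input; note the newline near the end

matches = re.findall("c.t", text)
print(matches)   # ['cat', 'cot', 'c-t']
```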
Escaping Meta Characters
We have now seen several special characters, called meta characters: [ and ] denote a character class, ^ switches to a negated character class, . represents any character, and - denotes a range. But sometimes we want to create a pattern that contains one of these characters as a literal. When this happens, we must escape it with a backslash. Recall how we aimed to split the web log string at either a left square bracket, forward slash, or colon. Now we see that we used a character class to specify these, and since [ is a meta character, we used a backslash to escape it.
import re
pattern = r'[\[/:]'
re.split(pattern, log_entry)[1:4]
['26', 'Jan', '2004']
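Since log_entry is defined in an earlier section, here is a self-contained sketch of the same split. The log line below is an assumed stand-in in a common web-log format, not necessarily the entry used earlier. The second half shows why escaping matters for the period.

```python
import re

# Assumed sample entry for illustration only.
log_entry = ('192.0.2.1 - - [26/Jan/2004:10:47:58 -0800] '
             '"GET /index.html HTTP/1.1" 200 2585')

# Split at a left bracket, forward slash, or colon; \[ is the escaped bracket.
pieces = re.split(r'[\[/:]', log_entry)
print(pieces[1:4])   # ['26', 'Jan', '2004']

# An unescaped period is a wildcard; an escaped one is a literal period.
unescaped = re.findall(r"3.14", "3.14 3x14")
escaped = re.findall(r"3\.14", "3.14 3x14")
print(unescaped)   # ['3.14', '3x14']
print(escaped)     # ['3.14']
```

Writing patterns as raw strings (the r prefix) keeps Python from interpreting the backslash before the regex engine sees it.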
Next, we will show how quantifiers can help create a more compact and clear regular expression for SSNs.
13.3.2. Quantifiers
To create a regex to match SSNs, we wrote:
'[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'
This matches 3 digits, a dash, 2 more digits, a dash, and 4 more digits.
Quantifiers allow us to match multiple consecutive appearances of a literal. We specify the number of repetitions by placing the number in curly braces { }.
SSN_regex = '[0-9]{3}-[0-9]{2}-[0-9]{4}'
re.findall(SSN_regex, 'My SSN is 382-34-3840.')
['382-34-3840']
SSN_regex = '[0-9]{3}-[0-9]{2}-[0-9]{4}'
re.findall(SSN_regex, 'My SSN is 382-34-38420.')
['382-34-3842']
A quantifier always modifies the character or character class to its immediate left. The following table shows the complete syntax for quantifiers.
Quantifier | Meaning |
---|---|
{m,n} | Match the preceding character m to n times. |
{m} | Match the preceding character exactly m times. |
{m,} | Match the preceding character at least m times. |
{,n} | Match the preceding character at most n times. |
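A short sketch of each quantifier form on a made-up string. In each pattern the quantifier applies only to the “a” immediately to its left. The last line illustrates a pitfall in Python's re module: a space inside the braces means the braces are read as literal characters, not as a quantifier.

```python
import re

text = "b ba baa baaa"   # made-up input for illustration

exactly_two = re.findall("ba{2}", text)
one_to_two = re.findall("ba{1,2}", text)
at_least_two = re.findall("ba{2,}", text)
at_most_two = re.findall("ba{,2}", text)
with_space = re.findall("ba{1, 2}", text)   # braces treated literally

print(exactly_two)    # ['baa', 'baa']
print(one_to_two)     # ['ba', 'baa', 'baa']
print(at_least_two)   # ['baa', 'baaa']
print(at_most_two)    # ['b', 'ba', 'baa', 'baa']
print(with_space)     # []
```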
Shorthand Quantifiers
Some commonly used quantifiers have a shorthand:
Symbol | Quantifier | Meaning |
---|---|---|
* | {0,} | Match the preceding character 0 or more times |
+ | {1,} | Match the preceding character 1 or more times |
? | {0,1} | Match the preceding character 0 or 1 times |
We use the + character instead of {1,} in the following examples.
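To see that each shorthand behaves like its curly-brace form, the sketch below runs both versions on a made-up string and compares the results.

```python
import re

text = "ct cat caat caaat"   # made-up input for illustration

pairs = [("ca*t", "ca{0,}t"), ("ca+t", "ca{1,}t"), ("ca?t", "ca{0,1}t")]
results = {}
for shorthand, braces in pairs:
    results[shorthand] = re.findall(shorthand, text)
    # Each shorthand gives the same matches as its curly-brace form.
    assert results[shorthand] == re.findall(braces, text)
    print(shorthand, results[shorthand])
# ca*t ['ct', 'cat', 'caat', 'caaat']
# ca+t ['cat', 'caat', 'caaat']
# ca?t ['ct', 'cat']
```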
Quantifiers are greedy
Quantifiers will return the longest match possible. This sometimes results in surprising behavior. Since an SSN starts and ends with a digit, we might think the following shorter regex will be a simpler approach for finding SSNs. Can you figure out what went wrong in the matching?
SSN_regex = '[0-9].+[0-9]'
re.findall(SSN_regex, 'My SSN is 382-34-38420 and hers is 382-34-3333.')
['382-34-38420 and hers is 382-34-3333']
In many cases, using a more specific character class prevents these false “over” matches:
SSN_regex = r'[0-9\-]+[0-9]'
re.findall(SSN_regex, 'My SSN is 382-34-38420 and hers is 382-34-3333.')
['382-34-38420', '382-34-3333']
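The same effect shows up outside of SSNs. As a sketch on a made-up string, the greedy .+ runs from the first quotation mark to the last one, while a negated character class stops each match at the next quotation mark.

```python
import re

text = 'She said "hi" and then "bye".'   # made-up input for illustration

greedy = re.findall('".+"', text)        # .+ grabs as much as it can
specific = re.findall('"[^"]+"', text)   # [^"] cannot cross a quotation mark

print(greedy)     # ['"hi" and then "bye"']
print(specific)   # ['"hi"', '"bye"']
```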
Literal concatenation and quantifiers are two of the core concepts in regular expressions. Two more core concepts are alternation and grouping, which we introduce next.
13.3.3. Alternation and Grouping to Create Features
Character classes are useful when we consider single literals as equivalent matches. But at times we also want to match one group of literals or another, such as cad or coat. In this case the two possibilities consist of a different number of literals. In the food safety example in the chapter on data wrangling, also described earlier in this chapter, we created several indicators for different sorts of violations. These were based on the presence of one of a set of words in the violation description. We demonstrate the search for human-related contamination in the description below.
re.search("hand|nail|hair|glove",
"unclean hands or improper use of gloves")
<re.Match object; span=(8, 12), match='hand'>
re.search("hand|nail|hair|glove",
"Unsanitary employee garments hair or nails")
<re.Match object; span=(29, 33), match='hair'>
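To turn such searches into a feature, one option is to convert each match result to a boolean. The sketch below does this for the two descriptions above plus a third, made-up description that mentions none of the words. The variable names are illustrative.

```python
import re

descriptions = [
    "unclean hands or improper use of gloves",
    "Unsanitary employee garments hair or nails",
    "High risk vermin infestation",   # made-up description with no match
]
human_pattern = "hand|nail|hair|glove"

# re.search returns a match object (truthy) or None (falsy).
is_human = [bool(re.search(human_pattern, d)) for d in descriptions]
print(is_human)   # [True, True, False]
```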
Parentheses can be used to control the order of operations in regular expressions. Below is a silly regex that demonstrates this concept. The regex searches for patterns that begin with m, end with n, and have a positive even number of o’s, or of u’s, in between. We substitute each match with an X to show that it found the correct occurrences.
re.sub("m((uu)+|(oo)+)n", "X",
"the moon is not mon but is muuuun and not muuun nor muuoon" )
'the X is not mon but is X and not muuun nor muuoon'
Take a close look at this regex; the outermost parentheses enforce the constraint that the pattern begins with m and ends with n. Within these parentheses we are looking for the pattern on the left or the one on the right. The pattern on the left must consist of one or more double u’s, and the one on the right matches one or more double o’s. This is equivalent to the regex
re.sub("m(uu(uu)*|oo(oo)*)n", "X",
"the moon is not mon but is muuuun and not muuun nor muuoon" )
'the X is not mon but is X and not muuun nor muuoon'
We have just seen how parentheses can be used to specify the order of operations. Parentheses have another meaning: every set of parentheses specifies a regex group, which allows us to identify subpatterns in a string. When a pattern contains regex groups, re.findall returns a list of tuples that contain the subpattern contents. For example, recall the task of extracting from a web log the day, month and year. The following regex creates three regex groups, one for each of these.
pattern = r"\[([0-9]{2})/([a-zA-Z]{3})/([0-9]{4})"
re.findall(pattern, log_entry)
[('26', 'Jan', '2004')]
We have wrapped each component of the date in parentheses. Below is a more compact regular expression that uses two shorthand names for character classes, \d for digits and \w for letters, digits, and underscores.
pattern = r"\[(\d{2})/(\w{3})/(\d{4})"
re.findall(pattern, log_entry)
[('26', 'Jan', '2004')]
As promised, re.findall returns a list of tuples containing the individual components of the date of the web log.
We have introduced a lot of terminology in the subsections of this section, and in the next subsection, we bring it all together into a set of tables for easy reference.
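To make the effect of groups visible, the sketch below runs the pattern with and without parentheses. The log line is an assumed stand-in, since log_entry is defined in an earlier section.

```python
import re

# Assumed sample entry for illustration only.
log_entry = ('192.0.2.1 - - [26/Jan/2004:10:47:58 -0800] '
             '"GET /index.html HTTP/1.1" 200 2585')

with_groups = re.findall(r"\[(\d{2})/(\w{3})/(\d{4})", log_entry)
without_groups = re.findall(r"\[\d{2}/\w{3}/\d{4}", log_entry)

print(with_groups)      # [('26', 'Jan', '2004')]
print(without_groups)   # ['[26/Jan/2004']
```

Without groups, re.findall returns the whole matched text, including the left bracket; with groups, it returns only the captured pieces.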
13.3.4. Reference Tables
We conclude this section by collecting into a few tables the order of operations, meta characters, and shorthands for character classes. Additionally, we provide tables summarizing the handful of methods in the re Python library that we have used in this section. These tables are not meant to be an exhaustive collection, but with the concepts of literals, character classes, quantifiers, alternation, and grouping, and the specifics in these tables, you should be well equipped to wrangle text into usable data.
The four basic operations for regular expressions (concatenation, quantifying, alternation, and grouping) have an order of precedence, which we make explicit in the table below.
Operation | Order | Example | Matches |
---|---|---|---|
concatenation | 3 | | |
alternation | 4 | | |
quantifying | 2 | | |
grouping | 1 | c(at)? | |
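The precedence rules can be checked with a few small patterns. This sketch uses re.fullmatch, which this section does not otherwise cover; it succeeds only when the entire string matches the pattern. The example patterns are illustrative choices.

```python
import re

def matches(pattern, string):
    """True if the whole string matches the pattern."""
    return re.fullmatch(pattern, string) is not None

# Quantifying binds tighter than concatenation: ? applies only to the "t".
print(matches("cat?", "ca"))      # True
print(matches("cat?", "c"))       # False

# Grouping comes first: now ? applies to the whole group "at".
print(matches("c(at)?", "c"))     # True
print(matches("c(at)?", "ca"))    # False

# Alternation comes last: "cat|dog" means (cat) or (dog), not ca(t|d)og.
print(matches("cat|dog", "dog"))       # True
print(matches("cat|dog", "cadog"))     # False
print(matches("ca(t|d)og", "cadog"))   # True
```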
The following table provides a list of the meta characters introduced in this section, plus a few more. The column labeled “Doesn’t Match” is meant to provide insight into their usage.
Char | Description | Example | Matches | Doesn’t Match |
---|---|---|---|---|
. | Any character except \n | | abc | ab |
[ ] | Any character inside brackets | | car | jar |
[^ ] | Any character not inside brackets | | car | bar |
* | 0 or more of previous symbol, shorthand for {0,} | | bbark | dark |
+ | 1 or more of previous symbol, shorthand for {1,} | | bbpark | dark |
? | 0 or 1 of previous symbol, shorthand for {0,1} | | she | the |
{n} | Exactly n of previous symbol | | hellooo | hello |
\| | Pattern before or after bar | | we | e |
\ | Escape next character | | [hi] | hi |
^ | Beginning of line | | ark two | dark |
$ | End of line | | noahs ark | noahs arks |
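The anchors ^ and $ appear only in this table, so a brief sketch may help. The patterns ^ark and ark$ are illustrative choices, applied to the strings in the Matches and Doesn’t Match columns.

```python
import re

# ^ requires the match to start at the beginning of the line.
starts_match = re.findall("^ark", "ark two")
starts_miss = re.findall("^ark", "dark")

# $ requires the match to end at the end of the line.
ends_match = re.findall("ark$", "noahs ark")
ends_miss = re.findall("ark$", "noahs arks")

print(starts_match, starts_miss)   # ['ark'] []
print(ends_match, ends_miss)       # ['ark'] []
```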
Additionally, we provide a table of shorthands for some commonly used character sets. Notice that these do not require the use of [ ] to specify them.
Description | Bracket Form | Shorthand |
---|---|---|
Alphanumeric character | [a-zA-Z0-9_] | \w |
Not an alphanumeric character | [^a-zA-Z0-9_] | \W |
Digit | [0-9] | \d |
Not a digit | [^0-9] | \D |
Whitespace | [ \t\n\r\f\v] | \s |
Not whitespace | [^ \t\n\r\f\v] | \S |
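As a quick check of the standard Python shorthands, the sketch below compares \d, \D, \w, and \s with bracket forms on a made-up ASCII string. The bracket forms shown are the usual ASCII equivalents; on non-ASCII text, Python's \w, \d, and \s also match additional Unicode characters.

```python
import re

text = "Jan 26, 2004\t10:47"   # made-up ASCII input for illustration

# Each shorthand agrees with its bracket form on ASCII text.
assert re.findall(r"\d", text) == re.findall("[0-9]", text)
assert re.findall(r"\D", text) == re.findall("[^0-9]", text)
assert re.findall(r"\w", text) == re.findall("[a-zA-Z0-9_]", text)
assert re.findall(r"\s", text) == re.findall(r"[ \t\n\r\f\v]", text)

numbers = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)
print(numbers)   # ['26', '2004', '10', '47']
print(words)     # ['Jan', '26', '2004', '10', '47']
```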
We used the following methods in re in this section. The names of the methods are indicative of the functionality they perform: search for a pattern in a string; find all cases of a pattern in a string; substitute all occurrences of a pattern with a substring; and split a string into pieces at the pattern. Each requires a pattern and string to be specified (sub requires a replacement string as well), and some have additional arguments, which we do not document. The table below provides the format of the method usage and a description of the return value.
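Method | Return value |
---|---|
re.search(pattern, string) | truthy match object if the pattern is found, otherwise None |
re.findall(pattern, string) | list of all matches of pattern in string |
re.sub(pattern, replacement, string) | string where all occurrences of pattern are replaced by replacement |
re.split(pattern, string) | list of the pieces of string around each occurrence of pattern |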
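The four methods can be seen side by side in a short sketch that reuses the SSN regex from earlier in this section. The variable names are illustrative.

```python
import re

text = "My SSN is 382-34-3840 and hers is 382-34-3333."
ssn = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

first = re.search(ssn, text)                # match object for the first match
missing = re.search(ssn, "no digits here")  # None when nothing matches
found = re.findall(ssn, text)
masked = re.sub(ssn, "XXX-XX-XXXX", text)
pieces = re.split(ssn, text)

print(first.group())   # 382-34-3840
print(missing)         # None
print(found)           # ['382-34-3840', '382-34-3333']
print(masked)          # My SSN is XXX-XX-XXXX and hers is XXX-XX-XXXX.
print(pieces)          # ['My SSN is ', ' and hers is ', '.']
```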
Regex and pandas
As seen in the previous section, pandas Series objects have a .str property that supports string manipulation using Python string methods. Conveniently, the .str property also supports some functions from the re module. The table below shows the analogous functionality from the above table of the re methods. Each requires a pattern, and additional arguments are not documented. For the complete documentation on pandas string methods, see https://pandas.pydata.org/pandas-docs/stable/text.html
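Method | Return value |
---|---|
str.contains(pattern) | Series of booleans indicating whether the pattern is found |
str.findall(pattern) | Series of lists of all matches of pattern |
str.replace(pattern, replacement, regex=True) | Series with all matching occurrences of pattern replaced by replacement |
str.split(pattern) | Series of lists of strings around given pattern |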
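A minimal sketch of these pandas methods follows, assuming pandas is installed. The Series holds the two violation descriptions used earlier plus one made-up description. Passing regex=True to str.replace is explicit because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

descriptions = pd.Series([
    "unclean hands or improper use of gloves",
    "Unsanitary employee garments hair or nails",
    "floors need cleaning",   # made-up description with no match
])
pattern = "hand|nail|hair|glove"

has_match = descriptions.str.contains(pattern)
all_matches = descriptions.str.findall(pattern)
replaced = descriptions.str.replace(pattern, "X", regex=True)
pieces = descriptions.str.split(" or ")

print(has_match.tolist())     # [True, True, False]
print(all_matches.tolist())   # [['hand', 'glove'], ['hair', 'nail'], []]
print(replaced[0])            # unclean Xs or improper use of Xs
print(pieces[1])              # ['Unsanitary employee garments hair', 'nails']
```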
In the next section, we carry out a text analysis. We first clean the data using regular expressions and string manipulation, then we convert the text into quantitative data and analyze the text via these derived quantities.