13.5. Summary

This chapter introduced techniques for working with text to clean and analyze data. Pattern matching with regular expressions is a powerful tool, and once you know the basic building blocks you can start writing your own patterns quickly. Regular expressions are, however, somewhat notorious for being difficult to read and debug, so we close with some advice to help.

  • Develop your regular expression on simple test strings to see what the pattern matches.

  • If a pattern matches nothing, try weakening it by dropping part of the pattern. Then tighten it incrementally to see how the matching evolves. (Online regex checkers can be very helpful here.)

  • Use raw strings whenever possible for cleaner patterns, especially when a pattern includes a backslash.

  • When you match the same pattern against many strings or against long strings, consider compiling the pattern, because a compiled pattern can be faster to match (see re.compile in the re library).
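
As a minimal sketch of this advice, the snippet below tries a strict pattern and a weakened version of it on a few simple test strings. The test strings, the two patterns, and the helper first_match are made up for illustration; only re.compile and the search method come from Python's re library.

```python
import re

# Hypothetical test strings, made up for this illustration.
tests = ["2021-03-14", "call 555-1234", "no digits here"]

# Raw strings (r"...") pass the backslash in \d straight to the regex engine.
# Compiling once lets us reuse the same pattern object on every string.
strict = re.compile(r"\d{4}-\d{2}-\d{2}")  # four digits, two digits, two digits
loose = re.compile(r"\d+-\d+")             # weakened: any digits, a dash, any digits


def first_match(pattern, s):
    """Return the first substring of s that pattern matches, or None."""
    m = pattern.search(s)
    return m.group() if m else None


# Compare what each version of the pattern matches on each test string.
for s in tests:
    print(repr(s), "| strict:", first_match(strict, s), "| loose:", first_match(loose, s))
```

The strict pattern matches only the first test string, while the weakened pattern also picks up part of the second. Seeing the two side by side shows which piece of the pattern is excluding a string.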

While powerful, regular expressions are terrible at certain types of problems. Don’t use them to:

  • Parse hierarchical structures such as JSON or HTML; use a parser instead.

  • Search for complex properties, like palindromes and balanced parentheses.

  • Validate a complex feature, such as whether an email address is valid.
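
For the first two cases, a few lines of ordinary code are often enough. The sketch below parses a small JSON string with Python's json library and checks for balanced parentheses with a counter. The sample record and the function is_balanced are hypothetical, written only to illustrate the idea.

```python
import json

# Hypothetical nested record, made up for this illustration.
text = '{"user": {"name": "Ada", "tags": ["a", "b"]}}'

# The parser handles the nesting, so we can index into the result directly.
record = json.loads(text)
name = record["user"]["name"]


def is_balanced(s):
    """Return True if every '(' in s is closed by a later ')'."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its matching '('
                return False
    return depth == 0


print(name)
print(is_balanced("(a(b)c)"), is_balanced("(()"))
```

The counter tracks how deeply nested we are at each character, which is the kind of bookkeeping that regular expressions handle poorly.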

Despite these warnings, regular expressions are a handy tool in data science, and we hope you have fun with them.