8.6. Summary

Data wrangling is an essential part of data analysis. Without it, we risk overlooking problems in data that can have major consequences for future analysis. This chapter covered an important first step in data wrangling: reading data from a plain text source file into a pandas DataFrame. We introduced different types of file formats and encodings, and we wrote code that can read data from these formats. We checked the size of source files and considered alternative tools for working with large datasets.
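As a reminder of what these checks can look like, here is a minimal sketch that uses only the Python standard library. The function name `inspect_text_file`, its parameters, and the demo file contents are made up for illustration and are not the code from earlier in the chapter. The sketch reports a file's size, tests whether a sample of it decodes under an assumed encoding, and asks `csv.Sniffer` to guess the delimiter.

```python
import codecs
import csv
import tempfile
from pathlib import Path


def inspect_text_file(path, encoding="utf-8", sample_bytes=4096):
    """Return the file size plus a guess at whether `encoding` and a delimiter fit.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    size_bytes = path.stat().st_size

    # Read only a small sample so the check stays cheap for large files.
    with open(path, "rb") as f:
        raw = f.read(sample_bytes)

    # An incremental decoder tolerates a multibyte character cut off at the
    # end of the sample, which a plain bytes.decode() would reject.
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        sample = decoder.decode(raw, final=False)
        decodes = True
    except UnicodeDecodeError:
        sample, decodes = "", False

    # csv.Sniffer guesses the delimiter from the sample; it raises csv.Error
    # when it cannot decide, so we treat that case as "unknown" (None).
    delimiter = None
    if sample:
        try:
            delimiter = csv.Sniffer().sniff(sample, delimiters=",\t;|").delimiter
        except csv.Error:
            pass

    return {"size_bytes": size_bytes, "decodes": decodes, "delimiter": delimiter}


# Demo on a small temporary file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_bytes("name,age\nAda,36\nGrace,45\n".encode("utf-8"))
    report = inspect_text_file(demo_path)

print(report)
```

A successful decode of a sample does not prove the encoding is right, and the sniffer's answer is only a guess, so it is still worth looking at the first few lines of the file before reading it into a data frame.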

We also introduced command-line tools as an alternative to Python for checking the format, encoding, and size of a file. These CLI tools are especially handy for filesystem-oriented tasks because of their simple syntax. In this chapter, we’ve only scratched the surface of what CLI tools can do. The shell is capable of sophisticated data processing and is well worth learning.

Understanding the shape and granularity of a table gives us insight into what a row in a data table represents. This helps us determine whether the granularity is mixed, whether aggregation is needed, and whether weights are required. After looking at the granularity of your dataset, you should have answers to the following questions:

  • What does a record represent?

  • Do all records in a table have the same granularity? Sometimes a table contains additional summary rows that have a different granularity.

  • If the data were aggregated, how was the aggregation performed? Summing and averaging are common types of aggregation.

  • What kinds of aggregations might we perform on the data?
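The questions above can be explored with a few lines of pandas. The following is a minimal sketch on a made-up toy table; the column names `visit_id`, `age_group`, and `weight`, the values, and the helper `is_unique_key` are all hypothetical and are not taken from the DAWN data. It checks which columns uniquely identify a row, which tells us what a record represents, and then aggregates to a coarser granularity.

```python
import pandas as pd

# Hypothetical toy table: one row per ER visit (not the real DAWN data).
visits = pd.DataFrame({
    "visit_id": [1, 2, 3, 4, 5],
    "age_group": ["18-24", "18-24", "25-34", "25-34", "25-34"],
    "weight": [2.0, 1.5, 1.0, 3.0, 0.5],
})


def is_unique_key(df, cols):
    """Return True if the columns in `cols` take a distinct value in every row."""
    return not df.duplicated(subset=cols).any()


# Each visit_id appears once, so a record represents a single visit.
visit_level = is_unique_key(visits, ["visit_id"])

# age_group repeats, so the table is finer grained than one row per age group.
group_level = is_unique_key(visits, ["age_group"])

# Aggregating to the age-group level: count the rows and sum the weights.
by_group = (
    visits.groupby("age_group")
    .agg(n_visits=("visit_id", "count"), weighted_visits=("weight", "sum"))
    .reset_index()
)

print(visit_level, group_level)
print(by_group)
```

If no sensible set of columns forms a unique key, that can be a hint that the table mixes granularities, for example by including summary rows alongside individual records.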

Knowing your table’s granularity is a first step in cleaning your data, and it informs how you analyze it. For example, we saw that the granularity of the DAWN survey is an ER visit, which naturally leads us to compare patient demographics with those of the US as a whole.

The wrangling techniques in this chapter help us bring data from a source file into a data frame and understand its structure. Once we have a data frame, further wrangling is needed to assess quality and prepare the data for analysis. We’ll cover this in the next chapter.