9. Wrangling Dataframes

We often need to perform preparatory work on our data before we can begin our analysis. The amount of preparation can vary widely, but there are a few basics for moving from raw data to data ready for analysis. The Wrangling Files chapter addressed the initial steps of creating a data frame from a plain text source. Next, we assess quality. We perform validity checks on individual data values and features. In addition to checking the quality of the data, we learn whether or not the data need to be transformed and reshaped to get ready for analysis. Quality checking (and fixing) and transformation are often cyclical: the quality checks point us toward transformations we need to make, and we check the transformed features to confirm that our data are ready for analysis or need further cleaning and transforming.
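The cycle of checking and transforming can be sketched with a small pandas example. The table, column names, and the sentinel value below are all hypothetical, chosen only to illustrate the pattern: a validity check surfaces a problem, a transformation fixes it, and we re-check the result.

```python
import pandas as pd

# A hypothetical raw table (illustrative values only):
# -1 looks like a sentinel for a missing age, and the
# state codes are inconsistently capitalized.
df = pd.DataFrame({
    "age": [34, -1, 52, 29],
    "state": ["CA", "CA", "ca", "NY"],
})

# Quality check: flag values outside a plausible range.
invalid_age = df["age"] < 0
print(invalid_age.sum())  # number of suspect ages -> 1

# Transformations prompted by the checks: treat the sentinel
# as missing, and normalize the categorical feature.
df["age"] = df["age"].mask(invalid_age)
df["state"] = df["state"].str.upper()

# Re-check the transformed features before moving on.
print(df["age"].isna().sum())      # -> 1
print(sorted(df["state"].unique()))  # -> ['CA', 'NY']
```

Note how the final checks close the loop: if they failed, we would return to cleaning rather than proceed to analysis.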

Depending on the data source, we often have different expectations for quality. Some datasets require extensive wrangling to get them into an analyzable form, and other datasets arrive clean and we can quickly launch into modeling. Below are some examples of data sources and how much wrangling we might expect to do.

  • Data from a scientific experiment or study are typically clean, well documented, and simply structured. These data are organized to be broadly shared so that others can build on and reproduce the findings. They are typically ready for analysis after little to no wrangling.

  • Data from government surveys often come with very detailed codebooks and metadata describing how the data are collected and formatted, and these datasets are also typically ready for exploration and analysis.

  • Administrative data can be clean, but without inside knowledge of the source we often need to check their quality extensively. Also, since we are typically using these data for a purpose other than the reason they were collected, we usually need to transform features and combine data tables.

  • Informally collected data, such as data scraped from the Web, can be quite messy and tends to come with little documentation. For example, texts, tweets, blogs, Wikipedia tables, etc. usually require formatting and cleaning to transform them into quantitative information that is ready for analysis.

In this chapter, we break down data wrangling into the following stages: assess data quality; transform features; and reshape the data by modifying its structure and granularity. An important step in assessing the quality of the data is to consider its scope. Data scope was covered in the Questions and Data Scope chapter, and we refer you to that chapter for a fuller treatment of the topic.
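As a preview of the reshaping stage, the sketch below changes a table's structure and granularity with pandas. The city and year columns are hypothetical, invented purely for illustration; the point is that the same measurements can be laid out one-column-per-year (wide) or one-row-per-observation (long).

```python
import pandas as pd

# A hypothetical wide table: one column per year (illustrative data).
wide = pd.DataFrame({
    "city": ["Berkeley", "Oakland"],
    "2020": [121, 425],
    "2021": [117, 440],
})

# Reshape to long form: one row per (city, year) observation.
# Granularity changes from "one row per city" to
# "one row per city-year pair".
long = wide.melt(id_vars="city", var_name="year", value_name="count")
print(long.shape)  # -> (4, 3)
```

Later chapters treat reshaping operations like this in detail; here it simply shows what "modifying structure and granularity" means in practice.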

To clean and prepare data, we also rely on visualizations and exploratory data analysis. In this chapter, however, we’ll focus on data wrangling and cover the other related topics in more detail in the Exploratory Data Analysis and Data Visualization chapters.

We begin by introducing data wrangling concepts through an example in the next section.