10. Exploratory Data Analysis

More than 50 years ago, John Tukey avidly promoted an alternative type of data analysis that broke from the formal world of confidence intervals, hypothesis tests, and modeling, and today Tukey’s Exploratory Data Analysis (EDA) is widely practiced. Tukey describes EDA as a philosophical approach to working with data:

Exploratory data analysis is actively incisive, rather than passively descriptive, with real emphasis on the discovery of the unexpected.

As a data scientist, you will want to use EDA in every stage of the data lifecycle: checking the quality of your data, preparing for formal modeling, and confirming that your model is reasonable. Indeed, the work described in Chapter 9 to clean and transform the data relied heavily on EDA to guide our quality checks and transformations.

In EDA, we enter a process of discovery, continually asking questions and diving into uncharted territory to explore ideas. We use plots to uncover features of the data, examine distributions of values, and reveal relationships that cannot be detected from simple numerical summaries. This exploration involves transforming, visualizing, and summarizing data to build and confirm our understanding, identify and address potential issues with the data, and inform subsequent analysis.
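To see why simple numerical summaries can hide features of the data, here is a minimal sketch using only the Python standard library. The two samples are made up for illustration and are constructed to share the same mean and standard deviation; the helper names `summary` and `text_histogram` are hypothetical, and the text histogram is only a rough stand-in for a real plot.

```python
import math
import statistics
from collections import Counter

# Two made-up samples, built to have the same mean (0) and the same
# population variance (1.2), but very different shapes.
unimodal = [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2]
c = math.sqrt(1.2)
bimodal = [-c] * 5 + [c] * 5


def summary(values):
    """Return (mean, population standard deviation), rounded for display."""
    return (round(statistics.mean(values), 6),
            round(statistics.pstdev(values), 6))


def text_histogram(values, lo=-2.5, hi=2.5, bins=5):
    """Draw a crude histogram with one row of '#' marks per bin."""
    width = (hi - lo) / bins
    # Map each value to a bin index; clamp the top edge into the last bin.
    counts = Counter(min(int((v - lo) // width), bins - 1) for v in values)
    rows = []
    for b in range(bins):
        left = lo + b * width
        rows.append(f"[{left:+.1f}, {left + width:+.1f}) " + "#" * counts[b])
    return "\n".join(rows)


for name, sample in [("unimodal", unimodal), ("bimodal", bimodal)]:
    print(name, summary(sample))
    print(text_histogram(sample))
```

The two summaries match, yet the histograms show one sample piled up in the middle and the other split into two clumps with nothing in between. That difference in shape is the kind of feature a plot reveals and a mean and standard deviation do not.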

EDA is fun! But it takes practice. One of the best ways to learn how to carry out an EDA is to learn from others as they describe their thought process while they explore data. We attempt to reveal this EDA thinking in the examples and case studies in this book.

EDA can provide valuable insights, but you need to be cautious about the conclusions that you draw. It is important to recognize that EDA can bias your analysis. EDA is a winnowing process and a decision-making process that can impact the replicability of your later, model-based findings. With enough data, if you look hard enough, you can often dredge up something interesting that is entirely spurious.
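The risk of dredging up a spurious finding can be sketched with a small simulation. The setup is an assumption made purely for illustration: 200 features and one outcome, all pure random noise, with 50 rows each. It uses only the standard library, and the helper `pearson` is written out by hand. We search the features for the one most correlated with the outcome.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows, n_features = 50, 200  # illustrative sizes, not from any real data

# Every value is independent noise, so no feature is truly related
# to the outcome.
outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
features = [[rng.gauss(0, 1) for _ in range(n_rows)]
            for _ in range(n_features)]

correlations = [pearson(feature, outcome) for feature in features]

# The "discovery": the feature with the largest absolute correlation.
strongest = max(correlations, key=abs)
# A middle-of-the-pack absolute correlation, for comparison.
typical = sorted(abs(r) for r in correlations)[n_features // 2]

print(f"typical |r|: {typical:.2f}, strongest |r|: {abs(strongest):.2f}")
```

Because every feature is noise, the strongest correlation is an artifact of searching across 200 candidates, not evidence of a real relationship. The same thing can happen, less visibly, when we try many plots and subsets during an exploration and keep only the one that looks interesting.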

The role of EDA in the scientific reproducibility crisis has been noted, and data scientists have cautioned against overdoing it. For example, Gelman and Loken note:

Even in settings where a single analysis has been carried out on the given data, the issue of multiple comparisons [data dredging] emerges because different choices about combining variables, inclusion and exclusion of cases, transformations of variables, tests for interactions in the absence of main effects, and many other steps in the analysis could well have occurred with different data.

It’s a good practice to report and provide the code from your EDA so that others are aware of the choices you made and the paths you took in learning about your data.

The topic of visualization is split across three chapters. In Chapter 9, we used plots to inform us in our data wrangling. The plots there were basic and the findings straightforward. We didn’t dwell on interpretations and choices of plots. In this chapter, we spend more time learning how to choose the right plot and interpret it. We usually take the default parameter settings of the plotting functions since our goal is to make plots quickly as we carry out EDA. In Chapter 11, we’ll provide guidelines for making effective and informative plots and give advice on how to make our visual argument clear and compelling.

According to Tukey, visualization is central to EDA:

The greatest gains from data come from surprises… The unexpected is best brought to our attention by pictures.

To make these pictures, we need to choose an appropriate type of plot, and our choice depends on the kinds of data that have been collected. This mapping between feature type and plot choice is the topic of the next section. From there, we go on to describe how to “read” a plot, what to look for, and how to interpret what you see. We first discuss what to look for in a one-feature plot, then focus on reading relationships between two features, and finally describe plots for three or more features. After we have introduced the visualization tools for EDA, we provide guidelines for carrying out an EDA and then walk through an example as we follow these guidelines.