10. Exploratory Data Analysis
John Tukey, author of the influential book Exploratory Data Analysis [Tukey, 1977], avidly promoted an alternative type of data analysis that broke from the formal world of confidence intervals, hypothesis tests, and modeling. Exploratory Data Analysis (EDA) is now a popular approach to data analysis and, when done correctly, is considered good practice. Tukey describes EDA as a philosophical approach to working with data:
EDA is “an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.”
This is a deviation from the tradition of proposing a hypothesis before looking at the data, testing the hypothesis on the data, and making a decision based on the p-value of the test. Instead, EDA is a creative search for the unexpected using simple summary statistics and visualizations. According to Tukey, “EDA is actively incisive, rather than passively descriptive, with real emphasis on the discovery of the unexpected.”
As a data scientist, you will want to use EDA in every stage of the data life cycle, from checking the quality of your data, to preparing the data for formal modeling, to confirming that your model is reasonable. Indeed, the work described in Section X to clean and transform the data relied heavily on EDA; we couldn’t have known what to clean without it.
In an EDA-type investigation, we enter a process of discovery, constantly asking questions, and diving into uncharted territory to explore ideas. We use plots to uncover features of the data, examine distributions of values, and reveal relationships that cannot be detected from simple numerical summaries. This exploration involves transforming, visualizing, and summarizing data to build and confirm our understanding, identify and address potential issues with the data, and inform subsequent analysis. EDA is creative and fun! And, it takes practice. One of the best ways to learn how to carry out an exploratory data analysis is to learn from others as they describe their thought process while they carry out an EDA, and there are many online sources to help.
But, while EDA can provide valuable insights, you need to be cautious about the conclusions that you draw. It is important to recognize that EDA can bias your view. The analysis is a winnowing process and a decision-making process that can impact the replicability of your later, model-based findings. With enough data, if you look hard, you can dredge up something interesting that is entirely spurious. The role of EDA in the scientific reproducibility crisis has been noted, and data scientists have cautioned against overdoing it. For example, Gelman and Loken note [Gelman and Loken, 2017]:
even in settings where a single analysis has been carried out on the given data, the issue of multiple comparisons [data dredging] emerges because different choices about combining variables, inclusion and exclusion of cases, transformations of variables, tests for interactions in the absence of main effects, and many other steps in the analysis could well have occurred with different data.
It’s good practice to report and provide the code from your EDA so that others are aware of the choices that you made and the paths you took in analyzing your data.
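The dredging risk described above can be illustrated with a small simulation. The sketch below is not from the AKC analysis: it generates an outcome and 200 candidate features that are all pure random noise, then reports the strongest correlation found. The function names (`pearson`, `strongest_spurious_correlation`) and the sizes (50 rows, 200 features) are hypothetical choices made only for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_rows=50, n_features=200, seed=0):
    """Correlate many pure-noise features with a pure-noise outcome.

    Returns the correlation with the largest magnitude, along with the
    full list of correlations. Sizes and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    # The outcome is noise, so no feature can truly be related to it.
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    correlations = []
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        correlations.append(pearson(feature, outcome))
    return max(correlations, key=abs), correlations


best, all_corrs = strongest_spurious_correlation()
print(f"largest |correlation| among {len(all_corrs)} noise features: {abs(best):.2f}")
```

Every feature here is unrelated to the outcome by construction, yet searching across many of them tends to turn up at least one correlation that looks worth reporting. Recording how many comparisons were examined, as the EDA code would show, lets readers judge such a finding in context.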
American Kennel Club. In this chapter, we use the American Kennel Club (AKC) data on registered dog breeds to introduce the various concepts related to EDA. The American Kennel Club (akc.org), a non-profit founded in 1884, has the stated mission to “advance the study, breeding, exhibiting, running and maintenance of purebred dogs.” The AKC organizes events like its National Championship, Agility Invitational, and Obedience Classic, and mixed-breed dogs are welcome to participate in most events. Information is Beautiful (informationisbeautiful.net) provides a dataset with information from the AKC on 172 breeds. Their visualization incorporates many features of the breeds and is fun to examine.
The AKC dataset contains several different kinds of features, and we have extracted a handful of them that show the variety of types of information that might be available in a dataset. These features include the name of the breed, its longevity, weight, and height, and other information such as its suitability for children and the number of repetitions needed to learn a new trick. Each record in the dataset is a breed of dog, and the information provided is meant to be typical of that breed.
The EDA process typically involves creating simple visualizations. To do this, we need to choose an appropriate visualization for a feature, and our choice depends on the kind of data that have been collected. This mapping of plot type to feature type is the topic of Section 9.1. From there, we go on to describe how to “read” a plot, what to look for, and how to interpret what you see. Section 9.2 discusses what to look for in a one-variable plot, Section 9.3 focuses on reading relationships between two variables, and Section 9.4 describes plots for three or more variables. After that, we present guiding questions for carrying out an EDA (Section 9.5) and walk through an example that follows these guidelines (Section 9.6).
The topic of visualization is split across this chapter, Chapter 9, and Chapter 11. In this chapter, we usually take the default parameter settings of the plotting functions. In the visualization chapter, we provide guidelines for making effective and informative plots and give advice on how to make your visual argument clear and compelling.