1.3. Understanding the Data

After obtaining data, we want to understand the data we have. A key part of this is exploratory data analysis, where we create plots to uncover interesting patterns and summarize the data visually. We also look for problems in the data. Most real-world datasets have missing values, implausible values, or other anomalies that we need to account for.

In our experience, this stage of the lifecycle is highly iterative. Understanding the data can send us to any of the other stages in the data science lifecycle. As we understand the data more, we often revise our research questions or realize that we need to get data from a different source.

This stage incorporates both programming and statistical knowledge. To manipulate data, we write programs that clean data, transform data, and create plots. To find patterns and trends in the data, we use summary statistics and statistical models.
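As a minimal sketch of how the programming and statistical sides fit together, the example below checks a small dataset for problems and then computes summary statistics on the values that remain. The readings, the -999.0 sentinel, the plausible range, and the summarize function are all made up for illustration; a real analysis would choose these based on how the data were collected.

```python
import statistics

# Hypothetical daily temperature readings in degrees Celsius.
# None marks a missing value; -999.0 is a made-up sentinel for a failed sensor.
readings = [21.5, 22.0, None, 23.5, -999.0, 22.5, 24.0]

# Assumed plausible range for this made-up example.
PLAUSIBLE_RANGE = (-50.0, 60.0)


def summarize(values, plausible_range):
    """Count missing and implausible values, then summarize the rest."""
    low, high = plausible_range
    missing = [v for v in values if v is None]
    present = [v for v in values if v is not None]
    implausible = [v for v in present if not low <= v <= high]
    clean = [v for v in present if low <= v <= high]
    return {
        "n_total": len(values),
        "n_missing": len(missing),
        "n_implausible": len(implausible),
        "mean": statistics.mean(clean),
        "median": statistics.median(clean),
        "min": min(clean),
        "max": max(clean),
    }


summary = summarize(readings, PLAUSIBLE_RANGE)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Counting the problem values before dropping them keeps a record of how much data was affected, which helps when deciding whether the anomalies need a closer look or a different data source.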

When our research questions are purely exploratory, we are only concerned with patterns in the data. In these cases, our analysis can end at this stage of the lifecycle. When our research questions are inferential or predictive, however, we proceed to the next stage of the lifecycle: understanding the world.