1.1. Asking a Question

Asking questions lies at the heart of data science because different kinds of questions require different kinds of analyses. For example, “How have house prices changed over time?” is very different from “How will this new law affect house prices?” Understanding our research question tells us what data we need, what patterns to look for, and how to interpret our results. In this book, we focus on three broad categories of questions: exploratory, inferential, and predictive.

Exploratory questions aim to find out information about the data that we have. For example, we can use environmental data to ask: have average global temperatures risen in the past 40 years? The key part of an exploratory question is that it aims to summarize and interpret trends in the data without quantifying whether these trends will hold in data that we don’t have. “How many people voted in the last election?” is an exploratory question. “How many people will vote in the next election?” is not an exploratory question.

Inferential questions, on the other hand, do quantify whether trends found in our data will hold in unseen data. Let’s say we have data from a sample of hospitals across the US. We can ask whether air pollution is correlated with lung disease for the individuals in our sample; this is an exploratory question. We can also ask whether air pollution is correlated with lung disease for the entire US; this is an inferential question, since we’re using our sample to infer a correlation for the entire US.
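To make the distinction concrete, here is a minimal sketch in Python. It assumes a made-up sample in which each record holds a pollution reading and a lung disease rate for one sampled hospital. The data, the variable names, and the choice of a percentile bootstrap are all illustrative assumptions rather than a prescribed method. Computing the correlation in the sample answers the exploratory question. Attaching an interval that reflects sampling variability is one possible way to approach the inferential question.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def bootstrap_interval(xs, ys, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the correlation."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        # Resample records with replacement, keeping each (x, y) pair together
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_resamples)]
    upper = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Synthetic (made-up) sample of 200 hospitals, for illustration only
rng = random.Random(42)
pollution = [rng.uniform(5, 35) for _ in range(200)]
disease_rate = [0.02 + 0.001 * p + rng.gauss(0, 0.01) for p in pollution]

# Exploratory: what is the correlation in the data we have?
sample_r = pearson(pollution, disease_rate)

# Inferential: what range of correlations is plausible beyond this sample?
low, high = bootstrap_interval(pollution, disease_rate)

print(f"Sample correlation: {sample_r:.2f}")
print(f"95% bootstrap interval: ({low:.2f}, {high:.2f})")
```

The exploratory answer is a single number describing the sample. The inferential answer adds a statement about uncertainty, and it is only meaningful if the sample was drawn in a way that represents the population of interest.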

Note

Be careful not to confuse an inferential question with a question about causality. An inferential question asks whether a correlation exists. “Are people who are exposed to more air pollution more likely to develop lung disease?” is an inferential question. “Does air pollution cause lung disease?” is causal, not inferential. We typically cannot answer causal questions unless we have a randomized experiment (or assume one).

Predictive questions, like inferential questions, aim to quantify trends for unseen data. While inferential questions look for trends in the population, predictive questions aim to make predictions for individuals. An inferential question could ask: “What factors increase voter turnout in the US?” A predictive question could instead ask: “Given a person’s income and education, how likely are they to vote?”
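The voting example can be sketched the same way. The code below assumes a made-up set of survey records with an income bracket, an education level, and whether the person voted. The turnout probabilities used to generate the data are invented for illustration, and the helper names are hypothetical. It estimates the chance of voting for an individual by counting voters within each income and education group, which is one very simple kind of predictive model.

```python
import random
from collections import defaultdict

INCOMES = ["low", "middle", "high"]
EDUCATIONS = ["high school", "college"]

rng = random.Random(0)


def make_record():
    """Generate one synthetic (made-up) survey record."""
    income = rng.choice(INCOMES)
    education = rng.choice(EDUCATIONS)
    # Hypothetical turnout probabilities, chosen only for illustration
    p_vote = 0.4 + 0.1 * INCOMES.index(income) + 0.15 * EDUCATIONS.index(education)
    return income, education, rng.random() < p_vote


records = [make_record() for _ in range(5000)]


def fit_vote_model(records):
    """Estimate P(vote | income, education) by counting within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [voters, total]
    for income, education, voted in records:
        counts[(income, education)][0] += voted
        counts[(income, education)][1] += 1
    return {group: voters / total for group, (voters, total) in counts.items()}


def predict_vote_probability(model, income, education):
    """Predicted chance of voting for one person with these characteristics."""
    return model[(income, education)]


model = fit_vote_model(records)
p = predict_vote_probability(model, "middle", "college")
print(f"Estimated chance of voting (middle income, college): {p:.2f}")
```

The output here is a prediction for one person with given characteristics, rather than a statement about which factors matter across the population. In practice, we would also check such predictions against data that was not used to fit the model.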

As we do a data analysis, we often change and refine our research questions. Each time we do so, it’s important to consider what kind of question we want to answer. For a more detailed breakdown of the types of research questions, see [Leek and Peng, 2015].

In the next section, we’ll talk about how our question affects the data we want to obtain.