2. Data Scope

As data scientists, we use data to answer questions, and the quality of the data collection process can significantly impact the validity and accuracy of the data, the strength of the conclusions we draw from an analysis, and the decisions we make. In this chapter, we describe a general approach for understanding data collection and evaluating how useful the data are for addressing the question of interest. Ideally, we aim for data to be representative of the phenomenon that we are studying, whether that phenomenon is a population characteristic, a physical model, or some type of social behavior. Typically, the data do not contain complete information (the scope is restricted in some way), yet we want to use the data to accurately describe a population, estimate a scientific quantity, infer the form of a relationship between features, or predict future outcomes. In all of these situations, if our data are not representative of the object of our study, then our conclusions can be limited, possibly misleading, or even wrong.

To motivate the need to think about these issues, we begin with an example of the power of big data and what can go wrong (Section sec:scope_bigdata). We then provide a framework that can help you connect the goal of your study with the data collection process. We refer to this as the scope of the data. Sections sec:scope_construct and … provide terminology for describing the data scope, along with examples from surveys, government data, scientific instruments, and online resources. Later, in Section sec:scope_accuracy, after we have identified issues with scope, we consider what it means for data to be accurate. There, we introduce different forms of bias and variation, and describe the conditions under which they can arise. Throughout, the examples cover the spectrum of data that you may work with as a data scientist; they come from science, political elections, public health, and online communities.