2. Data Scope

As data scientists, we gather data to answer questions, and the quality of the data collection process can significantly affect the validity and accuracy of the data, the strength of the conclusions we draw from an analysis, and the decisions we make. In this chapter, we describe a general construct for understanding data collection and for evaluating how useful the data are for addressing the question of interest. Ideally, we aim for data to be representative of the phenomenon that we are studying, whether that phenomenon is a population characteristic, a physical model, or some type of social behavior. Typically, the data do not contain complete information; that is, their scope is restricted in some way. Yet we still want to use the data to accurately describe a population, estimate a scientific quantity, infer the form of a relationship between features, or predict future outcomes. In all of these situations, if our data are not representative of the object of our study, then our conclusions can be limited, possibly misleading, or even wrong.
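To make the cost of unrepresentative data concrete, here is a minimal simulation sketch. Everything in it is a made-up assumption for illustration: the population is synthetic (a numeric characteristic drawn from a normal distribution with mean 20 and standard deviation 5), the function name `simulate` is hypothetical, and the restriction to values above 20 is just one simple way that scope might be limited. The sketch compares the mean of a simple random sample with the mean of a sample drawn only from the part of the population that the restricted collection process can reach.

```python
import random
import statistics


def simulate(seed=0, population_size=100_000, sample_size=1_000):
    """Compare a representative sample with a scope-restricted one.

    All values are synthetic; the distribution and the cutoff of 20
    are illustrative assumptions, not data from any real study.
    """
    rng = random.Random(seed)

    # Hypothetical population characteristic, e.g., hours per week.
    population = [rng.gauss(20, 5) for _ in range(population_size)]
    true_mean = statistics.fmean(population)

    # Representative sample: every unit is equally likely to be chosen.
    random_sample = rng.sample(population, sample_size)

    # Restricted scope: only units with values above 20 can be observed.
    reachable = [x for x in population if x > 20]
    restricted_sample = rng.sample(reachable, sample_size)

    return (
        true_mean,
        statistics.fmean(random_sample),
        statistics.fmean(restricted_sample),
    )


if __name__ == "__main__":
    true_mean, random_mean, restricted_mean = simulate()
    print(f"population mean:        {true_mean:.2f}")
    print(f"random sample mean:     {random_mean:.2f}")
    print(f"restricted sample mean: {restricted_mean:.2f}")
```

Under these assumptions, the random sample's mean should land close to the population mean, while the restricted sample's mean sits well above it no matter how large the sample is. Collecting more data does not repair a mismatch between the data's scope and the object of study.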