2. Generalizing from Data

Data scientists use data to draw conclusions about our world. But our data are usually incomplete. For instance, election pollsters typically survey a few thousand people, yet over 150 million votes were cast in the 2020 US election. How can we draw conclusions about large populations using small samples? In other words, how do we generalize from data?

Here’s a basic answer to this question: the more our sample “looks” like our population, the more confidently we can generalize from it. Making this judgment requires us to look closely at how our data were gathered before we write any code or fit any models. In this chapter, we’ll explain what to look for in a sample to determine whether we can use it to generalize to a population.