16. Theory for Inference and Prediction

When you want to generalize your findings beyond a description of the data at hand to a larger setting, your data need to be representative of that larger world. For example, you may want to predict air quality at a future time based on a sensor reading (Chapter 12), test whether an incentive improves the productivity of contributors based on experimental findings (Chapter 3), or construct an interval estimate for the amount of time you spend waiting for a bus (Chapter 5). We have touched on all of these scenarios in earlier chapters; in this chapter, we formalize the frameworks for prediction and inference.

At the core of these frameworks is the notion of a distribution, be it a population, empirical (aka sample), or probability distribution. Understanding the connections between these notions leads to the basics of hypothesis testing, confidence intervals, prediction bands, and risk. We begin with a brief review of the urn model, first introduced in Chapter 3, then introduce formal definitions of hypothesis tests, confidence intervals, and prediction bands. We use simulation in the examples, and we introduce the bootstrap, a special case of simulation. We wrap up the chapter with formal definitions of expectation and variance, and we introduce the bias-variance decomposition, an essential tool for understanding risk, regularization, and overfitting.