15. Linear Models

At this point in the book, we’ve covered the first three stages of the data science lifecycle. We’ve talked about asking questions and obtaining data. We’ve also used exploratory data analysis (Chapter 10) and simple models (Chapter 4) to understand the data. Now, we’ll dive deeper into statistical models, which will also lead us into the last stage of the lifecycle—understanding the world. In this chapter, we’ll introduce linear models, which let us model relationships between variables for the first time in this book.

Simply put, linear models estimate how much our measurements vary together. Being able to model relationships opens the door to all kinds of useful data analyses. We can use these models to make predictions. For example, data scientists use linear models to predict the future sales of a product based on past trends. We can also use these models to make inferences about how well one variable predicts the outcome. We saw an example of this in Chapter 12, where we wanted to know how readings from inexpensive air quality sensors were related to readings from expensive ones. In that case study, understanding how the two measurements varied together enabled us to calibrate the inexpensive sensors.

In this chapter, we’ll start by explaining the simple linear model, which models the relationship between a single predictor variable and the outcome. We’ll explain how to fit this model using the loss minimization approach first introduced in Chapter 4. We’ll then introduce the multiple linear model, which models the relationship between multiple predictor variables and the outcome. To fit this model, we’ll use linear algebra to reveal the geometry of the modeling problem. Finally, we’ll go over feature engineering techniques that let us use linear models to model all sorts of relationships, even nonlinear ones.
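As a small preview of the simple linear model, the sketch below fits a line to a single predictor by minimizing mean squared error, using the standard closed-form least squares solution. The function name `fit_simple_linear` and the data are made up for illustration; they are not taken from the case studies in this book, and the chapter develops the fitting approach in more detail.

```python
def fit_simple_linear(x, y):
    """Return (intercept, slope) of the line that minimizes mean squared error.

    Hypothetical helper for illustration: x and y are equal-length
    sequences of numbers, and x must contain at least two distinct values.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)

    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Made-up, noise-free data that lie exactly on the line y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0 + 2.0 * xi for xi in x]

intercept, slope = fit_simple_linear(x, y)
print(intercept, slope)
```

Because the example data have no noise, the fitted intercept and slope recover the line that generated them; with real measurements, the fitted line would only approximate the relationship.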