15. Linear Models

At this point in the book, we’ve covered the first three stages of the data science lifecycle in detail. We’ve talked about asking questions and obtaining data, and we’ve used exploratory data analysis (Chapter 10) to understand the data. In this chapter, we extend the simple model introduced in Chapter 4 to more complex linear models. These models let us describe relationships between variables, and they lead us into the last stage of the lifecycle: understanding the world.

Simply put, linear models estimate how features vary together. Being able to model relationships opens the door to all kinds of useful data analyses. We can use these models to make predictions. For example, data scientists use linear models to predict air quality based on air sensor measurements and weather conditions (see Chapter 12). In that case study, understanding how the two measurements varied together enabled us to calibrate the cheap sensors. We can also use models to make inferences about the form of a relationship. For example, we use a linear model in Chapter 18 to infer that the coefficients for length and girth are 1 and 2, respectively, in the model for a donkey’s weight: \(1 \times \text{Length} + 2 \times \text{Girth} - 175\). In that case study, the model enabled veterinarians to prescribe medication for a sick donkey. Finally, models can help us explore relationships and gain insights. For example, we can explore the relationships among factors correlated with upward mobility, such as residential segregation, income inequality, and the quality of K-12 education. In this chapter, we carry out this kind of descriptive analysis, which researchers have used to study intergenerational mobility.

We start by describing the simple linear model, which models the relationship between a single explanatory variable and an outcome variable. We explain how to fit this model using the loss minimization approach first introduced in Chapter 4. Then, we introduce the multiple linear model, which models the relationship between multiple explanatory variables and the outcome. To fit this model, we use linear algebra to reveal the geometry of the modeling problem. Finally, we go over feature engineering techniques that let us include categorical features as explanatory variables and transform features so that a linear model can capture a wider variety of relationships.
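As a preview of the fitting step, here is a minimal sketch of a simple linear model fit by loss minimization. It assumes squared loss, for which the minimizing intercept and slope have a closed form. The function name `fit_simple_linear` and the data are hypothetical: the data are synthetic, generated only for illustration. The last lines cross-check the result against NumPy's general least-squares solver, using a design matrix with a column of ones; the same matrix form carries over to models with several explanatory variables.

```python
import numpy as np


def fit_simple_linear(x, y):
    """Return (intercept, slope) that minimize mean squared error
    for the model y = intercept + slope * x.

    Hypothetical helper for illustration; squared loss is an assumption.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Closed-form minimizer of squared loss for one explanatory variable
    slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Synthetic data, made up for this sketch: a line with intercept 3 and
# slope 2, plus random noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=200)

intercept, slope = fit_simple_linear(x, y)

# Cross-check with the general least-squares solver: the design matrix
# has a column of ones (for the intercept) and a column holding x
X = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"closed form:   intercept={intercept:.3f}, slope={slope:.3f}")
print(f"least squares: intercept={theta[0]:.3f}, slope={theta[1]:.3f}")
```

The two printed lines should agree up to floating-point error, and both estimates should land near the values used to generate the data.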