2.6 Multiple Regression, Model Selection, and Prediction
In this project, you will explore how different characteristics of houses can predict their prices. Using a data set of luxury homes in California, you’ll learn how to:
- Build and select models using a systematic approach
- Interpret the results of a multiple regression
- Make and evaluate predictions
The guiding question throughout is: “What house characteristics predict price?”
Run this code to get started:

library(tidyverse)
set.seed(1234)

housing <- read_csv("https://raw.githubusercontent.com/cobriant/320data/refs/heads/master/Housing.csv") %>%
  mutate(train = sample(c(0, 1), size = 545, replace = T))

house_train <- housing %>%
  filter(train == 1)

house_test <- housing %>%
  filter(train == 0)
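Because sample(c(0, 1), size = 545, replace = T) assigns each house to the test group (train == 0) or the training group (train == 1) with equal probability, the split should be roughly half and half. An optional quick check (not part of the assignment):

housing %>% count(train)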
Forward Selection Model Building
When building a multiple regression model, we need a systematic way to choose which predictors to include. There are several approaches:
- Forward selection: Start with no predictors and add them one by one
- Backward selection: Start with all predictors and remove them one by one
- Mixed approach: Combination of adding and removing predictors
We’ll practice forward selection in this assignment. Following a pre-defined process like this helps us avoid p-hacking (manipulating a data analysis until it turns up statistically significant patterns).
a) Create a Helper Function
First, we need a function to extract the adjusted R-squared value from our models. Complete the following function:
# adj_r_squared <- function(lm_obj) {
#   lm_obj %>%
#     broom::glance() %>%
#     pull(_____)  # What column name gives us adjusted R-squared?
# }
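If you’re not sure which column to pull(), it can help to see everything broom::glance() reports for a fitted model. A quick sketch on a built-in toy model (mtcars, not the housing data):

# Adjusted R-squared penalizes R-squared for the number of predictors k:
#   adj. R-squared = 1 - (1 - R-squared) * (n - 1) / (n - k - 1)
lm(mpg ~ wt + hp, data = mtcars) %>%
  broom::glance() %>%
  names()
# among the columns: "r.squared", "adj.r.squared", "sigma", "statistic", "p.value", ...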
b) Forward Selection Process
We’ll build our model following these steps:
- Find the single predictor that gives the highest adjusted R-squared
- Add one predictor at a time if it improves adjusted R-squared by at least 0.01
- Stop when no additional predictor meets this threshold
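The templates below lean on two helpers: reformulate(), which builds a model formula from character strings, and purrr’s map_dbl() with set_names(), which applies a function to each element of a vector and returns a named numeric result. A minimal illustration with generic names (not the housing variables):

reformulate(c("x1", "x2"), response = "y")
# y ~ x1 + x2

map_dbl(set_names(c("a", "bb", "ccc")), nchar)
# a named vector: a = 1, bb = 2, ccc = 3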
Complete the following code to implement forward selection for this example:
# Get all potential predictor variables
expl_vars <- house_train %>% select(-price, -train) %>% names()
# Step 1: Find best single predictor
# map_dbl(
#   set_names(expl_vars),
#   ~ lm(reformulate(_____, "price"), house_train) %>% adj_r_squared()
# ) %>%
#   sort(decreasing = T)
# The best predictor was _____ with adjusted R-squared = _____
# Step 2: Find best second predictor
# map_dbl(
#   set_names(expl_vars),
#   ~ lm(reformulate(c(.x, "_____"), "price"), house_train) %>% adj_r_squared()
# ) %>%
#   `+`(-0.3269647108) %>%  # Why do we subtract this number?
#   sort(decreasing = T)
# The next best predictor was _____, which added _____ to the adjusted R-squared
# Continue this process until no predictor adds 0.01 to the adjusted R-squared...
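Writing out each step by hand is the point of the exercise, but if you want to check your work, here is one way the loop could be automated. This is only a sketch: it assumes your adj_r_squared() helper from part (a) works and that expl_vars is defined as above.

selected <- c()       # predictors chosen so far
best_so_far <- 0      # adjusted R-squared of the current model

repeat {
  remaining <- setdiff(expl_vars, selected)
  if (length(remaining) == 0) break
  scores <- map_dbl(
    set_names(remaining),
    ~ lm(reformulate(c(selected, .x), "price"), house_train) %>% adj_r_squared()
  )
  # stop when the best remaining predictor improves adjusted R-squared by less than 0.01
  if (max(scores) - best_so_far < 0.01) break
  selected <- c(selected, names(which.max(scores)))
  best_so_far <- max(scores)
}

selected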
c) After completing the forward selection process, we end up with this model:
# lm(price ~ ___, data = house_train) %>%
#   broom::tidy()
d) Which variables were included in the final model? Which were excluded?
e) For each coefficient in the final model, write an interpretation:
- If area increases by 1 square foot, the predicted price increases by $_____ on average, holding all else constant
- [Complete for other variables…]
Prediction
f) Now we’ll test how well our model predicts prices for houses it hasn’t seen before. Complete the code below:
# house_test %>%
#   mutate(
#     prediction = predict(lm(_____, data = ____), newdata = _____),
#     test_error = (_____)^2
#   ) %>%
#   summarize(mean_test_error = sqrt(mean(_____)))
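The quantity part f computes is a root mean squared prediction error (RMSE): square each house’s prediction error, average, then take the square root so the answer is back in dollars. On made-up numbers (not the housing data):

actual    <- c(100, 200, 300)
predicted <- c(110, 190, 310)
sqrt(mean((actual - predicted)^2))  # = 10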
g) Interpret your answer to question f: what is the average prediction error for our model?
h) How does this compare to simply predicting the same price for every house, using the intercept-only model below (which predicts the training-sample mean)? What does this tell us about the usefulness of our model?
lm(price ~ 1, data = house_train) %>%
  broom::tidy()
# A tibble: 1 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4925400. 122697. 40.1 5.29e-117
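Note that the intercept-only model price ~ 1 predicts the same number for every house: the mean of price in the training data. You can verify that it matches the intercept estimate above:

mean(house_train$price)  # equals the (Intercept) estimate, about 4925400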
# house_test %>%
#   mutate(
#     prediction = predict(lm(____, data = ____), newdata = ____),
#     test_error = (____)^2
#   ) %>%
#   summarize(mean_test_error = sqrt(mean(____)))