2.6 Multiple Regression, Model Selection, and Prediction
In this project, you will explore how different characteristics of houses can predict their prices. Using a data set of luxury homes in California, you’ll learn how to:
- Build and select models using a systematic approach
- Interpret the results of a multiple regression
- Make and evaluate predictions
The guiding question throughout is: “What house characteristics predict price?”
Run this code to get started:

library(tidyverse)
set.seed(1234)

housing <- read_csv("https://raw.githubusercontent.com/cobriant/320data/refs/heads/master/Housing.csv") %>%
  mutate(train = sample(c(0, 1), size = 545, replace = T))

house_train <- housing %>%
  filter(train == 1)

house_test <- housing %>%
  filter(train == 0)
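Because sample(c(0, 1), size = 545, replace = T) assigns each house to the test group (train == 0) or the training group (train == 1) with equal probability, the split should be roughly half and half. An optional quick check (not part of the assignment):

housing %>% count(train)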
Forward Selection Model Building
When building a multiple regression model, we need a systematic way to choose which predictors to include. There are several approaches:
- Forward selection: Start with no predictors and add them one by one
- Backward selection: Start with all predictors and remove them one by one
- Mixed approach: Combination of adding and removing predictors
We’ll practice forward selection in this assignment. Following a pre-defined process like this helps us avoid p-hacking (manipulating a data analysis until it turns up statistically significant patterns).
a) Create a Helper Function
First, we need a function to extract the adjusted R-squared value from our models. Complete the following function:
# adj_r_squared <- function(lm_obj) {
#   lm_obj %>%
#     broom::glance() %>%
#     pull(_____)  # What column name gives us adjusted R-squared?
# }
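If you’re not sure which column to pull(), it can help to see everything broom::glance() reports for a fitted model. A quick sketch on a built-in toy model (mtcars, not the housing data):

# Adjusted R-squared penalizes R-squared for the number of predictors k:
#   adj. R-squared = 1 - (1 - R-squared) * (n - 1) / (n - k - 1)
lm(mpg ~ wt + hp, data = mtcars) %>%
  broom::glance() %>%
  names()
# among the columns: "r.squared", "adj.r.squared", "sigma", "statistic", "p.value", ...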
b) Forward Selection Process
We’ll build our model following these steps:
- Find the single predictor that gives the highest adjusted R-squared
- Add one predictor at a time if it improves adjusted R-squared by at least 0.01
- Stop when no additional predictor meets this threshold
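The templates below lean on two helpers: reformulate(), which builds a model formula from character strings, and purrr’s map_dbl() with set_names(), which applies a function to each element of a vector and returns a named numeric result. A minimal illustration with generic names (not the housing variables):

reformulate(c("x1", "x2"), response = "y")
# y ~ x1 + x2

map_dbl(set_names(c("a", "bb", "ccc")), nchar)
# a named vector: a = 1, bb = 2, ccc = 3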
Complete the following code to implement forward selection for this example:
# Get all potential predictor variables
expl_vars <- house_train %>% select(-price, -train) %>% names()
# Step 1: Find best single predictor
# map_dbl(
#   set_names(expl_vars),
#   ~ lm(reformulate(_____, "price"), house_train) %>% adj_r_squared()
# ) %>%
#   sort(decreasing = T)
# The best predictor was _____ with adjusted R-squared = _____
# Step 2: Find best second predictor
# map_dbl(
#   set_names(expl_vars),
#   ~ lm(reformulate(c(.x, "_____"), "price"), house_train) %>% adj_r_squared()
# ) %>%
#   `+`(-0.3269647108) %>%  # Why do we subtract this number?
#   sort(decreasing = T)
# The next best predictor was _____, which added _____ to the adjusted R-squared
# Continue this process until no predictor adds 0.01 to the adjusted R-squared...
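Writing out each step by hand is the point of the exercise, but if you want to check your work, here is one way the loop could be automated. This is only a sketch: it assumes your adj_r_squared() helper from part (a) works and that expl_vars is defined as above.

selected <- c()       # predictors chosen so far
best_so_far <- 0      # adjusted R-squared of the current model

repeat {
  remaining <- setdiff(expl_vars, selected)
  if (length(remaining) == 0) break
  scores <- map_dbl(
    set_names(remaining),
    ~ lm(reformulate(c(selected, .x), "price"), house_train) %>% adj_r_squared()
  )
  # stop when the best remaining predictor improves adjusted R-squared by less than 0.01
  if (max(scores) - best_so_far < 0.01) break
  selected <- c(selected, names(which.max(scores)))
  best_so_far <- max(scores)
}

selected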
c) After completing the forward selection process, we end up with this model:
# lm(price ~ ___, data = house_train) %>%
#   broom::tidy()
d) Which variables were included in the final model? Which were excluded?
e) For each coefficient in the final model, write an interpretation:
- If area increases by 1 square foot, the predicted price increases by $_____ on average, holding all else constant
- [Complete for other variables…]
Prediction
f) Now we’ll test how well our model predicts prices for houses it hasn’t seen before. Complete the code below:
# house_test %>%
#   mutate(
#     prediction = predict(lm(_____, data = ____), newdata = _____),
#     test_error = (_____)^2
#   ) %>%
#   summarize(mean_test_error = sqrt(mean(_____)))
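The quantity part f computes is a root mean squared prediction error (RMSE): square each house’s prediction error, average, then take the square root so the answer is back in dollars. On made-up numbers (not the housing data):

actual    <- c(100, 200, 300)
predicted <- c(110, 190, 310)
sqrt(mean((actual - predicted)^2))  # = 10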
g) Interpret your answer to question f: what is the average prediction error for our model?
h) How does this compare to simply predicting the same price for every house, using the intercept-only model below (which predicts the training-sample mean)? What does this tell us about the usefulness of our model?
lm(price ~ 1, data = house_train) %>%
  broom::tidy()
# A tibble: 1 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4925400. 122697. 40.1 5.29e-117
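Note that the intercept-only model price ~ 1 predicts the same number for every house: the mean of price in the training data. You can verify that it matches the intercept estimate above:

mean(house_train$price)  # equals the (Intercept) estimate, about 4925400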
# house_test %>%
#   mutate(
#     prediction = predict(lm(____, data = ____), newdata = ____),
#     test_error = (____)^2
#   ) %>%
#   summarize(mean_test_error = sqrt(mean(____)))