5.3 Bagging

We now work with the Boston data set, which contains housing values for 506 suburbs of Boston. The dependent variable we will predict is medv, the median value of owner-occupied homes (in thousands of dollars); the remaining 13 variables serve as independent variables (see ?Boston for their descriptions). First, load the data and add an indicator that splits the observations into training (about two thirds) and test sets:

library(tidyverse)
library(MASS)
attach(Boston)
set.seed(1234)
boston <- as_tibble(Boston) %>%
  mutate(train = sample(c(0, 1), prob = c(1/3, 2/3), replace = T, size = 506))
  1. Use tree to create a decision tree for the training set.
library(tree)
# train <- ____
# boston.tree <- tree(medv ~ ., data = train)
# 
# boston.tree
# plot(boston.tree)
# text(boston.tree)
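One way the blanks might be filled in, assuming the boston tibble built above; dplyr::select is spelled out because MASS also exports a select function that masks it:

train <- boston %>%
  filter(train == 1) %>%
  dplyr::select(-train)   # drop the indicator so it is not used as a predictor

boston.tree <- tree(medv ~ ., data = train)

boston.tree
plot(boston.tree)
text(boston.tree)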
  1. Find the mean squared error for the tree using the test data.
# test <- ___

# test %>%
#   mutate(prediction = predict(___, newdata = test)) %>%
#   summarize(MSE = ___)
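A possible completion, assuming the boston tibble and boston.tree from above; the MSE is the mean squared difference between observed and predicted medv on the held-out rows:

test <- boston %>%
  filter(train == 0) %>%
  dplyr::select(-train)

test %>%
  mutate(prediction = predict(boston.tree, newdata = test)) %>%
  summarize(MSE = mean((medv - prediction)^2))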
  1. Observe the high variance of decision trees:

Run this code a couple of times to see how much the tree changes based on the training data.

# boston.tree <- train %>% 
#   slice_sample(n = 150) %>%
#   tree(medv ~ ., data = .)
# 
# boston.tree
# plot(boston.tree)
# text(boston.tree, pretty = 0)

# test %>%
#   mutate(prediction = predict(___, newdata = test)) %>%
#   summarize(MSE = ___)
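As a sketch (assuming the train and test tibbles from above), a short sapply loop makes the variability visible without re-running the chunk by hand: each iteration fits a tree on a different 150-row subsample and reports its test MSE.

sapply(1:5, function(i) {
  fit <- train %>%
    slice_sample(n = 150) %>%
    tree(medv ~ ., data = .)
  mean((test$medv - predict(fit, newdata = test))^2)
})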

Decision trees are known to have very high variance: different training sets can produce substantially different trees. Bagging (bootstrap aggregating) addresses this by averaging the predictions of many trees. Ideally we would draw many training sets from the population, build a separate prediction model on each, and average the resulting predictions. Since we generally have only one training set, we bootstrap instead, taking repeated samples with replacement from that single training set.

We build B regression trees on the B bootstrapped training sets and average their predictions. The trees are grown deep and left unpruned, so each individual tree has high variance but low bias; averaging across the B trees reduces the variance and typically yields a lower test error than a single tree. B is usually set to a fairly large number, such as 100.

The gain in prediction accuracy comes at the expense of interpretability: the averaged model is no longer a single tree that can be drawn and read. The importance of each predictor can still be assessed by recording the total amount by which the residual sum of squares (RSS) decreases due to splits on that predictor, averaged over all B trees; a large value indicates an important predictor.
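To make the averaging step concrete, here is a minimal sketch of bagging done by hand with tree(), assuming the train and test tibbles from the exercises above (the object names and B = 100 are illustrative):

set.seed(1234)
B <- 100
n <- nrow(train)

# one column of test-set predictions per bootstrapped tree
preds <- sapply(1:B, function(b) {
  boot <- train[sample(n, n, replace = TRUE), ]         # bootstrap sample of the training set
  predict(tree(medv ~ ., data = boot), newdata = test)  # unpruned tree fit to the bootstrap sample
})

# the bagged prediction averages over the B trees
test %>%
  mutate(prediction = rowMeans(preds)) %>%
  summarize(MSE = mean((medv - prediction)^2))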

Bagging with randomForest

install.packages("randomForest")
# library(randomForest)
# set.seed(1234)
# bag.boston <- randomForest(medv ~ ., data = train, mtry = 13, importance = T)
# 
# bag.boston
# 
# importance(bag.boston)

Setting mtry equal to the number of predictors means all of them are considered at every split, which is exactly bagging (a random forest with no random restriction on the candidate variables). The MASS version of Boston has 13 predictors once medv is set aside, so mtry = 13 here.
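As a sketch (assuming train was built as in the earlier exercises, with the indicator column removed), mtry can also be set from the data instead of hard-coded, and varImpPlot gives a quick visual summary of the two importance measures:

p <- ncol(train) - 1   # number of predictors: every column except medv
bag.boston <- randomForest(medv ~ ., data = train, mtry = p, importance = TRUE)
varImpPlot(bag.boston)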

  1. Find the mean squared error for the bagging approach using the test data.
# test %>%
#   mutate(prediction = predict(___, newdata = test)) %>%
#   summarize(MSE = ___)
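A possible completion, assuming bag.boston and the test tibble from above:

test %>%
  mutate(prediction = predict(bag.boston, newdata = test)) %>%
  summarize(MSE = mean((medv - prediction)^2))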