library(tidyverse)
library(MASS)
attach(Boston)
set.seed(1234)
boston <- as_tibble(Boston) %>%
  mutate(train = sample(c(0, 1), prob = c(1/3, 2/3), replace = TRUE, size = 506))
5.3 Bagging
Load a new data set Boston containing housing values in 506 suburbs of Boston. Our dependent variable to predict will be medv: the median value of owner-occupied homes (measured in thousands of dollars). Here are the independent variables:
- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 square feet
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if the tract bounds the river; 0 if not)
- nox: nitrogen oxides concentration (in parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- lstat: lower status of the population (percent)
- Use tree to create a decision tree for the training set.
library(tree)
# train <- ____
# boston.tree <- tree(medv ~ ., data = train)
#
# boston.tree
# plot(boston.tree)
# text(boston.tree)
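A possible completion (not the only one), assuming the train indicator column created in the setup chunk; note that MASS masks dplyr's select(), so it is called with dplyr:: here:

train <- boston %>%
  filter(train == 1) %>%
  dplyr::select(-train)   # drop the indicator so it is not used as a predictor
boston.tree <- tree(medv ~ ., data = train)

boston.tree
plot(boston.tree)
text(boston.tree)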
- Find the mean squared error for the tree using the test data.
# test <- ___
# test %>%
# mutate(prediction = predict(___, newdata = test)) %>%
# summarize(MSE = ___)
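A sketch of one way to fill in the blanks, building the test set from the same indicator and computing the usual mean squared error:

test <- boston %>%
  filter(train == 0) %>%
  dplyr::select(-train)
test %>%
  mutate(prediction = predict(boston.tree, newdata = test)) %>%
  summarize(MSE = mean((medv - prediction)^2))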
- Observe the high variance of tree: run this code a couple of times to see how much the tree changes based on the training data.
# boston.tree <- train %>%
# slice_sample(n = 150) %>%
# tree(medv ~ ., data = .)
#
# boston.tree
# plot(boston.tree)
# text(boston.tree, pretty = 0)
# test %>%
# mutate(prediction = predict(___, newdata = test)) %>%
# summarize(MSE = ___)
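The blanks in the last pipeline can be filled the same way as before; because each run refits boston.tree on a fresh 150-row sample, the test MSE will move around as well:

test %>%
  mutate(prediction = predict(boston.tree, newdata = test)) %>%
  summarize(MSE = mean((medv - prediction)^2))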
Decision trees are known to have very high variance: different training sets can produce significantly different trees. Bagging (bootstrap aggregating) addresses this by averaging the predictions of many trees. Ideally we would take many training sets from the population, build a separate prediction model on each, and average the resulting predictions.

Since we generally have only one training set, we bootstrap instead: we take repeated samples, with replacement, from that single training set. We grow B regression trees on the B bootstrapped training sets and average their predictions. These trees are grown deep and not pruned, so each individual tree has high variance but low bias; averaging the B trees reduces the variance and leads to a lower test error rate than a single tree. Typically B is set to a large number such as 100.

Bagging improves prediction accuracy at the expense of interpretability, since the averaged model can no longer be drawn as a single tree. The importance of each predictor can still be assessed by recording the total amount by which the residual sum of squares (RSS) decreases due to splits over that predictor, averaged over all B trees; a large value indicates an important predictor.
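To make the averaging concrete, here is a rough sketch (not part of the original exercise) that bags B = 100 trees by hand: each tree is fit to a bootstrap resample of the training set, and the B vectors of test-set predictions are averaged before computing the MSE.

B <- 100
preds <- map(1:B, function(b) {
  boot <- slice_sample(train, n = nrow(train), replace = TRUE)  # bootstrap resample
  fit <- tree(medv ~ ., data = boot)                            # unpruned tree on the resample
  predict(fit, newdata = test)                                  # test-set predictions
})
bagged <- reduce(preds, `+`) / B          # average the B prediction vectors
mean((test$medv - bagged)^2)              # test MSE of the bagged ensemble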
Bagging with randomForest
# install.packages("randomForest")   # run once in the console if the package is not installed
# library(randomForest)
# set.seed(1234)
# bag.boston <- randomForest(medv ~ ., data = train, mtry = 12, importance = T)
#
# bag.boston
#
# importance(bag.boston)
mtry = 12 means all 12 predictors should be considered for each split in the tree.
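Once bag.boston has been fit, randomForest's varImpPlot() gives a graphical version of the same importance measures:

varImpPlot(bag.boston)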
- Find the mean squared error for the bagging approach using the test data.
# test %>%
# mutate(prediction = predict(___, newdata = test)) %>%
# summarize(MSE = ___)
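One possible completion, assuming the fitted bag.boston from the chunk above:

test %>%
  mutate(prediction = predict(bag.boston, newdata = test)) %>%
  summarize(MSE = mean((medv - prediction)^2))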