5.4 Random Forests
Decision trees have a key weakness: they exhibit high variance, meaning small changes in the training data can produce very different trees. Bagging (Bootstrap Aggregating) partially addresses this issue by creating a large number of decision trees from different bootstrap samples of the training data and then averaging their predictions.
Random forests reduce the variance even further by taking the bagging approach and adding one tweak. To fit a random forest, start the same way as bagging: build a large number of decision trees on bootstrapped training samples. But each time a split in a tree is considered, a random sample of only \(m = \sqrt{p}\) predictors is chosen as candidates for the split, where \(p\) is the total number of predictors. A fresh sample of \(m\) predictors is taken at each split.
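For concreteness, here is a minimal sketch of this procedure using the randomForest package. The simulated data and all object names below are illustrative assumptions, not part of the exercise:

```r
# Minimal sketch: fit a random forest with m = sqrt(p) candidate
# predictors per split. The simulated data is purely illustrative.
library(randomForest)

set.seed(1)
n <- 500; p <- 9
x <- as.data.frame(matrix(rnorm(n * p), n, p))
y <- 2 * x$V1 + 0.5 * x$V2 + rnorm(n)   # V1 is a strong predictor, V2 moderate

# Note: for regression, randomForest() actually defaults to mtry = p/3;
# here mtry is set explicitly to sqrt(p) to match the rule described above.
rf <- randomForest(x = x, y = y, mtry = floor(sqrt(p)), ntree = 500)
rf
```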
How does this help with the variance issue? Suppose, for example, that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then with bagging, most or all of the trees will use this strong predictor in the top split. Consequently, all the bagged trees will look quite similar, so their predictions will be highly correlated. Averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities. Random forests overcome this problem by forcing each split to consider only a subset of the predictors: on average \((p - m)/p\) of the splits will not even have the strong predictor available, which gives the other predictors a chance and decorrelates the trees. This is why bagging may not lead to a substantial reduction in variance, but random forests will.
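The decorrelation effect can be checked directly by comparing the average pairwise correlation of individual tree predictions under bagging (mtry = p) and under a random forest (mtry = sqrt(p)). A sketch, reusing the simulated x, y, and p from above:

```r
# Sketch: average pairwise correlation of per-tree predictions.
# Reuses x, y, p from the previous sketch; all names are illustrative.
bag <- randomForest(x = x, y = y, mtry = p, ntree = 200)              # bagging
rf  <- randomForest(x = x, y = y, mtry = floor(sqrt(p)), ntree = 200) # random forest

mean_tree_cor <- function(fit, newdata) {
  preds <- predict(fit, newdata, predict.all = TRUE)$individual  # one column per tree
  cc <- cor(preds)
  mean(cc[lower.tri(cc)])  # average over distinct pairs of trees
}

x_new <- as.data.frame(matrix(rnorm(200 * p), 200, p))
mean_tree_cor(bag, x_new)  # expected higher: trees share V1's top splits
mean_tree_cor(rf,  x_new)  # expected lower: per-split sampling decorrelates trees
```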
Fit a Random Forest
Continuing from 5.3, take the boston data set. Use randomForest to fit a random forest, with mtry set to 4 (a random sample of just 4 predictors is considered for each split in each tree).
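A sketch of the fit. The names boston_train, boston_test, and the response medv are assumptions carried over from a typical 5.3 split; substitute your own names from 5.3:

```r
# Sketch, assuming boston was split into boston_train / boston_test in 5.3
# and that the response column is medv; adjust to match your 5.3 objects.
library(randomForest)

set.seed(1)
rf_boston <- randomForest(medv ~ ., data = boston_train,
                          mtry = 4, importance = TRUE)
rf_boston

# Variable importance, for questions A-C below:
importance(rf_boston)
varImpPlot(rf_boston)
```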
A. Which explanatory variables were found to be most important according to the random forest?
B. Which explanatory variables were not found to be important?
C. Compare your answers for A and B to the explanatory variable importance you found with bagging (5.3). Do you find the same variables are most and least important?
D. Which suburb in the test data set is predicted to have the highest median value? Why is it predicted to have such a high median value?
E. Which suburb is predicted to have the lowest median value? Why is it predicted to have such a low median value?
F. Find the mean squared error (MSE) for the test data set, and report the MSE for the regression tree (5.3), the bagging approach (5.3), and the random forest; a sketch of the MSE computation follows this list. Does bagging offer improvements over using a single regression tree, and do random forests offer improvements over bagging?
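A sketch of the test-set MSE computation for question F, again assuming the boston_test and medv placeholder names from above:

```r
# Test-set MSE for the random forest (boston_test and medv are the
# placeholder names assumed above).
pred_rf <- predict(rf_boston, newdata = boston_test)
mse_rf  <- mean((boston_test$medv - pred_rf)^2)
mse_rf
```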