library(tidyverse)
# generate_data <- function(n) {
# ___
# }
3.3 Bayes Classifier and Multinomial Logit Model
In this assignment, you will:
- Understand the Bayes classifier and Bayes error rate.
- Simulate data and compute the Bayes error rate.
- Compare the Bayes classifier to a multinomial logit model in a classification task.
Bayes Classifier and Bayes Error Rate
The Bayes classifier answers the question, “How well could we perform a classification task in theory?” That is, even if we knew the true data generating process (someone with explanatory variable \(X = x\) has a known probability \(P(Y = 1 | X = x)\) of belonging to class 1), we would still make errors because of the inherent randomness in the outcome: the outcome with the highest probability isn’t always the one that occurs. The Bayes classifier minimizes the probability of misclassification by always predicting the class with the highest (conditional) probability given the data. For example, if \(P(Y = 1 | X = x) = 0.7\), the Bayes classifier predicts class 1, but there is still a 30% chance the outcome was not class 1. The Bayes error rate is the minimum possible error rate achievable by any classifier, reflecting the irreducible uncertainty in the data. It serves as a theoretical benchmark for evaluating the performance of practical classification methods.
- Simulate Data. Fill in the blanks to create a function that generates simulated data where the probability of walking to work, biking to work, or driving to work depends on the temperature outside. Guidelines:
  - Create a function `generate_data` that takes as an argument how many observations you want to generate, and returns a tibble with n observations and 4 variables: temp, walk_prob, bike_prob, and choice.
  - `temp` should be random uniform with a minimum of 18 and a maximum of 100.
  - `bike_prob` should use `case_when`: when the temperature is 70 degrees or above, you bike with probability (w.p.) 0.5; when the temperature is 45 degrees or above, you bike w.p. 0.3; otherwise, you bike w.p. 0.1.
  - Generate `walk_prob` in a similar way: when the temperature is 70 degrees or above, you walk w.p. 0.3; when the temperature is 45 degrees or above, you walk w.p. 0.6; otherwise, you walk w.p. 0.1.
  - Use `map2_chr` to generate `choice` based on `walk_prob` and `bike_prob` (`drive_prob` is 1 - walk_prob - bike_prob). `map2_chr` takes as arguments a vector `.x`, another vector `.y`, and a function `.f`, which gets applied to each element of `.x` and `.y` like this: `.f(.x[1], .y[1]), .f(.x[2], .y[2]), ..., .f(.x[n], .y[n])`. In this way, choice can depend on `walk_prob` and `bike_prob` using the function `sample()`.
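One possible sketch under the guidelines above (the class labels "walk", "bike", and "drive" are my assumption, not given in the assignment):

```r
library(tidyverse)

# A sketch of generate_data following the guidelines above; the
# labels "walk", "bike", "drive" are assumed names for the choices.
generate_data <- function(n) {
  tibble(
    temp = runif(n, min = 18, max = 100),
    walk_prob = case_when(
      temp >= 70 ~ 0.3,
      temp >= 45 ~ 0.6,
      TRUE       ~ 0.1
    ),
    bike_prob = case_when(
      temp >= 70 ~ 0.5,
      temp >= 45 ~ 0.3,
      TRUE       ~ 0.1
    ),
    # sample() draws one choice per row using that row's probabilities
    choice = map2_chr(
      walk_prob, bike_prob,
      ~ sample(c("walk", "bike", "drive"), size = 1,
               prob = c(.x, .y, 1 - .x - .y))
    )
  )
}
```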
- Create a data set `commute` and answer the question: how many times did someone choose to walk, bike, or drive? Make sure observations for each of these choices exist; otherwise, you should edit your `generate_data` function.
# commute <- generate_data(1000)
# commute %>%
# ___(___)
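As one hedged illustration of the kind of summary asked for (using a stand-in `commute` so the snippet runs on its own; `count()` is one of several dplyr verbs that would work):

```r
library(tidyverse)

# Stand-in commute tibble; in the assignment it comes from generate_data()
commute <- tibble(
  choice = sample(c("walk", "bike", "drive"), 1000, replace = TRUE)
)

# count() tallies how many times each choice occurs
commute %>%
  count(choice)
```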
- Visualize `choice` against `temp` (use `geom_jitter` or `geom_violin` or something similar).
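A minimal sketch of one such plot, with stand-in data so it is self-contained (in the assignment, use the `commute` tibble you built):

```r
library(tidyverse)

# Stand-in data; replace with the commute tibble from generate_data()
commute <- tibble(
  temp   = runif(300, 18, 100),
  choice = sample(c("walk", "bike", "drive"), 300, replace = TRUE)
)

# Jittering spreads the discrete choices vertically so points
# at similar temperatures don't overplot each other
p <- commute %>%
  ggplot(aes(x = temp, y = choice)) +
  geom_jitter(height = 0.2, alpha = 0.5)
p
```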
- Compute the Bayes Error Rate. The Bayes classifier predicts the choice with the greatest probability. Compute the Bayes error rate, which is the average misclassification error when using the true probabilities. Reflect: what does the Bayes error rate tell you about the difficulty of this classification problem? Hint: I used `pmax()` to find the maximum of walk_prob, bike_prob, and drive_prob for each observation.
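A sketch of that computation, assuming the probabilities from the simulation step: on any row, the Bayes classifier errs with probability one minus that row's largest class probability, so averaging `1 - pmax(...)` gives the Bayes error rate. The stand-in data below rebuilds the probability columns that `generate_data()` would already carry.

```r
library(tidyverse)

# Rebuild the true probabilities from the simulation guidelines
commute <- tibble(temp = runif(10000, 18, 100)) %>%
  mutate(
    walk_prob  = case_when(temp >= 70 ~ 0.3, temp >= 45 ~ 0.6, TRUE ~ 0.1),
    bike_prob  = case_when(temp >= 70 ~ 0.5, temp >= 45 ~ 0.3, TRUE ~ 0.1),
    drive_prob = 1 - walk_prob - bike_prob,
    # probability of the class the Bayes classifier would predict
    max_prob   = pmax(walk_prob, bike_prob, drive_prob)
  )

# Average chance the most likely class does NOT occur
bayes_error <- mean(1 - commute$max_prob)
bayes_error
```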
Train and Test a Multinomial Logistic Regression
5a) Create a training data set and a test data set with 1000 observations each. I’ll give you the code to do this: just uncomment and run.
# set.seed(1234)
# train <- generate_data(1000)
# test <- generate_data(1000)
5b) Use the training data set to fit the multinomial logit and interpret the results. (You'll need to run `install.packages("nnet")` once, but don't include any lines that install packages in your document, as that will prevent rendering of the document to html: you want to install a package once, not over and over.)
# library(nnet)
#
# model <- train %>%
# multinom(choice ~ temp, data = .)
#
# print(model)
Interpretation: To talk about the probability of driving instead of the log odds of driving, we can use the formula \(\frac{\exp(\eta_{drive})}{1 + \exp(\eta_{drive}) + \exp(\eta_{walk})}\), where \(\eta_{drive}\) is equal to \(\beta_0 + \beta_1 \text{temp}\) using the coefficients for driving. So the estimated probability you drive when the temperature is 30 is ___ and the estimated probability you drive when the temperature is 80 is ___.
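As a sketch of that arithmetic (the coefficient values below are made up for illustration; in practice pull them from the "drive" and "walk" rows of `coef(model)`, with bike as the baseline class):

```r
# Hypothetical coefficients for illustration only -- replace with coef(model)
b_drive <- c(-2.5, 0.01)   # assumed (intercept, temp slope) for drive
b_walk  <- c(1.0, -0.02)   # assumed (intercept, temp slope) for walk

# P(drive | temp) with bike as the baseline category
p_drive <- function(temp) {
  eta_drive <- b_drive[1] + b_drive[2] * temp
  eta_walk  <- b_walk[1]  + b_walk[2]  * temp
  exp(eta_drive) / (1 + exp(eta_drive) + exp(eta_walk))
}

p_drive(30)
p_drive(80)
```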
5c) Compute the mean error rate for the multinomial logit model and compare it to the Bayes error rate. Hint: use `model` with `predict` and `newdata`.
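A sketch of 5c end to end, assuming the version of `generate_data` sketched in the simulation step (the choice labels and probabilities there are my assumptions; `trace = FALSE` just silences the optimizer's progress output):

```r
library(tidyverse)
library(nnet)

# Same assumed generate_data as in the simulation step
generate_data <- function(n) {
  tibble(
    temp = runif(n, 18, 100),
    walk_prob = case_when(temp >= 70 ~ 0.3, temp >= 45 ~ 0.6, TRUE ~ 0.1),
    bike_prob = case_when(temp >= 70 ~ 0.5, temp >= 45 ~ 0.3, TRUE ~ 0.1),
    choice = map2_chr(walk_prob, bike_prob,
                      ~ sample(c("walk", "bike", "drive"), size = 1,
                               prob = c(.x, .y, 1 - .x - .y)))
  )
}

set.seed(1234)
train <- generate_data(1000)
test  <- generate_data(1000)

model <- multinom(choice ~ temp, data = train, trace = FALSE)

# Predicted class for each test row, then the share misclassified
pred <- predict(model, newdata = test)
logit_error <- mean(pred != test$choice)
logit_error
```

Since the model only sees temp while the Bayes classifier uses the true probabilities, `logit_error` should sit at or slightly above the Bayes error rate.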
- Imperative programming practice: Complete Project Euler problem 3: largest prime factor.