3.4 KNN for Classification

Part 1: Reflecting on 3.3

  1. In assignment 3.3, we simulated data on commuting choice based on daily temperature. We found that the Bayes error rate (the theoretical minimum error for the classification task) was ____.
  2. We also found that the multinomial logit model had an error rate of ____.

Part 2: KNN for the commuting choice classification task

Data Generation

  1. We’ll use the same data generation function from assignment 3.3 to simulate commuting choices based on temperature. The function generates probabilities for choosing to walk, bike, or drive, depending on the temperature. One possible version is sketched after the skeleton below.
library(tidyverse)

# generate_data <- function(n) {
#    ___
# }
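For reference, here is a minimal sketch of one possible generator. It assumes uniformly drawn temperatures and softmax-style choice probabilities; the utility coefficients below are placeholders I made up, and the exact function from assignment 3.3 may differ.

generate_data <- function(n) {
  tibble(temp = runif(n, min = 0, max = 35)) %>%
    mutate(
      # placeholder utilities: walking and biking grow more attractive with temperature
      u_walk  = -4 + 0.20 * temp,
      u_bike  = -3 + 0.15 * temp,
      u_drive = 0,
      choice = pmap_chr(
        list(u_walk, u_bike, u_drive),
        function(w, b, d) {
          # softmax over the three utilities, then draw one choice
          p <- exp(c(walk = w, bike = b, drive = d))
          sample(names(p), size = 1, prob = p / sum(p))
        }
      )
    ) %>%
    select(temp, choice)
}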

Create Training and Test Sets

  1. Generate a training set with 1,000 observations and a test set with another 1,000 observations.
# set.seed(1234)
# train <- generate_data(1000)
# test <- generate_data(1000)

Implement KNN from Scratch

Look back at assignment 2.8 to recall how we implemented KNN from scratch.

  1. First, you’ll need to create a helper function most_frequent that takes a character vector and returns the most frequent class. For example, most_frequent(c("walk", "drive", "walk")) should return “walk”. If there are ties, your function should choose one at random. So most_frequent(c("bike", "drive", "drive", "bike")) should return “bike” sometimes and “drive” other times.

To do this, I used functions like tibble, count, slice_max, slice_sample, and pull (which takes a column from a tibble and returns it as a vector). One possible implementation is sketched after the skeleton below.

# most_frequent <- function(character_vector) {
#   ___
# }
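Here is a minimal sketch along the lines of that hint. The tie-breaking relies on slice_max() keeping every class tied for the top count and slice_sample() then choosing one of those rows at random.

most_frequent <- function(character_vector) {
  tibble(class = character_vector) %>%
    count(class) %>%                     # tally each class
    slice_max(n, with_ties = TRUE) %>%   # keep all classes tied for the top count
    slice_sample(n = 1) %>%              # break ties at random
    pull(class)                          # return the winning class as a character vector
}

As a quick check, most_frequent(c("walk", "drive", "walk")) should always return "walk", while repeated calls on the tied example should sometimes return "bike" and sometimes "drive".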

Applying KNN

  1. Apply KNN to the test set, using the training set to find the nearest neighbors. One way to fill in the prediction step is sketched after the skeleton below.
# k <- ___

# knn_estimates <- test %>%
#   mutate(
#     yhat = map_chr(
#       pull(test, temp),
#       function(temp_test) {
#         ___
#       }
#     )
#   )
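For reference, here is a minimal sketch of the prediction step, assuming the most_frequent() helper from above and a training set named train. The helper name knn_predict is mine, not part of the assignment; it ranks the training points by distance in temperature, keeps the k nearest, and takes a majority vote.

knn_predict <- function(temp_test, train, k) {
  train %>%
    mutate(dist = abs(temp - temp_test)) %>%       # distance in temperature
    slice_min(dist, n = k, with_ties = FALSE) %>%  # k nearest training points
    pull(choice) %>%
    most_frequent()                                # vote among the neighbors
}

# knn_estimates <- test %>%
#   mutate(yhat = map_chr(temp, knn_predict, train = train, k = k))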

# knn_estimates %>%
#   mutate(
#     error = if_else(choice == yhat, 0, 1)
#   ) %>%
#   summarize(knn_error = mean(error))

Answer:

Testing Different Values of K

  1. Test different values of k to see which one yields the lowest mean error. A looping sketch appears after the table.
k     error
10    ____
30    ____
50    ____
70    ____
90    ____
100   ____
120   ____
150   ____
200   ____
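One way to fill in the table, sketched under the assumption that train, test, and the knn_predict() helper above exist, is to map over the candidate values of k and compute the test error for each:

ks <- c(10, 30, 50, 70, 90, 100, 120, 150, 200)

k_errors <- tibble(k = ks) %>%
  mutate(
    error = map_dbl(k, function(k_val) {
      # test error for this value of k
      test %>%
        mutate(yhat = map_chr(temp, knn_predict, train = train, k = k_val)) %>%
        summarize(error = mean(choice != yhat)) %>%
        pull(error)
    })
  )

Note that each value of k requires a full pass over the test set, so larger grids take proportionally longer to run.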

Bias-Variance Tradeoff

The choice of k in KNN has a significant impact on the model’s performance. This is due to the bias-variance tradeoff, a fundamental concept in machine learning that describes the tension between two sources of error in predictive models:

  1. Bias: This is the error introduced by overly simplistic assumptions in the model. A high-bias model tends to underfit the data, meaning it fails to capture the underlying patterns. In KNN, a large value of k (e.g., k = 200) leads to a smoother decision boundary that may not capture the true structure of the data, resulting in high bias.

  2. Variance: This is the error introduced by the model’s sensitivity to small fluctuations in the training data. A high-variance model tends to overfit the data, meaning it captures noise as if it were a true pattern. In KNN, a small value of k (e.g., k = 1) leads to a highly flexible decision boundary that fits the training data very closely but may perform poorly on new, unseen data.

Example: K = 1 vs. K = 200

  • When k = 1, the model is very flexible and adapts to every data point in the training set. This results in low bias but high variance, as the model is highly sensitive to noise in the data.
  • When k = 200, the model is much less flexible and produces a smoother decision boundary. This results in low variance but high bias, as the model may oversimplify the true patterns in the data.

In our simulated dataset, neither extreme (k = 1 or k = 200) performs well. The best performance is achieved at an intermediate value of k, which balances bias and variance.
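A quick way to see the variance side of the tradeoff, again assuming the objects above: predict the training set with its own points as neighbors. At k = 1, each training point is (almost always) its own nearest neighbor, so the training error is near zero, even though the test error cannot fall below the Bayes error rate from Part 1.

train %>%
  mutate(yhat = map_chr(temp, knn_predict, train = train, k = 1)) %>%
  summarize(train_error_k1 = mean(choice != yhat))  # near zero: a sign of overfitting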

Part 3: Imperative Programming Practice

Complete Project Euler Problem 8: Largest Product in a Series.
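If you want a starting point, here is a minimal imperative sketch in R (a for loop rather than tidyverse verbs), shown on a short made-up digit string. The actual problem uses the 1,000-digit number from the Project Euler page and asks for the greatest product of thirteen adjacent digits.

largest_product <- function(digit_string, window) {
  digits <- as.integer(strsplit(digit_string, "")[[1]])  # split into single digits
  best <- 0
  for (i in seq_len(length(digits) - window + 1)) {
    product <- prod(digits[i:(i + window - 1)])  # product of the current window
    if (product > best) {
      best <- product
    }
  }
  best
}

largest_product("3675356291", 4)  # toy input, not the Project Euler number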