Maximum Likelihood and Logit

In the classwork on “Fitting a Linear Model”, we learned that least squares is a method for finding the line of best fit, where the estimate for the slope is \(\hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)}\) and the estimate for the y-intercept is \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). But least squares works for only one type of model: a linear model.

Another, more flexible method for estimating models is the method of maximum likelihood. It asks: which parameter values make the observed data most likely? We’ll explore this method in this assignment, using it twice: once to estimate the probability of success for a Bernoulli random variable, and once to estimate the parameters of a logistic regression.

But first, some warm-up questions.

Warm-up: key skills

Question 1

Consider the downward-facing parabola \(f(x) = -x^2 + 10x - 22\). Find the x value where the function is at its maximum by taking the first derivative and setting it equal to 0.

Taking the log of something turns exponents into multipliers and turns products into sums:

\[\log(x^5) = 5 \log(x)\]

\[\log(xy) = \log(x) + \log(y)\]
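These identities are easy to sanity-check numerically. A quick sketch in R (not part of the assignment; the values of x and y are arbitrary):

```r
x <- 3
y <- 7

# Exponents become multipliers: log(x^5) = 5 * log(x)
all.equal(log(x^5), 5 * log(x))   # TRUE

# Products become sums: log(x * y) = log(x) + log(y)
all.equal(log(x * y), log(x) + log(y))   # TRUE
```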

Question 2

Simplify \(\log(xy^3)\).

  1. \(3 \log(xy)\)
  2. \(3 \log(x) \log(y)\)
  3. \(3 \log(x) + 3\log(y)\)
  4. \(\log(x) + 3\log(y)\)

The derivative of \(\ln(x)\) is \(\frac{1}{x}\).

Question 3

Find the derivative of \(\log(1-x)\) (for this assignment, always assume \(\log\) is base \(e\)).

  1. \(1-x\)
  2. \(x-1\)
  3. \(\frac{1}{1-x}\)
  4. \(\frac{-1}{1-x}\)

Estimate the probability of success (Bernoulli random variable)

Here’s a simple example to illustrate how maximum likelihood works. It isn’t the fastest way to solve the problem, but it will walk you through each step of maximum likelihood estimation.

Suppose you have a coin that you suspect is not a fair coin: you flip it 20 times and get 15 heads and 5 tails. Let’s define the probability you get heads as \(p\), which makes the probability you get tails \(1-p\).

Question 4

Calculate the sample mean of success if a “heads” gives you 1 (success) and a “tails” gives you 0 (no success).

The likelihood of seeing 15 heads and 5 tails given probabilities \(p\) and \(1-p\) is the joint probability:

\[L(p) = p \times p \times p \times p \dots (1-p) \times (1-p) \times (1-p) \dots = p^{15} (1-p)^5\]

We’ll find the value of \(p\) where \(L\) reaches its maximum. But first, apply a log transformation to make the function simpler, turning products into sums. Taking the log of a function is a strictly increasing monotonic transformation, which means wherever \(L\) reaches its maximum, the log likelihood \(LL\) will also reach its maximum.
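A quick way to convince yourself of this (a sketch, not required for the assignment): evaluate both \(L\) and \(LL\) on a grid of \(p\) values and check that they peak at the same place.

```r
p <- seq(0.01, 0.99, by = 0.01)

L  <- p^15 * (1 - p)^5               # likelihood
LL <- 15 * log(p) + 5 * log(1 - p)   # log likelihood

# Both curves reach their maximum at the same value of p
p[which.max(L)]    # 0.75
p[which.max(LL)]   # 0.75
```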

Question 5

Simplify \(LL = \log(p^{15} (1-p)^5)\).

  1. \(LL = \log(p^{15}) + \log((1-p)^5)\)
  2. \(LL = 15 \log(p) + 5 \log(1-p)\)
  3. \(LL = 15 \log(p) + \log(1-p)\)
  4. \(LL = \log(p)^{15} + \log(1-p)^5\)

Now, find the value of \(p\) which maximizes the log likelihood by taking the derivative of \(LL\) and setting it equal to zero.

Question 6

Take the derivative of \(LL\) from question 5.

  1. \(\frac{15}{p} + \frac{5}{1-p}\)
  2. \(15p - 5(1-p)\)
  3. \(\frac{15}{p} - \frac{5}{1-p}\)
  4. \(15p + 5(1-p)\)

Question 7

Set the derivative from question 6 equal to zero and solve for the value for \(p\) which maximizes the log likelihood \(LL\).

Compare this to the sample mean from question 4.

Using maxLik() for the Bernoulli example

We can also use the function maxLik() from the package maxLik to do the optimization step numerically instead of analytically.

  1. Install maxLik: install.packages("maxLik")
  2. maxLik needs:
  • a log likelihood function (15 * log(p) + 5 * log(1 - p))
  • a starting value (where the numerical search begins)

We’ll start our search at p = 0.5: the probability of success (heads) is 50%.

library(tidyverse)
library(maxLik)

maxLik(
  logLik = function(p) {
    15 * log(p) + 5 * log(1 - p)
  },
  start = 0.5
)
Maximum Likelihood estimation
Newton-Raphson maximisation, 2 iterations
Return code 1: gradient close to zero (gradtol)
Log-Likelihood: -11.2467 (1 free parameter(s))
Estimate(s): 0.75 

Estimate the parameters of a logistic regression (“logit”)

Now we’ll use maximum likelihood for a more realistic modeling task: predicting a binary outcome.

We’ll use this Love Island dataset again:

love <- read_csv("https://raw.githubusercontent.com/cobriant/320data/master/love.csv") %>% 
  mutate(win = if_else(outcome %in% c("winner", "runner up", "third place"), 1, 0))

The problem with a linear probability model

In classwork 5, we took the Love Island data set and estimated a linear probability model. The linear probability model uses a straight line to predict the probability of \(win = 1\):

\[\hat{p}(x) = \beta_0 + \beta_1 x\]

That can be useful for intuition, but it has a serious problem: it can easily predict probabilities below 0 or above 1, which are not valid.
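To see the problem concretely, here is a sketch with made-up coefficients (not estimates from the Love Island data): a straight line happily extrapolates past both boundaries.

```r
# Hypothetical linear probability model: p-hat = -0.5 + 0.03 * age
b0 <- -0.5
b1 <- 0.03

age <- c(10, 30, 60)
b0 + b1 * age   # -0.2, 0.4, 1.3: two of these are not valid probabilities
```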

Here is the linear probability model and the logit side-by-side:

love %>%
  ggplot(aes(x = age, y = win)) +
  geom_jitter(width = .1, height = .1) +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "Linear Probability Model")

love %>%
  ggplot(aes(x = age, y = win)) +
  geom_jitter(width = .1, height = .1) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  labs(title = "Logit")

The logit curve is bounded by 0 and 1: it will never predict probabilities outside that range.

What is a Logit?

A logit model assumes the probability of success follows an S-shaped curve (linear in log odds, not in probability):

\[p(x) = \frac{1}{1 + e^{- (\beta_0 + \beta_1 x)}}\]

This function is called the logistic function. It has two big advantages:

  1. It is bounded by 0 and 1
  2. The effect of x on the probability of success can be large in the middle, and flatten out near probabilities of 0 and 1, which is realistic for many situations.

A rearrangement of the same model in terms of log odds:

\[\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x\]

So logistic regression is linear in the log-odds, even though it’s curved in probability.
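Both properties are easy to check numerically. A sketch with made-up coefficients (the values of b0 and b1 here are illustrative, not estimates):

```r
# Hypothetical coefficients, for illustration only
b0 <- -2
b1 <- 0.1

logistic <- function(x) 1 / (1 + exp(-(b0 + b1 * x)))

x <- seq(-100, 100, by = 25)
p <- logistic(x)

range(p)            # strictly inside (0, 1), even for extreme x

# Linear in the log odds: this recovers b0 + b1 * x
log(p / (1 - p))
```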

Question 8

Suppose a contestant has probability \(p = 0.2\) of winning. Compute their odds \(\frac{p}{1-p}\).

Question 9

Using the same \(p = 0.2\), compute the log-odds: \[\log\left(\frac{p}{1-p}\right)\]

Likelihood for logit

For each person \(i\), the model gives a predicted probability:

\[p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}\]

Since \(y_i\) is either 0 or 1 (the person either wins or loses), the likelihood contribution of observation \(i\) is:

\[p_i^{y_i} (1 - p_i)^{1-y_i}\]

So the full likelihood is just the product across all observations:

\[L(\beta_0, \beta_1) = \prod_i p_i^{y_i} (1 - p_i)^{1-y_i}\]

And the log likelihood is simply the sum:

\[LL(\beta_0, \beta_1) = \sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]\]

That’s the function we’ll maximize.

Estimation using maxLik

Step 1: Write a log-likelihood function

We need to:

  1. Take guesses for \(\beta_0\) and \(\beta_1\)
  2. Use them to compute predicted probabilities \(p_i\)
  3. Plug them into the log-likelihood formula
  4. Return the total log likelihood

Question 10

Fill in the blank so that the function outputs the log likelihood for a guess for \(\beta_0\) and \(\beta_1\).

# LL <- function(beta) {
#   b0 <- beta[1]
#   b1 <- beta[2]
# 
#   x <- love %>% pull(age)
#   y <- love %>% pull(win)
# 
#   p <- 1 / (1 + exp(-(b0 + b1 * x)))
# 
#   sum(___)
# }

Step 2: Maximize the log likelihood

Question 11

Fill in the blank to get maxLik to estimate \(\beta_0\) and \(\beta_1\) for us in the logit.

# maxLik(
#   logLik = __,
#   start = c(0, 0)
# )

Step 3: Interpreting the results

Question 12

What did you estimate \(\beta_0\) to be? What did you estimate \(\beta_1\) to be?

In a logit model, \(\beta_0\) is not the baseline probability and \(\beta_1\) is not the increase in probability when age increases by 1. Instead, they are in terms of log odds. To recover predicted probabilities, plug \(\beta_0\) and \(\beta_1\) back into the logistic function:

\[p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}\]
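For example, with made-up coefficients (substitute your own estimates from maxLik), the computation looks like this:

```r
# Hypothetical coefficients -- replace with your maxLik estimates
b0 <- -4
b1 <- 0.1

age <- c(20, 30, 40)
p_hat <- 1 / (1 + exp(-(b0 + b1 * age)))
p_hat   # roughly 0.12, 0.27, 0.50
```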

Question 13

Find the predicted probability of winning for contestants who are 0, 1, 28, 29, 50, and 51 years old.

Download this Assignment

Here’s a link to download this assignment.

Autograder

Here’s the autograder for this assignment.