2  Maximum Likelihood

2.1 Introduction

If you try to fit a model of discrete choice using the method of least squares, you’ll end up with a linear probability model. But the linear probability model has two big drawbacks I talked about in the previous chapter (predicted probabilities aren’t guaranteed to be between 0 and 1; estimated marginal effects are constant).

To fit a logit instead (which solves those two issues), you need a different method of estimation. That method is called Maximum Likelihood Estimation (MLE) and it is the subject of this chapter.

In this chapter I’ll introduce MLE with an intuitive example, then I’ll show how MLE is used to estimate the parameters of a logit model.

2.1.1 Notation

I’ll use two more Greek symbols in this chapter, listed here for convenience:

| symbol     | pronunciation |
|------------|---------------|
| \(\mu\)    | mu            |
| \(\sigma\) | sigma         |

2.2 MLE Example

2.2.1 PDFs

Recall from your probability class that for a continuous random variable X, the probability that X equals any single number (like 0) is zero. But the probability that X falls between any two numbers (like between 0 and 0.5) may be nonzero, and it can be calculated as the area under the probability density function (PDF). Here’s the PDF of the standard normal distribution; the area in red is the probability that a variable distributed N(0, 1) falls between 0 and 0.5:

library(tidyverse)

ggplot() +
  # Draw the standard normal PDF
  stat_function(fun = dnorm, geom = "line") +
  # Shade the area under the curve between 0 and 0.5
  stat_function(fun = dnorm, geom = "area",
                fill = "red", xlim = c(0, .5)) +
  xlim(-5, 5)

The total area under the curve for a PDF is equal to 1.
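You can verify both facts numerically with R’s built-in integrate() and pnorm() (a quick sanity check of my own, not something the rest of the chapter depends on):

# Total area under the standard normal PDF: numerically 1
integrate(dnorm, lower = -Inf, upper = Inf)

# The red area between 0 and 0.5, computed two ways
integrate(dnorm, lower = 0, upper = 0.5)
pnorm(0.5) - pnorm(0)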

Exercise 1: Let X be a random variable distributed N(0, 1). What is the PDF of X evaluated at 0? Use two methods to find the answer and check your work: one method is to use dnorm() in R, and the other is to use the PDF of the normal distribution, \(f(X = x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left ( \frac{(x - \mu)^2}{-2 \sigma^2}\right)\), where the mean \(\mu = 0\) and the variance \(\sigma^2 = 1\).

Exercise 2: Show that .0965 is the joint probability density of two independent random variables \(X_1\) and \(X_2\), both distributed N(0, 1), evaluated at \(X_1 = 0\) and \(X_2 = 1\). No need to use two methods like in the previous problem: you can just use dnorm().

2.2.2 Maximum Likelihood

The way maximum likelihood estimation (MLE) works is very intuitive, and hopefully I will demonstrate that in this example.

Now suppose you have a random variable \(X\) that is distributed normally with an unknown mean \(\mu\) and a variance of 1: \(X\) ~ \(N(\mu, 1)\).

You have a sample of 2 observations of X: on the first draw X took on 4, and on a second draw (a “re-roll”) it took on 6. What is the most likely value for \(\mu\)? If you guessed 5, you’d be correct: the maximum likelihood estimate of \(\mu\) turns out to be the sample mean. Here’s how to rigorously prove that \(\mu = 5\) is the value that maximizes the likelihood of seeing the data \((4, 6)\):

The density function for the normal distribution with mean \(\mu\) and variance \(\sigma^2\) is this:

\[f(X) = \frac{1}{\sqrt{2 \pi \sigma^2}} \ \exp \left (\frac{(X - \mu)^2}{-2 \sigma^2} \right )\]

And given \(\sigma^2 = 1\) for the problem, I’ll simplify:

\[f(X) = \frac{1}{\sqrt{2 \pi}} \ \exp \left (\frac{(X - \mu)^2}{-2} \right )\]

The density of the normal distribution evaluated at 4 is this:

\[f(X = 4) = \frac{1}{\sqrt{2 \pi}} \ \exp \left (\frac{(4 - \mu)^2}{-2} \right )\]

But we don’t just see X taking on 4; we also see X taking on 6. So the likelihood function \(L\) we need is the joint density of these two independent draws of X, evaluated at 4 and 6:

\[\begin{align} L(4, 6) &= f(X = 4) * f(X = 6) \\ &= \frac{1}{\sqrt{2 \pi}} \ \exp \left (\frac{(4 - \mu)^2}{-2} \right ) * \frac{1}{\sqrt{2 \pi}} \ \exp \left (\frac{(6 - \mu)^2}{-2} \right ) \\ &= \frac{1}{2 \pi} \exp \left ( \frac{16 - 8 \mu + \mu^2 + 36 - 12 \mu + \mu^2}{-2} \right) \\ &= \frac{1}{2 \pi} \exp \left ( \frac{52 - 20 \mu + 2 \mu^2}{-2} \right ) \\ &= \frac{1}{2 \pi} \exp \left ( - 26 + 10 \mu - \mu^2 \right ) \end{align}\]

Maximum likelihood estimation means choosing the parameter \(\mu\) that maximizes the likelihood function. I could take first order conditions directly, but the exp() makes that messy. Instead I’ll use a trick: because log() is a monotonically increasing function, the \(\mu\) that maximizes the log of the likelihood function is always the same as the \(\mu\) that maximizes the likelihood function itself. So I can work with the log likelihood function \(LL\) instead:

\[\begin{align} LL &= \log\left(\frac{1}{2 \pi} \exp \left ( - 26 + 10 \mu - \mu^2 \right )\right) \\ &= \log\left(\frac{1}{2 \pi}\right) + \log\left(\exp\left(-26 + 10 \mu - \mu^2\right)\right) \\ &= \log\left(\frac{1}{2 \pi}\right) - 26 + 10 \mu - \mu^2 \end{align}\]

The log likelihood has a much simpler form than the likelihood function.

To complete the maximum likelihood estimation, recall we need to choose \(\mu\) to maximize the log likelihood function:

\[\max_\mu \left\{ \log\left(\frac{1}{2 \pi}\right) - 26 + 10 \mu - \mu^2 \right\}\]

Taking first order conditions: at the maximum, the derivative of the log likelihood function with respect to \(\mu\) equals 0:

\[10 - 2 \mu = 0\]

\[\mu = 5\]

Which was our intuitive answer at the beginning of this section.
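You can also confirm this numerically. Here’s a small sketch of my own (ll_mu and ll_mu2 are helper names I’m introducing, not anything used elsewhere in this chapter) that maximizes the log likelihood with optimize() and finds the peak at 5:

# Log likelihood of the data (4, 6) as a function of mu,
# using the simplified expression derived above
ll_mu <- function(mu) log(1 / (2 * pi)) - 26 + 10 * mu - mu^2

# Equivalent: the sum of the two log densities
ll_mu2 <- function(mu) sum(dnorm(c(4, 6), mean = mu, sd = 1, log = TRUE))

# Numerically search for the maximizing mu
optimize(ll_mu, interval = c(0, 10), maximum = TRUE)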

Exercise 3: Again let \(X\) ~ \(N(\mu, 1)\). Suppose X was observed to take on the values 0 and 2. Show, using the method of maximum likelihood, that the most likely value for \(\mu\) is \(1\). Make sure to write down the likelihood function, the log likelihood function, and the first order conditions.

2.3 Logit Estimation

In this section, I’ll discuss how functions like glm() use the method of maximum likelihood to estimate logits.

Suppose for the sake of simplicity we have just 6 data points and we want to fit a logit. The first column is the discount a person was offered to buy a Big Mac, and the second column is their (binary) choice to buy it or not: 0 indicates they did not buy it and 1 indicates they did. So the first person, who was not offered a discount (0% off), chose not to buy the Big Mac. The second person, who was offered 20% off, also chose not to buy it. But the third person, offered 40% off, bought it:

| Discount | Big Mac Purchased |
|----------|-------------------|
| 0% off   | 0                 |
| 20% off  | 0                 |
| 40% off  | 1                 |
| 60% off  | 1                 |
| 80% off  | 0                 |
| 100% off | 1                 |

Here’s a visualization of the data points and the logit we’ll estimate in this section using MLE. It seems like the higher the discount, the more likely it is that the person buys a Big Mac.

library(tidyverse)

tibble(
    discount = seq(0, 100, by = 20),
    purchase = c(0, 0, 1, 1, 0, 1)
) %>%
    ggplot(aes(x = discount, y = purchase)) +
    geom_point(size = 3) +
    # Overlay the fitted logit curve (estimated via glm under the hood)
    stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial))

The likelihood function L is the joint probability of observing these 6 data points:

\[\begin{align} L = &f(y = 0 | x = 0) * f(y = 0 | x = 20) * f(y = 1 | x = 40) * f(y = 1 | x = 60) \ * \\ &f(y = 0 | x = 80) * f(y = 1 | x = 100) \end{align}\]

Use the log likelihood trick again:

\[\begin{align} LL = \ &\log(f(y = 0 | x = 0)) + \log(f(y = 0 | x = 20)) + \log(f(y = 1 | x = 40)) + \\ &\log(f(y = 1 | x = 60)) + \log(f(y = 0 | x = 80)) + \log(f(y = 1 | x = 100)) \end{align}\]

Because we’re using a logit, we assume (from chapter 1, equation 1.1) that the probability of buying a Big Mac given a certain discount is \(f(y = 1 | x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}\), and the probability of not buying one is \(f(y = 0 | x) = 1 - f(y = 1 | x) = \frac{1}{1 + \exp(\beta_0 + \beta_1 x)}\). Making those substitutions gives us:

\[\begin{align} LL = \ &\log \left (\frac{1}{1 + \exp(\beta_0)}\right) + \log\left(\frac{1}{1 + \exp(\beta_0 + 20 \beta_1)}\right) + \log\left(\frac{\exp(\beta_0 + 40 \beta_1)}{1 + \exp(\beta_0 + 40 \beta_1)}\right) + \\ &\log\left(\frac{\exp(\beta_0 + 60 \beta_1)}{1 + \exp(\beta_0 + 60 \beta_1)}\right) + \log\left(\frac{1}{1 + \exp(\beta_0 + 80 \beta_1)}\right) + \log\left(\frac{\exp(\beta_0 + 100 \beta_1)}{1 + \exp(\beta_0 + 100 \beta_1)}\right) \end{align}\]
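As an aside (a note of my own, not part of the derivation): R has this logit probability built in as plogis(), which computes \(\frac{\exp(z)}{1 + \exp(z)}\) for any \(z\). So you can check the substitution above for any parameter values, for example:

# plogis(z) returns exp(z) / (1 + exp(z)), the logit probability that y = 1
b0 <- -1
b1 <- .05

exp(b0 + b1 * 40) / (1 + exp(b0 + b1 * 40))  # by hand
plogis(b0 + b1 * 40)                         # same value via plogis()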

This is as far as we’ll go by hand. Let’s go to R and write a function LL that takes a vector of length 2, representing the logit parameters \(\hat{\beta_0}\) and \(\hat{\beta_1}\), and returns the log likelihood of seeing the 6 data points we observed, given those parameter values.

LL <- function(params) {
  
  b0 <- params[1]
  b1 <- params[2]
  
  # Sum of the six log probabilities from the LL expression above
  log(1 / (1 + exp(b0))) +
    log(1 / (1 + exp(b0 + b1 * 20))) +
    log(exp(b0 + b1 * 40) / (1 + exp(b0 + b1 * 40))) +
    log(exp(b0 + b1 * 60) / (1 + exp(b0 + b1 * 60))) +
    log(1 / (1 + exp(b0 + b1 * 80))) +
    log(exp(b0 + b1 * 100) / (1 + exp(b0 + b1 * 100)))
}

# Testing our function works:

LL(c(1, 2))
[1] -203.3133

Don’t be surprised that the log likelihood is negative: it’s a sum of logs of probabilities, and because every probability is between 0 and 1, each of those logs is negative.
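A quick illustration of that point (with some arbitrary probabilities I made up):

# The log of any probability strictly between 0 and 1 is negative
log(c(0.9, 0.5, 0.1))
[1] -0.1053605 -0.6931472 -2.3025851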

Exercise 4: What is the log likelihood of seeing these 6 data points given \(\beta_0 = -1\) and \(\beta_1 = .05\)? Are these values for the parameters more likely or less likely than \(\beta_0 = 1\) and \(\beta_1 = 2\)? You can use my function LL to solve this problem.
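As another aside, here’s a more compact way to write the same log likelihood (a sketch of my own; LL2, discount, and purchase are names I’m introducing here). It uses plogis() with log.p = TRUE, which returns the log of the logit probability directly and keeps the calculation in log space, so nothing rounds off to 0 or 1 for extreme parameter values:

discount <- seq(0, 100, by = 20)
purchase <- c(0, 0, 1, 1, 0, 1)

LL2 <- function(params) {
  # Linear index b0 + b1 * x for each observation
  z <- params[1] + params[2] * discount
  # log f(y = 1 | x) when purchase is 1, log f(y = 0 | x) when purchase is 0
  sum(purchase * plogis(z, log.p = TRUE) +
      (1 - purchase) * plogis(z, lower.tail = FALSE, log.p = TRUE))
}

LL2(c(1, 2))  # agrees with LL(c(1, 2)) above

Because it works on vectors of data rather than hard-coding each observation, the same function would work unchanged for a data set of any size.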

So we’ve defined a function LL that calculates the log likelihood of seeing the data given any values of the logit parameters. The next step is to choose the parameter values that maximize this log likelihood function. We’re not going to do this by hand; instead I want to show you a function in R called maxLik() that handles it for us. It’s from the package of the same name, and it takes two arguments: the log likelihood function you want to maximize and a starting point. It searches out from that starting point and returns the first local maximum it finds.

Here I use maxLik() to finish the logit estimation:

library(maxLik)

maxLik(LL, start = c(0, 0)) %>%
    broom::tidy()
# A tibble: 2 × 5
  term  estimate std.error statistic p.value
  <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 1      -1.68      1.80      -0.934   0.350
2 2       0.0337    0.0308     1.10    0.273

So maxLik finds that LL is maximized when \(\hat{\beta_0} = -1.68\) and \(\hat{\beta_1} = .0337\). Compare this to the glm() estimates below. glm() works a lot like lm(): you pipe your data in, provide a formula y ~ x, and use family = "binomial" to estimate a simple logit.

tibble(
    discount = seq(0, 100, by = 20),
    purchase = c(0, 0, 1, 1, 0, 1)
) %>%
    glm(purchase ~ discount, data = ., family = "binomial") %>%
  broom::tidy(conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)  -1.68      1.80      -0.937   0.349  -6.57       1.39 
2 discount      0.0337    0.0307     1.10    0.273  -0.0177     0.118

2.4 Programming Exercises

Exercise 5: Complete Project Euler problem 2: Even Fibonacci numbers.

2.5 References

Dougherty (2016) Chapter 10: Binary Choice and Limited Dependent Variable Models, and Maximum Likelihood Estimation

Henningsen and Toomet (2011) R package maxLik