RL vs IRL; McFadden’s Binary Choice

Part 1: RL vs IRL

Read the introduction of Ng and Russell (2000). Then answer these questions in your own words:

  1. What is the reinforcement learning problem?

  2. What is the inverse reinforcement learning problem?

  3. Where might inverse reinforcement learning be especially useful?

This Ng and Russell paper sparked a large literature in machine learning and AI focused on solving the inverse reinforcement learning problem. One approach became especially influential: maximum entropy IRL (2008). In 2021, a paper from roboticists at Carnegie Mellon demonstrated that maximum entropy IRL is a rediscovery of John Rust’s 1987 Nested Fixed Point algorithm. Rust’s paper will be the central focus of our Unit 3.

Part 2: McFadden’s Binary Choice

Rust’s story begins with his doctoral adviser, Daniel McFadden. McFadden won the Nobel Prize in Economics in 2000 for his development of the theory and methods for analyzing discrete choice. In this section, we’ll talk about a small piece of this work: bridging the gap between microeconomic binary choice and the logistic function.

A discrete choice is a choice among distinct alternatives. The choice could be binary (“buy it” or “don’t”), or there could be many alternatives (“buy nothing”, “buy a hamburger”, “buy 2 hamburgers”, “buy spaghetti”, etc.). In this section, I’ll make our lives simpler by focusing only on binary choices: the agent can choose option A or option B; they can’t choose both, and they can’t choose neither.

Discrete choices are in contrast to continuous choices, which are measured on a continuous scale (how many ounces of spaghetti would you like?). A variable that’s discrete is counted, while a variable that’s continuous is measured.

Return of the Logit

As economists, we usually think the decision maker selects the alternative that provides the most utility. So if option A is a big mac and option B is a chicken sandwich, person \(i\) will choose a big mac if:

\(\text{Utility}_i^{\text{big mac}} \geq \text{Utility}_i^{\text{chicken sandwich}}\)

As long as utility functions are additively separable, we can express someone’s utility as a predictable, observable part \(V\) (perhaps having to do with their age and gender) added to an unobservable part \(\varepsilon\) (idiosyncratic differences having to do with that specific person at that specific time, unrelated to any variable that is observed by us as the scientists). So the person will choose a big mac if:

\(V_i^{\text{big mac}} + \varepsilon_i^{\text{big mac}} \geq V_i^{\text{chicken}} + \varepsilon_i^{\text{chicken}}\)

If the observable part of utility \(V\) is a linear function of observables like age and gender, we have:

\(V_i^{\text{big mac}} = \alpha_0 + \alpha_1 \text{age}_i + \alpha_2 \text{male}_i\)

\(V_i^{\text{chicken}} = \gamma_0 + \gamma_1 \text{age}_i + \gamma_2 \text{male}_i\)

  1. Suppose in the equations above, you have that \(\alpha_0 = 1\), \(\alpha_1 = -0.1\), \(\alpha_2 = 0.5\), \(\gamma_0 = -1\), \(\gamma_1 = 0.2\), \(\gamma_2 = -0.3\). Find \(V_i^{\text{big mac}}\) and \(V_i^{\text{chicken}}\) for the average 20 year old male and the average 50 year old female. If \(\varepsilon_i^{\text{big mac}} = 0\) and \(\varepsilon_i^{\text{chicken}} = 0\) for a 20 year old male, should you expect that he will buy a big mac or a chicken sandwich?

A logit doesn’t help us estimate people’s \(V_i^{\text{big mac}}\) or \(V_i^{\text{chicken}}\) separately, but it does let us estimate the difference \(V_i^{\text{diff}} = V_i^{\text{big mac}} - V_i^{\text{chicken}}\), and from that, the probability that someone with certain values for the explanatory variables will make a certain decision. So we don’t know how much utility someone gets from an option, but that number doesn’t matter: it’s the relative utility they get from different options that drives their decisions.
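A quick sanity check of the claim that only differences matter, using made-up values for \(V\): shifting both options’ utilities by the same constant leaves the choice probability unchanged.

```r
# Hypothetical observable utilities (made-up numbers):
v_big_mac <- 1.2
v_chicken <- 0.7

# The logit choice probability depends only on the difference:
plogis(v_big_mac - v_chicken)
#> [1] 0.6224593

# Adding the same constant to both options changes nothing:
plogis((v_big_mac + 5) - (v_chicken + 5))
#> [1] 0.6224593
```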

An agent will buy a big mac if:

\(V_i^{\text{big mac}} + \varepsilon_i^{\text{big mac}} \geq V_i^{\text{chicken}} + \varepsilon_i^{\text{chicken}}\)

Rearranging, notice that only differences matter. A person chooses a big mac only if the difference in their preference shocks does not exceed the difference in observable utility \(V_i^{\text{diff}}\):

\(\varepsilon_i^{\text{chicken}} - \varepsilon_i^{\text{big mac}} \leq V_i^{\text{big mac}} - V_i^{\text{chicken}}\)

And if we think of the epsilons as random variables, the probability that someone buys a big mac over a chicken sandwich is:

\(Prob(\text{big mac}_i) = Prob(\varepsilon_i^{\text{chicken}} - \varepsilon_i^{\text{big mac}} \leq V_i^{\text{big mac}} - V_i^{\text{chicken}})\)

And if we make assumptions on the distribution of \(\varepsilon_i^{\text{diff}} = \varepsilon_i^{\text{chicken}} - \varepsilon_i^{\text{big mac}}\), and if we estimate \(V_i^{\text{diff}} = V_i^{\text{big mac}} - V_i^{\text{chicken}}\), then we can calculate this choice probability. So what distribution should we assume? If your epsilon depends on your mood, hunger level, what you’ve already eaten that day, and how the restaurant smells, is it the sum of all those things, or is it the maximum? Arguments can be made for both.

If you think \(\varepsilon_i\) is the sum, you’d assume \(\varepsilon_i\) is distributed normally (by the central limit theorem, the sum of many small shocks is approximately normal, and the difference between two normals is normal). In that case, this defines a probit model.

If you think \(\varepsilon_i\) is the maximum (or minimum), you’d assume \(\varepsilon\) has the extreme value (EV) distribution: the maximum (or minimum) of a large number of random draws (including normal draws) is approximately EV distributed, and the maximum (or minimum) of a bunch of EVs is exactly EV. The difference between two independent EV variables (\(\varepsilon^{\text{diff}}\)) follows a logistic distribution. This defines the logit model. We’ll focus on the logit model here; in practice, the difference between estimating a logit versus a probit is usually very small.
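To see how close the logit and probit are in practice, compare the logistic CDF to a normal CDF after matching their spreads (the standard logistic distribution has standard deviation \(\pi/\sqrt{3} \approx 1.81\); the values of \(V^{\text{diff}}\) below are illustrative, not estimates):

```r
v_diff <- c(-2, -1, 0, 1, 2)

# Logit choice probabilities:
plogis(v_diff)

# Probit probabilities with a matched standard deviation:
pnorm(v_diff, sd = pi / sqrt(3))

# The two sets of probabilities differ by only a couple of
# percentage points at every value of v_diff.
```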

  1. What is the difference between a logit and a probit?

CDFs: Cumulative Distribution Functions

Recall, we’ve used rnorm before to generate random numbers from the normal distribution (a bell curve centered at the mean, with a spread defined by the standard deviation sd):

rnorm(n = 10, mean = 0, sd = 1)
 [1] -0.5984640 -0.8122762  0.1052818 -1.7538084  0.4755833  0.3849749
 [7]  0.9106370 -0.7642865 -1.2417900 -0.4119666

The CDF for the normal distribution is given by pnorm: it tells you the probability that a (normally distributed) random variable is less than or equal to some value.

For example, if X is distributed N(0, 1), the probability that X is less than -1.96 is 2.5%:

pnorm(-1.96)
[1] 0.0249979

And the probability that X is less than 0 is 50%:

pnorm(0)
[1] 0.5
  1. If X is distributed N(0, 1), what is the probability X is less than 1?

Likewise, the CDF for the difference between two extreme value variables (logistic distribution) is given by the function plogis.

  1. If X comes from the logistic distribution, what is the probability X is less than 1?

The CDF of the logistic distribution is given by this formula:

\[F(\varepsilon^{\text{diff}}) = \frac{\exp(\varepsilon^{\text{diff}})}{1 + \exp(\varepsilon^{\text{diff}})}\] where \(\exp(x)\) refers to Euler’s number (2.71828…) raised to the power of x.
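You can check that plogis computes exactly this formula:

```r
eps_diff <- 0.5

# The logistic CDF formula by hand:
exp(eps_diff) / (1 + exp(eps_diff))
#> [1] 0.6224593

# The built-in logistic CDF agrees:
plogis(eps_diff)
#> [1] 0.6224593
```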

  1. If you estimate a person has a \(V^{\text{bm - ch}} = 0.5\) in favor of big macs, then they would buy a big mac as long as \(\varepsilon^{\text{ch - bm}} \leq\) what value? Use plogis to find the probability of getting such an epsilon.

The logit model traces out the choice probability for all people on the \(V^{\text{diff}}\) horizontal line:

library(tidyverse)

ggplot() +
  stat_function(fun = plogis) +
  xlim(-5, 5)

And the logit choice probability is just the logistic CDF evaluated at \(V^{\text{diff}}\):

\[\begin{align} \text{Prob}(big \ mac) &= F(V^{diff}) \\ &= F(\alpha_0 - \gamma_0 + (\alpha_1 - \gamma_1) age + (\alpha_2 - \gamma_2) male) \\ &= \frac{exp(\alpha_0 - \gamma_0 + (\alpha_1 - \gamma_1) age + (\alpha_2 - \gamma_2) male)}{1 + exp(\alpha_0 - \gamma_0 + (\alpha_1 - \gamma_1) age + (\alpha_2 - \gamma_2) male)} \end{align}\]

Letting \(\alpha_0 - \gamma_0 = \beta_0\), \(\alpha_1 - \gamma_1 = \beta_1\), and \(\alpha_2 - \gamma_2 = \beta_2\):

\[\text{Prob}(big \ mac) = \frac{exp(\beta_0 + \beta_1 age + \beta_2 male)}{1 + exp(\beta_0 + \beta_1 age + \beta_2 male)} \tag{19.1}\]

This is the logit. Notice: it is not linear in parameters, so you can’t use lm() to estimate it.

Instead, a logit is linear in log odds:

\[\log_e \left (\frac{Prob(big \ mac)_i}{Prob(chicken)_i} \right ) = \beta_0 + \beta_1 age_i + \beta_2 male_i\]

And because you either buy a big mac or you buy a chicken sandwich, \(Prob(chicken) = 1 - Prob(big \ mac)\) so the logit is also:

\[\log_e \left (\frac{Prob(big \ mac)_i}{1 - Prob(big \ mac)_i} \right ) = \beta_0 + \beta_1 age_i + \beta_2 male_i\]
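A quick numeric check of that linearity, with made-up coefficients: start from the logit choice probability, compute the log odds, and the linear index comes back exactly.

```r
# Made-up coefficients and covariate value (not estimates):
beta0 <- -2
beta1 <- 0.05
age   <- 30

# Logit choice probability at this linear index:
p <- plogis(beta0 + beta1 * age)

# The log odds recover the linear index beta0 + beta1 * age:
log(p / (1 - p))
#> [1] -0.5
```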

  1. Show that if you take the log-odds model above and solve for \(Prob(big \ mac)_i\), you’ll get that \[Prob(big \ mac)_i = \frac{exp(\beta_0 + \beta_1 age_i + \beta_2 male_i)}{1 + exp(\beta_0 + \beta_1 age_i + \beta_2 male_i)}\] Which is exactly the way I defined the logit in Equation 19.1.

  2. A logit will never predict a probability outside of the \((0, 1)\) range. Show that this is true using the formula from Equation 19.1. In particular, consider what \(Prob(big \ mac)_i\) will be if \(\beta_0 + \beta_1 age_i + \beta_2 male_i\) is very large, like 1,000,000, or very negative, like -1,000,000.

Another bonus for logits over lm() is that logits predict varying marginal effects in the choice probability. For example, if \(\beta_0 = -2\) and \(\beta_1 = 0.05\):

ggplot() +
  stat_function(fun = function(x) exp(-2 + .05 * x) / (1 + exp(-2 + .05 * x))) +
  xlim(0, 100)

The slope of the line starts small, increases, then decreases: going from x = 15 to 25 creates a small increase in the probability you’ll take option A over option B, but not as much as the behavioral change going from x = 40 to 50. And by the time x = 90 or 100, the marginal impact of x on the probability you’ll take option A is small again (perhaps because everyone possibly interested in taking option A is already taking it).
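These varying marginal effects have a closed form: the derivative of the logistic CDF is the logistic density, so the marginal effect of x is \(\beta_1 F(v)(1 - F(v))\), which dlogis computes. A sketch using the same made-up coefficients as the plot above:

```r
# Marginal effect of x on the choice probability for the
# logit with beta0 = -2 and beta1 = 0.05:
marginal_effect <- function(x, b0 = -2, b1 = 0.05) {
  b1 * dlogis(b0 + b1 * x)
}

marginal_effect(20)  # small
marginal_effect(40)  # largest: the index is 0 here, where the curve is steepest
marginal_effect(90)  # small again
```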

Logit Coefficient Interpretation

To interpret the coefficients \(\hat{\beta_0}\) and \(\hat{\beta_1}\) for a logit, consider the model again:

\[log_e \left (\frac{p(y_i = 1)}{1 - p(y_i = 1)} \right ) = \beta_0 + \beta_1 x_i\]

\(\hat{\beta_0}\) is the estimated log odds of “a success” given x = 0. \(\hat{\beta_1}\) is the estimated increase in the log odds of a success when x increases by 1 unit.
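Because the model is linear in log odds, \(\exp(\hat{\beta_1})\) has a convenient interpretation: it’s the multiplicative change in the odds of a success when x increases by 1 unit. A sketch with a made-up coefficient (not from any fitted model):

```r
# Made-up logit slope coefficient:
b1 <- -0.098

# Each 1-unit increase in x multiplies the odds of a success
# by exp(b1), about 0.91 here: the odds fall by roughly 9%.
exp(b1)
```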

I’ll use glm() to fit the logit for the big mac example using artificial data.

set.seed(1234)

# Here I generate some artificial logit data:
data <- tibble(
  # Let ages be between 16 and 65
  age = sample(16:65, replace = T, size = 1000),                          
  male = sample(0:1, replace = T, size = 1000),
  # The (observable) value someone gets from a big mac depends on their age and gender:
  v = 2 - .1 * age + 2.8 * male,
  # The probability they buy a big mac depends on v according to the logit:
  prob_big_mac = exp(v) / (1 + exp(v)),                                                         
  # They buy a big mac with the probability prob_big_mac.
  big_mac = rbinom(n = 1000, size = 1, prob = prob_big_mac)
) %>%
  mutate(male = as.factor(male))

glm(big_mac ~ age + male, data = data, family = "binomial") %>%
  broom::tidy()
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   2.10     0.265        7.95 1.91e-15
2 age          -0.0981   0.00737    -13.3  2.03e-40
3 male1         2.56     0.190       13.5  2.62e-41
  1. Interpret these estimates by answering: what is my estimate for the probability that a 20 year old male will get a big mac? And what about a 50 year old female?

Finally, I’ll visualize the logit I just estimated:

data %>%
    ggplot(aes(x = age, y = big_mac, color = male)) +
    geom_jitter(height = .15) +
    stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial))
`geom_smooth()` using formula = 'y ~ x'

Download this assignment

Here’s a link to download this assignment.