Rust (1987) Part 1: Estimation

Everything we’ve done in this class so far has been building toward this moment! Today, we’ll estimate a dynamic discrete choice model (AKA an inverse reinforcement learning model), in which an agent makes repeated decisions over time, under uncertainty. We’ll estimate the payoff function the agent was most likely maximizing.

We’ll replicate the core estimation idea in John Rust (1987), Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher. By the end, you will be able to explain why Rust’s estimation routine has two nested loops: an inner loop that solves for the value function by value iteration, and an outer loop that searches for the parameter values that maximize the likelihood of the observed data.

The Story

Harold Zurcher managed a fleet of buses in Madison, Wisconsin. Each month, for each bus, he chose one of two actions:

  1. Keep the current engine (do routine maintenance)
  2. Replace the engine (pay a large one-time cost and reset the engine to new)

Maintenance costs tend to rise with accumulated mileage. Replacement is costly but resets the system.

We will model this with a simple state variable: the engine’s mileage at the start of the month.

The Model

Think of the state space as a 2x90 GridWorld: a staircase plus a ledge, where higher elevation means higher costs:

A new engine starts at the bottom of the staircase, with low maintenance costs. As time goes by, the engine accumulates mileage, climbing up the staircase and requiring more and more costly maintenance. At some point, Zurcher decides it’s worthwhile to replace the bus engine, represented by a step to the left onto the ledge, paying the replacement cost (the height of the ledge). After stepping onto the ledge, the engine starts over at the bottom of the staircase.

Parameters we’ll estimate

Rust estimates two key parameters:

  • \(\theta_1\): how quickly maintenance costs rise as mileage increases
  • \(\theta_2\): the replacement cost

We will treat these as unknown and estimate them from observed behavior.

Question 1: Interpretation. What does a larger \(\theta_2\) imply about replacement behavior? What about a larger \(\theta_1\)? Explain.

The Data

Zurcher recorded each bus’s mileage bucket at the start of each month and whether an engine replacement occurred that month.

  • There are 104 buses in the fleet
  • Observations are monthly
  • Mileage is discretized into buckets of 5,000 miles
  • If mileage == 1, the engine has between 0 and 4,999 miles at the start of that month.
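To make the discretization concrete, here is a hypothetical helper (not part of the data set or the assignment) that maps raw odometer miles into buckets under this scheme:

```r
# Hypothetical helper: map raw miles to a 5,000-mile bucket.
miles_to_bucket <- function(miles) floor(miles / 5000) + 1

miles_to_bucket(c(0, 4999, 5000, 12000))  # buckets 1, 1, 2, 3
```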
library(tidyverse)

bus <- read_csv("https://raw.githubusercontent.com/cobriant/dplyrmurdermystery/master/bus.csv") %>%
  group_by(bus) %>%
  mutate(month = row_number()) %>%
  ungroup()

# Question 2: One bus over time
# Take bus "GMC_A5308_75_5302". Visualize a time
# series of the mileage. What happened at month 30?

bus %>% 
  filter(___) %>%
  ggplot(aes(x = ___, y = ___)) +
  geom_line()

# Question 3: Visualize a time series of engine mileages
# for all buses, with color representing bus. You should
# see that Zurcher replaces bus engines in a wide range
# of accumulated mileages: there are plenty of 
# unobserved variables here going in to Zurcher's decision.

bus %>% 
  ggplot(aes(x = ___, y = ___, color = ___)) +
  geom_line() +
  theme(legend.position = "none")

Markov Transition Probabilities

The mileage state evolves stochastically. Rust assumes that, if Zurcher keeps the engine, then next month’s mileage bucket changes by 0, 1, or 2 buckets:

  • \(p_0\): probability of moving up 0 buckets
  • \(p_1\): probability of moving up 1 bucket
  • \(p_2\): probability of moving up 2 buckets

Rust estimates: \[(p_0, p_1, p_2) = (.3489, .6394, .0117)\]
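To get a feel for these numbers, here is a small sketch (not part of the assignment) that simulates a year of mileage-bucket increments under the keep action, using Rust’s estimated probabilities:

```r
# Simulate 12 monthly bucket increments under "keep":
# each month the bucket moves up 0, 1, or 2 steps.
set.seed(1)
increments <- sample(0:2, size = 12, replace = TRUE,
                     prob = c(.3489, .6394, .0117))

cumsum(increments) + 1  # bucket path, starting from a new engine (bucket 1)
```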

We’ll build two 90x90 transition matrices:

  • \(T_{keep}\): transition matrix if action is keep
  • \(T_{replace}\): transition matrix if action is replace (reset engine)

Under keep, your next state depends on your current state (banded matrix).

Under replace, the engine resets, so the next state distribution is the same no matter where you started (all rows identical).

\[ T_{\text{keep}} = \begin{bmatrix} p_0 & p_1 & p_2 & 0 & 0 & 0 & \cdots \\ 0 & p_0 & p_1 & p_2 & 0 & 0 & \cdots \\ 0 & 0 & p_0 & p_1 & p_2 & 0 & \cdots \\ \vdots & & & \ddots & & & \ddots \end{bmatrix} \]

\[ T_{\text{replace}} = \begin{bmatrix} p_0 & p_1 & p_2 & 0 & 0 & 0 & \cdots \\ p_0 & p_1 & p_2 & 0 & 0 & 0 & \cdots \\ p_0 & p_1 & p_2 & 0 & 0 & 0 & \cdots \\ \vdots & & & \ddots & & & \ddots \end{bmatrix} \]

Question 4: Interpret the transition matrices. Before coding: why is \(T_{keep}\) mostly zeros except near the diagonal? Why does \(T_{replace}\) have identical rows?
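If the structure is hard to see at 90x90, here is a toy 4-state version (with made-up probabilities, separate from the assignment’s \(p_0, p_1, p_2\)) that has the same banded / identical-rows shapes as the displayed matrices:

```r
# Toy 4-state illustration of the two transition-matrix shapes.
q0 <- .5; q1 <- .4; q2 <- .1

# Keep: banded. Each row puts mass on staying or moving up 1-2 states.
toy_keep <- matrix(0, nrow = 4, ncol = 4)
toy_keep[1, 1:3] <- c(q0, q1, q2)
toy_keep[2, 2:4] <- c(q0, q1, q2)
toy_keep[3, 3] <- q0
toy_keep[3, 4] <- q1 + q2  # boundary: can't move past the top state
toy_keep[4, 4] <- 1        # top state absorbs

# Replace: every row is identical, because the engine resets
# regardless of where it started.
toy_replace <- matrix(c(q0, q1, q2, 0), nrow = 4, ncol = 4, byrow = TRUE)

rowSums(toy_keep)     # each row is a probability distribution
rowSums(toy_replace)
```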

p0 <- .3489
p1 <- .6394
p2 <- .0117

# Question 5: Fill in the blanks to create t_replace:
t_replace <- matrix(rep(0, 90*90), nrow = 90)

for(i in 1:90) {
  t_replace[i, 1] <- ___
  t_replace[i, 2] <- ___
  t_replace[i, 3] <- ___
}

# Check that all rows sum to probability 1 (compare with a small
# tolerance: floating-point sums can miss 1 by a hair):
all(abs(rowSums(t_replace) - 1) < 1e-8)

# Question 6: Fill in the blanks to create t_keep:
t_keep <- matrix(rep(0, 90*90), nrow = 90)

for(i in 1:88) {
  t_keep[i, i] <- ___
  t_keep[i, i + 1] <- ___
  t_keep[i, i + 2] <- ___
}

# Handle the boundary: bucket 89 can only move up one step,
# and bucket 90 must be absorbing:
t_keep[89, 89] <- p0
t_keep[89, 90] <- 1 - p0
t_keep[90, 90] <- 1

# Check that all rows sum to probability 1 (again with a tolerance):
all(abs(rowSums(t_keep) - 1) < 1e-8)

Inner Loop: Value Iteration

Given a candidate \(\theta = (\theta_1, \theta_2)\), we’ll solve for the value function using value iteration. We’ll use discount factor \(\beta = 0.9999\).

The issue: in value iteration, we’d usually use something like max(value_keep, value_replace) to find the action with the highest value in each state. But the outer loop is all about numerical optimization, and max() is not differentiable (too pointy). So we’ll use a smooth approximation, which also happens to be the expected maximum when each action carries an i.i.d. extreme value taste shock — the same shocks that give us logit choice probabilities later:

\[\max(x, y) \approx \log(\exp(x) + \exp(y))\]

Logsumexp with the stability trick

logsumexp <- function(x, y) {
    m <- pmax(x, y)
    m + log(exp(x - m) + exp(y - m))
}

Question 7: why do we need this? Explain why log(exp(x) + exp(y)) can be numerically unsafe. Hint: think about what happens if x is very large. The problem isn’t the logarithm, it’s what happens before the log is taken.
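As a hint, you can run both versions in the console and compare (this assumes the logsumexp() function defined above):

```r
# Naive version: exp(1000) overflows double precision to Inf,
# so the whole expression collapses.
log(exp(1000) + exp(999))   # Inf

# Stable version: subtracting the max first keeps exp() in range.
logsumexp(1000, 999)        # about 1000.313
```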

Question 8: write the value iteration function

Your function should:

  • take theta (vector of length 2)
  • iterate until convergence
  • return v_keep and v_replace
S <- 90
beta <- .9999

value_iteration <- function(theta, tol = 0.01) {
  V <- rep(0, S)
  V_old <- rep(1, S)

  while (max(abs(V - V_old)) > tol) {
    V_old <- V

    # EV: continuation values
    EV_keep <- as.vector(___ %*% V_old)
    EV_replace <- as.vector(___ %*% V_old)

    # v_keep = value of keeping. Payoff today:
    #          (-theta[1] * 0.001 * (1:S)) +
    #          beta * continuation value.
    #          Scaling theta[1] by .001 makes theta[1]
    #          and theta[2] the same scale, again
    #          for better numerical optimization.
    # v_replace = value of replacing. Pay replacement cost;
    #          next period, get value of starting over.
    v_keep <- -theta[1] * 0.001 * (1:S) + beta * ___
    v_replace <- -theta[2] + beta * ___

    # logsumexp instead of pmax() for smooth differentiation
    V <- logsumexp(v_keep, v_replace)
  }

  list(v_keep = v_keep, v_replace = v_replace, V = V)
}

# Check: you should get values like -848 and -849.
value_iteration(c(5, 5))$v_keep[1:10]

Outer Loop

Here are the steps to building the outer loop:

  • choice_prob(theta) takes a candidate theta, runs the inner loop to find the value function, and computes and returns the logit choice probabilities.
  • log_lik takes choice probabilities and returns the log likelihood of seeing Rust’s data given those choice probabilities.
  • Finally, we’ll estimate the dynamic discrete choice model with a numerical optimizer, finding the candidate theta with the highest log likelihood.

Step 1: From the value function to logit choice probabilities

With extreme value shocks, we get logit:

\[P(keep) = \frac{\exp(v_{keep}(s))}{\exp(v_{keep}(s)) + \exp(v_{replace}(s))}\]

We’ll compute it stably:

choice_prob <- function(theta) {
  v <- value_iteration(theta)
  m <- pmax(v$v_keep, v$v_replace)
  exp_vk <- exp(v$v_keep - m)
  exp_vr <- exp(v$v_replace - m)
  exp_vk / (exp_vk + exp_vr)
}

Question 9: Comparative statics. If \(\theta_1\) (the maintenance cost parameter) rises, what should happen to \(P(keep)\) at high \(s\)? If \(\theta_2\) (the replacement cost) rises, what should happen to \(P(keep)\)?

Step 2: From choice probabilities to the likelihood

Here’s a summary of the Rust data set with the number of times a bus was and was not replaced in each mileage bucket:

rust <- read_csv("https://raw.githubusercontent.com/cobriant/dplyrmurdermystery/master/rust.csv")

n_keep <- rust %>% filter(replace == F) %>% pull(n)
n_replace <- rust %>% filter(replace == T) %>% pull(n)

For each state \(s\), we observe \(n_{s, keep}\) and \(n_{s, replace}\). So the log likelihood is:

\[LL(\theta) = \sum_{s = 1}^{90} \left( n_{s, keep} \log p_s(\theta) + n_{s, replace} \log (1 - p_s(\theta)) \right)\]

Question 10: write log_lik(p_keep)

Multiply counts n_keep and n_replace by log probabilities (p_keep will be a vector of length 90), and make sure your function returns a single number: the log likelihood.

log_lik <- function(p_keep) {
  sum(___ * log(___) + ___ * log(1 - ___))
}

Step 3: Maximizing the likelihood with a numerical optimizer

Now we combine everything:

likelihood <- function(theta) {
  p_keep <- choice_prob(theta)
  
  # Numerical safety adjustment: clip probabilities into the
  # safe interval [eps, 1-eps], preventing log(0).
  eps <- 1e-12
  p_keep <- pmin(pmax(p_keep, eps), 1 - eps)
  log_lik(p_keep)
}

# Maximize the log likelihood by
# minimizing the negative:
fit <- optim(
  par = c(1, 9),
  fn = function(theta) -likelihood(theta),
  method = "BFGS",
  control = list(maxit = 500, reltol = 1e-8)
)

# Check: you should get a maintenance cost parameter around 2.7
# and a replacement cost around 9.8.
fit$par
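As an optional extension (not part of Rust’s original exercise), you can approximate standard errors for the estimates from the inverse Hessian of the negative log likelihood at the optimum:

```r
# Re-run optim() at the fitted values, asking it to return the
# numerically differentiated Hessian of the negative log likelihood.
fit_se <- optim(
  par = fit$par,
  fn = function(theta) -likelihood(theta),
  method = "BFGS",
  hessian = TRUE
)

# Approximate standard errors: square roots of the diagonal
# of the inverse Hessian.
sqrt(diag(solve(fit_se$hessian)))
```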

Download this assignment

Here’s a link to download this assignment.