Dynamic Programming: Value Iteration

In Q-learning, the agent knows nothing about the model at the start and learns entirely from experience.

Dynamic Programming, on the other hand, assumes the model is known: the agent knows the move() function and the payoff function. In this assignment we will implement a technique called value iteration.

Then we’ll add preference shocks.

GridWorld

We’ll use the same GridWorld as before:

┌───┬───┬───┐
│ 0 │ 0 │ 0 │
├───┼───┼───┤
│ 1 │ 2 │ 1 │
├───┼───┼───┤
│ 0 │ 0 │ 0 │
└───┴───┴───┘
library(tidyverse)

actions <- c("north", "south", "east", "west", "stay")

move <- function(cell, action) {
  if (action == "stay") {
    return(cell)
  } else if (action == "south") {
    if (cell <= 6) return(cell + 3) 
    else return(cell)
  } else if (action == "north") {
    if (cell >= 4) return(cell - 3) 
    else return(cell)
  } else if (action == "east") {
    if (cell %in% c(1, 2, 4, 5, 7, 8)) 
      return(cell + 1) 
    else return(cell)
  } else if (action == "west") {
    if (cell %in% c(2, 3, 5, 6, 8, 9)) 
      return(cell - 1) 
    else return(cell)
  }
}

payoffs <- c(0, 0, 0, 1, 2, 1, 0, 0, 0)
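
As a quick check that the payoff vector matches the grid drawn above (this assumes row-major numbering, with cell 1 in the top-left and cell 9 in the bottom-right):

```r
# Reshape the payoff vector into the 3 x 3 grid (row-major numbering)
payoffs <- c(0, 0, 0, 1, 2, 1, 0, 0, 0)
matrix(payoffs, nrow = 3, byrow = TRUE)
#      [,1] [,2] [,3]
# [1,]    0    0    0
# [2,]    1    2    1
# [3,]    0    0    0
```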

Question 1:

Write a function init_V that takes no inputs and returns a vector of length 9 filled with zeros.

init_V <- function() ___

library(testthat)
test_that("init_V works", {
  V <- init_V()
  expect_type(V, "double")        # numeric vector
  expect_length(V, 9)             # correct size
  expect_true(all(V == 0))        # all zeros
})

Question 2: We’ll start building up to value iteration on this GridWorld. Build a one-step lookahead function with this signature:

q <- function(V, s, a, payoffs) { ... }

It should return the value of being in state s and taking action a next, that is: payoffs[s] + 0.9 * V[s'] where s' = move(s, a) and 0.9 is the discount factor.

q <- function(V, s, a, payoffs) {
  ___
}

test_that("q works", {
  expect_equal(
    q(V = c(0, 2, rep(0, 7)), s = 1, a = "east", payoffs = c(1, rep(0, 8))),
    2.8)
})
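
To see where 2.8 comes from, work the test case by hand: from s = 1, action "east" leads to s' = 2, so the lookahead is payoffs[1] + 0.9 * V[2].

```r
# The Question 2 test case, computed by hand: s = 1, a = "east", so s' = 2
payoffs_test <- c(1, rep(0, 8))
V_test <- c(0, 2, rep(0, 7))
payoffs_test[1] + 0.9 * V_test[2]  # 1 + 0.9 * 2 = 2.8
```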

Question 3: Implement Value Iteration

Write a function: value_iteration <- function(payoffs, tol = 1e-6, max_iter = 2000) { ... }

Requirements:

  • Start: V <- init_V()
  • Repeat backups: V_new(s) = max_a q(V_old, s, a)
  • Stop when max(abs(V_new - V_old)) < tol or when the iteration count hits max_iter
  • Return a list with V rounded to 3 decimals and iterations
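
The backup above is the Bellman optimality update with the 0.9 discount factor from Question 2; in symbols,

\[ V_{k+1}(s) = \max_a \left\{ \text{payoffs}[s] + 0.9 \, V_k\big(\text{move}(s, a)\big) \right\} \]

Because the 0.9 discount shrinks differences between successive value functions, this update is a contraction, which is why the stopping rule on max(abs(V_new - V_old)) is guaranteed to eventually fire.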

value_iteration <- function(payoffs, tol = 1e-6, max_iter = 2000) {
  V <- init_V()
  V_old <- rep(1, 9)  # any value different from V, so the loop runs at least once
  i <- 0
  
  while(max(abs(V - V_old)) > tol && i < max_iter) {
    V_old <- V
    
    V <- map_dbl(
      .x = seq_along(payoffs),
      .f = function(s) {
        qs <- map_dbl(actions, function(a) ___)
        max(___)
      }
    )
    i <- i + 1
  }
  list(
    V = round(V, 3),
    iterations = i
  )
}

test_that("value_iteration works", {
  v <- value_iteration(payoffs = c(0, 0, 0, 1, 2, 1, 0, 0, 0))
  expect_equal(v$V, c(17.1, 18, 17.1, 19, 20, 19, 17.1, 18, 17.1))
})
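
The center value in this test has a closed form worth checking: under the optimal policy the agent stays in cell 5 forever, collecting 2 each period discounted by 0.9, so V(5) = 2 / (1 − 0.9) = 20; one more backup gives the neighboring values.

```r
# Closed-form check of the converged values around the center cell
2 / (1 - 0.9)  # 20: geometric series 2 + 2 * 0.9 + 2 * 0.9^2 + ...
1 + 0.9 * 20   # 19: from cell 4, collect 1 and move east into the center
0 + 0.9 * 19   # 17.1: from a corner, move toward a side cell worth 19
```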

Question 4: Extract the greedy policy from a value function

Write: greedy_policy_from_V <- function(V, payoffs) { ... }

It should return a character vector of length 9 of best responses. Tie-breaking rule: If multiple actions tie for the max, randomly pick one.
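
The tie-breaking rule can be implemented with a standard base-R idiom: which(qs == max(qs)) collects every maximizing index, and sample() picks one uniformly. A small illustration with made-up q-values:

```r
# Made-up q-values with a tie at the max (positions 2 and 3)
qs <- c(2, 5, 5, 1)
candidates <- which(qs == max(qs))
candidates                    # 2 3
sample(candidates, size = 1)  # 2 or 3, each with probability 1/2
```

One caveat: sample() treats a single numeric input n as 1:n, so sampling from a length-one numeric vector misbehaves. Sampling from a character vector of action names, as in the scaffold, avoids that surprise.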

greedy_policy_from_V <- function(V, payoffs) {
  map_chr(
    .x = seq_along(payoffs),
    .f = function(s) {
      qs <- map_dbl(.x = actions, .f = function(a) {
          ___ + 0.9 * V[___]
        }
      )
      sample(actions[which(qs == max(qs))], size = 1)
    }
  )
}

test_that("greedy_policy_from_V works", {
  payoffs <- c(0, 0, 0, 1, 2, 1, 0, 0, 0)
  v <- value_iteration(payoffs)
  pi <- greedy_policy_from_V(v$V, payoffs)
  expect_equal(pi, c("south", "south", "south", 
                     "east", "stay", "west",
                     "north", "north", "north"))
})

At this point, value iteration has solved the deterministic dynamic programming problem (no uncertainty). The greedy policy tells us which action would be optimal if payoffs were fixed.

But real decision-makers often face small idiosyncratic shocks. You usually buy the ham sandwich, but today you buy the grilled cheese: maybe the grilled cheese is on sale, or maybe you were just feeling like something different today.

We’ll now introduce preference shocks and study how the agent’s best responses can change, even when the continuation values are fixed.

Question 5: Add preference shocks

At each time step t, every cell j receives a shock \(\varepsilon_t(j) \sim N(0, 1)\)

The agent observes all shocks before choosing an action.

If the agent moves into cell \(s'\), it receives \(\text{payoff} = \text{payoff}(s') + \varepsilon_t(s')\)

Write two functions:

  • draw_shocks(), which takes no inputs and returns a vector of 9 standard-normal draws.
  • q_shock(V, s, a, shocks, payoffs), which returns the q-value of being in state s and taking action a, given the shocks. Note that, per the setup above, the flow payoff here comes from the destination cell: both the payoff and the shock are indexed by s', not s.
draw_shocks <- function() {
  ___
}

q_shock <- function(V, s, a, shocks, payoffs) {
  s_prime <- ___
  ___ + ___ + 0.9 * ___
}

test_that("q_shock works", {
  set.seed(1)
  shocks <- draw_shocks()
  V <- rep(0, 9)
  expect_equal(
    q_shock(V, s = 5, a = "north", shocks, payoffs = payoffs),
    payoffs[2] + shocks[2]
  )
})

Question 6: Best response given shocks

Write a function greedy_action_given_shocks <- function(V, s, shocks, payoffs) {...}.

This function should compute the agent’s best response at a single time step.

greedy_action_given_shocks <- function(V, s, shocks, payoffs) {
  qs <- map_dbl(actions, function(a) {
    ___
  })
  sample(___, size = 1)
}

Question 7: Run an experiment

  • Create Vstar, the value function for the GridWorld we’ve been using.
  • Let the agent start in the middle (state 5), set to collect the payoff of 2 next period if it stays.
  • 500 times: draw a shock and let the agent choose which action to take.
  • How many times did the agent stay versus move north, south, east, and west?
  • Is the agent more likely to move east or west than north or south? Think about why that might be.
Vstar <- value_iteration(payoffs)$V

choices <- replicate(500, {
  shocks <- draw_shocks()
  greedy_action_given_shocks(Vstar, s = 5, shocks, payoffs)
})

tibble(action = choices) %>%
  count(action)
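
To see why sideways moves should be more common, compare the deterministic part of each q-value at s = 5, with the shocks zeroed out. (The destinations and the V* values are hard-coded here from the grid and the Question 3 test, so this chunk runs on its own.)

```r
# Deterministic part of q_shock at s = 5: destination payoff plus
# discounted continuation value, ignoring the shocks
Vstar_vals <- c(17.1, 18, 17.1, 19, 20, 19, 17.1, 18, 17.1)
payoffs <- c(0, 0, 0, 1, 2, 1, 0, 0, 0)
dest <- c(north = 2, south = 8, east = 6, west = 4, stay = 5)
payoffs[dest] + 0.9 * Vstar_vals[dest]
# north and south trail "stay" by 3.8, east and west by only 1.9,
# so a smaller shock advantage is enough to pull the agent sideways
```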

Question 8: Generate trajectories under shocks

Now we’ll move from a “one-step choice” with a single shock draw to dynamic behavior, drawing new shocks at each time step.

Write a function: simulate_episode <- function(V, payoffs) {...}

This function should simulate an episode where:

  • the agent starts in a random state in 1:9
  • at each time step, the agent draws a fresh shock vector and chooses a best-response action
  • the agent transitions to the next state via move()
  • the realized payoff is the payoff of the destination cell plus the shock of the destination cell
  • the episode ends randomly after each step with probability 0.1
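
One implication of the last bullet, before filling in the code: with a 0.1 stopping probability per step, episode lengths are geometrically distributed with mean 1 / 0.1 = 10 steps. A quick simulation of just the stopping rule:

```r
# Each step survives with probability 0.9, so the episode length
# is geometric with mean 1 / 0.1 = 10
episode_length <- function() {
  t <- 1
  while (runif(1) <= 0.9) t <- t + 1
  t
}
set.seed(42)
mean(replicate(10000, episode_length()))  # close to 10
```
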
simulate_episode <- function(V, payoffs) {
  state <- ___
  t <- 1
  out <- list()

  repeat {
    shocks <- ___
    action <- ___
    next_state <- ___
    payoff <- ___ + ___

    out[[t]] <- tibble(
      t = t,
      state = state,
      action = action,
      next_state = next_state,
      payoff = payoff
    )

    if (runif(1) > .9) break  # end the episode with probability 0.1

    state <- next_state
    t <- t + 1
  }

  bind_rows(out)
}

simulate_episode(Vstar, payoffs)

Question 9: Animate a trajectory

Reuse the function animate_episode() from last classwork to run animate_episode(simulate_episode(Vstar, payoffs)).

anim <- animate_episode(simulate_episode(Vstar, payoffs))

anim_save("chapters/animations/anim4.gif", anim)

Download this assignment

Here’s a link to download this assignment.