Replace Rules with Trial-and-Error Learning

Instead of hard-coding a policy with a rule function, the agent will learn what is good by running many episodes, collecting outcomes, and updating its beliefs from experience.

Part 1: Running averages

Understanding how to compute running averages will be very helpful for this assignment.
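As a warm-up, note that the running average after \(n\) points is just the mean of the first \(n\) points, and R can compute the whole sequence at once. (The vector below is made up, chosen so that the mean of all four points is 3, matching question 1.)

```r
x <- c(2, 4, 3, 3)        # four made-up data points whose overall mean is 3
cumsum(x) / seq_along(x)  # running averages: 2, 3, 3, 3
```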

  1. Suppose the average of 4 data points is the value 3: sum(x)/4 = 3. Then you get a fifth data point that is a 1. What is the updated average?

  2. Use your logic from question 1 to write a function update_running_avg(avg, n, x_new). Test that it works with the values from question 1.

update_running_avg <- function(avg, n, x_new) {
  ___
}

library(tidyverse)
library(testthat)

test_that("updating a running average of 3 with 4 data points with a new value of 1 works", {
  expect_equal(update_running_avg(3, 4, 1), 2.6)
})

GridWorld

For this classwork, use the same GridWorld you worked with in the last classwork (no shocks yet):

0 0 0
1 2 1
0 0 0

We’ll use the same move() and payoffs_simple() functions from previous classworks:

move <- function(cell, action) {
  if (action == "stay") {
    return(cell)
  } else if (action == "south") {
    if (cell <= 6) {
      return(cell + 3)
    } else {
      return(cell)
    }
  } else if (action == "north") {
    if (cell >= 4) {
      return(cell - 3)
    } else {
      return(cell)
    }
  } else if (action == "east") {
    if (cell %in% c(1, 2, 4, 5, 7, 8)) {
      return(cell + 1)
    } else {
      return(cell)
    }
  } else if (action == "west") {
    if (cell %in% c(2, 3, 5, 6, 8, 9)) {
      return(cell - 1)
    } else {
      return(cell)
    }
  }
}

payoffs_simple <- function(position) {
  c(0, 0, 0,
    1, 2, 1,
    0, 0, 0)[position]
}
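A few quick sanity checks, assuming the two functions above have been run (the expected values follow from the grid layout and the wall rules):

```r
move(5, "north")    # from the center, north leads to cell 2
move(3, "east")     # cell 3 is on the east edge, so the agent stays at 3
payoffs_simple(5)   # the center cell pays 2
```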

Part 2: Turn our simulation function into an episode generator

In the last classwork, you wrote a function simulation that took a rule and returned a total payoff. Today, we’ll adjust it to return the whole trajectory: the cells visited, along with the payoff at each time step. You don’t need to do anything here: just read and run the demo below.

simulation <- function(policy_function, payoff_function) {
  state <- sample(1:9, size = 1)
  game_continues <- TRUE
  
  # storage
  states <- c()
  payoffs <- c()
  actions_taken <- c()
  next_states <- c()

  while(game_continues) {
    a <- policy_function(state)
    p <- payoff_function(state)
    s_next <- move(state, a)
    
    states <- c(states, state)
    actions_taken <- c(actions_taken, a)
    payoffs <- c(payoffs, p)
    next_states <- c(next_states, s_next)

    game_continues <- sample(c(TRUE, FALSE), size = 1, prob = c(0.9, 0.1))
    state <- s_next
  }
  
  return(
    tibble(
      t = seq_along(states),
      state = states,
      action = actions_taken,
      payoff = payoffs,
      next_state = next_states
    )
  )
}

# Demo: "always stay" policy
simulation(
  policy_function = function(state) "stay",
  payoff_function = payoffs_simple
  )
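A side note on why episodes end: each step continues with probability 0.9, so the episode length is geometrically distributed with mean 1/0.1 = 10 steps. A quick empirical check (R’s rgeom() counts failures before the first success, hence the + 1):

```r
set.seed(1)
lengths <- rgeom(100000, prob = 0.1) + 1  # simulated episode lengths
mean(lengths)                             # close to the theoretical mean of 10
```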

  1. Run the simulation with an “always north” policy.
simulation(
  policy_function = ___,
  payoff_function = ___
  )

  2. Write a function episode_total_payoff(ep) that takes an episode (tibble) and returns the total payoff for that episode. The function pull() might be helpful: it takes a tibble and a variable and returns that variable as a vector.
episode_total_payoff <- function(ep) {
  ep %>% pull(payoff) %>% ___
}

test_that("episode_total_payoff works", {
  set.seed(1234)
  temp <- simulation(
    policy_function = function(state) "stay",
    payoff_function = payoffs_simple
  )
  expect_equal(episode_total_payoff(temp), 10)
})

Part 3: Start with a “dumb” exploratory policy

Define a baseline behavior policy that does not use any knowledge: random_policy(state) samples uniformly from all actions, letting move() handle walls.
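If sample() is unfamiliar: it draws uniformly from a vector, which is the building block a random policy needs. (The color vector below is just a made-up example.)

```r
set.seed(42)
sample(c("red", "green", "blue"), size = 1)  # one uniform draw from the options
```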

  1. What is the average total payoff under random_policy? Use your update_running_avg() function from Part 1.
random_policy <- function(state) {
  ___
}

avg <- 0
for (i in 0:1000) {  # i = 0 on the first pass, so the first update treats avg as an average of zero points
  x_new <- simulation(random_policy, payoffs_simple) %>%
    episode_total_payoff()
  avg <- update_running_avg(avg, i, x_new)
}
avg

Part 4: Monte Carlo Learning of \(Q(s, a)\) using the first decision

Up to now, your agent has just been wandering randomly. Next we’ll let it learn from experience!

We’ll build a table \(Q(s, a)\) that answers the question “If the agent starts in state \(s\), takes action \(a\) first, and then behaves randomly after that, what total payoff should it expect to get on average?”
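In symbols (where \(G\) denotes the total payoff of an episode), the table is estimating

\[
Q(s, a) \;=\; \mathbb{E}\!\left[\, G \mid s_0 = s,\; a_0 = a,\ \text{actions chosen at random thereafter} \,\right].
\]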

  1. First, initialize a Q table. We want one row for each (state, action) pair: use expand_grid.
init_Q <- function() {
  expand_grid(state = 1:9, action = ___) %>%
    mutate(Q = 0, N = 0)
}

Qtab <- init_Q()
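If expand_grid() is new, here is its behavior on a toy example (made-up vectors): it returns a tibble with one row per combination of its inputs, varying the last input fastest.

```r
library(tidyverse)

expand_grid(letter = c("a", "b"), number = 1:3)
# 6 rows: (a, 1), (a, 2), (a, 3), (b, 1), (b, 2), (b, 3)
```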

Next, consider: what information do we get from one episode?

simulation(random_policy, payoffs_simple)

From one episode you learn s0 (the starting state), a0 (the first action taken), and G (the total payoff from the entire episode). That gives us one observation of Q(s0, a0).

ep <- simulation(random_policy, payoffs_simple)

s0 <- pull(ep, state)[___]
a0 <- pull(ep, action)[___]
G <- sum(___)

The interpretation is that, starting in state s0, taking action a0 first, the agent earned a total payoff of G.

  2. Write a function to update Qtab given s0, a0, and G using a running average.
update_Q <- function(Qtab, s0, a0, G) {
  idx <- which(pull(Qtab, state) == s0 & pull(Qtab, action) == a0)
  
  Q_old <- pull(Qtab, Q)[___]
  N_old <- pull(Qtab, N)[___]
  
  Qtab$Q[idx] <- update_running_avg(Q_old, N_old, G)
  Qtab$N[idx] <- N_old + 1
  
  return(Qtab)
}

  3. Now train Q by running 5000 episodes. Each episode updates exactly one row of the Q table.
for (i in 1:5000) {
  ep <- simulation(random_policy, payoffs_simple)
  
  s0 <- pull(ep, state)[___]
  a0 <- pull(ep, action)[___]
  G <- sum(___)
  
  Qtab <- update_Q(Qtab, s0, a0, G)
}

  4. You have now learned estimates of \(Q(s, a)\) purely from experience. Use dplyr functions on Qtab to find the action with the highest Q value for each state. How well did trial-and-error training do here?
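A hint for the dplyr pattern “row with the largest value within each group”: combine group_by() with slice_max(). A toy illustration on made-up data:

```r
library(tidyverse)

toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 5, 2, 4)
)

toy %>%
  group_by(group) %>%
  slice_max(value, n = 1)   # keeps (a, 5) and (b, 4)
```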

Download this assignment

Here’s a link to download this assignment. When you’re done, render to HTML (File > Render Document) and upload to Canvas (one copy per group).