update_running_avg <- function(avg, n, x_new) {
  ___
}
library(tidyverse)
library(testthat)
test_that("updating a running average of 3 with 4 data points with a new value of 1 works", {
  expect_equal(update_running_avg(3, 4, 1), 2.6)
})

Replace Rules with Trial and Error Learning
Instead of hard coding a policy with a rule function, the agent will learn what is good by running lots of episodes, collecting outcomes, and updating beliefs from experience.
Key ideas:
- An episode is a starting state, a sequence of states and rewards, followed by termination.
- Trial and error learning is about policy evaluation from simulation: “If I follow this policy, what payoff do I get on average?”
- The action value \(Q(s, a)\) is the expected total payoff if the agent takes action \(a\) in state \(s\) and then follows its current behavior afterward.
- Learning is updating a table with running averages.
Part 1: Running averages
Understanding how to compute running averages will be very helpful for this assignment.
Suppose the average of 4 data points is the value 3:
sum(x)/4 = 3. Then you get a fifth data point that is a 1. What is the updated average?
Use your logic from part a) to write a function update_running_avg(avg, n, x_new). Test that it works with the values from part a).
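To check the arithmetic from part a), note the running-average identity: the old sum is avg * n, so folding in one new point gives new_avg = (avg * n + x_new) / (n + 1). A quick numeric check with the values above:

```r
avg <- 3; n <- 4; x_new <- 1

# Recover the old sum, add the new point, divide by the new count:
(avg * n + x_new) / (n + 1)      # 2.6

# Equivalent incremental form, no stored sum needed:
avg + (x_new - avg) / (n + 1)    # 2.6
```

The incremental form is handy because it only needs the current average and count, never the raw data.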
GridWorld
For this classwork, use the same GridWorld from the last classwork (no shocks yet):
| 0 | 0 | 0 |
| 1 | 2 | 1 |
| 0 | 0 | 0 |
We’ll use the same move() and payoffs_simple() functions from previous classworks:
move <- function(cell, action) {
  if (action == "stay") {
    return(cell)
  } else if (action == "south") {
    if (cell <= 6) {
      return(cell + 3)
    } else {
      return(cell)
    }
  } else if (action == "north") {
    if (cell >= 4) {
      return(cell - 3)
    } else {
      return(cell)
    }
  } else if (action == "east") {
    if (cell %in% c(1, 2, 4, 5, 7, 8)) {
      return(cell + 1)
    } else {
      return(cell)
    }
  } else if (action == "west") {
    if (cell %in% c(2, 3, 5, 6, 8, 9)) {
      return(cell - 1)
    } else {
      return(cell)
    }
  }
}
payoffs_simple <- function(position) {
  c(0, 0, 0,
    1, 2, 1,
    0, 0, 0)[position]
}

Part 2: Turn our simulation function into an episode generator.
In the last classwork, you wrote a function simulation that took a rule and returned a total payoff. Today, we’ll adjust that to return the whole trajectory: cells visited along with the payoffs in each time step. You don’t need to do anything here: just read and run the demo at the bottom.
simulation <- function(policy_function, payoff_function) {
  state <- sample(1:9, size = 1)
  game_continues <- TRUE
  # storage
  states <- c()
  payoffs <- c()
  actions_taken <- c()
  next_states <- c()
  while (game_continues) {
    a <- policy_function(state)
    p <- payoff_function(state)
    s_next <- move(state, a)
    states <- c(states, state)
    actions_taken <- c(actions_taken, a)
    payoffs <- c(payoffs, p)
    next_states <- c(next_states, s_next)
    # continue with probability 0.9, terminate with probability 0.1
    game_continues <- sample(c(TRUE, FALSE), size = 1, prob = c(0.9, 0.1))
    state <- s_next
  }
  return(
    tibble(
      t = 1:length(states),
      state = states,
      action = actions_taken,
      payoff = payoffs,
      next_state = next_states
    )
  )
}
# Demo: "always stay" policy
simulation(
  policy_function = function(state) "stay",
  payoff_function = payoffs_simple
)
- Run the simulation with an “always north” policy.
simulation(
  policy_function = ___,
  payoff_function = ___
)
- Write a function episode_total_payoff(ep) that takes an episode (tibble) and returns the total payoff for that episode. The function pull() might be helpful: it takes a tibble and a variable and returns that variable as a vector.
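If pull() is unfamiliar, here is a minimal illustration on a made-up tibble (the name toy and its values are just for this example):

```r
library(dplyr)
library(tibble)

toy <- tibble(payoff = c(2, 0, 1))
pull(toy, payoff)         # the payoff column as a plain vector: 2 0 1
sum(pull(toy, payoff))    # 3
```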
episode_total_payoff <- function(ep) {
  ep %>% pull(payoff) %>% ___
}
test_that("episode_total_payoff works", {
  set.seed(1234)
  temp <- simulation(
    policy_function = function(state) "stay",
    payoff_function = payoffs_simple
  )
  expect_equal(episode_total_payoff(temp), 10)
})

Part 3: Start with a “dumb” exploratory policy
Define a baseline behavior policy that does not use any knowledge: random_policy(state) samples uniformly from all actions, letting move() handle walls.
- What is the average total payoff under random_policy? Use your function from part 1, update_running_avg.
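A reminder about base R's sample(): when the prob argument is omitted, it draws uniformly, which is exactly the behavior random_policy needs (the actions vector below is an assumed name, not part of the assignment code):

```r
actions <- c("stay", "north", "south", "east", "west")
sample(actions, size = 1)   # one action, each chosen with probability 1/5
```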
random_policy <- function(state) {
  ___
}
avg <- 0
for (i in 0:1000) {
  x_new <- simulation(random_policy, payoffs_simple) %>%
    episode_total_payoff()
  avg <- update_running_avg(avg, i, x_new)
}
avg

Part 4: Monte Carlo Learning of \(Q(s, a)\) using the first decision
Up to now, your agent has just been wandering randomly. Next we’ll let it learn from experience!
We’ll build a table \(Q(s, a)\) that answers the question “If the agent starts in state \(s\), takes action \(a\) first, and then behaves randomly after that, what total payoff should it expect to get on average?”
- First, initialize a Q table. We want one row for each (state, action) pair: use expand_grid.
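If expand_grid() is new, here is what it produces on a smaller made-up example (2 states and 2 actions rather than the full grid):

```r
library(tidyr)

expand_grid(state = 1:2, action = c("a", "b"))
# A tibble with one row per (state, action) pair:
#   (1, "a"), (1, "b"), (2, "a"), (2, "b")
```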
init_Q <- function() {
  expand_grid(state = 1:9, action = ___) %>%
    mutate(Q = 0, N = 0)
}
Qtab <- init_Q()

Next, consider: what information do we get from one episode?
simulation(random_policy, payoffs_simple)

You learn s0 (the starting state), a0 (the first action taken), and G (the total payoff from the entire episode). That will give us one observation about Q(s0, a0).
ep <- simulation(random_policy, payoffs_simple)
s0 <- pull(ep, state)[___]
a0 <- pull(ep, action)[___]
G <- sum(___)

The interpretation is that, starting in state s0 and taking action a0 first, the agent earned a total payoff of G.
- Write a function to update Qtab given s0, a0, and G using a running average.
update_Q <- function(Qtab, s0, a0, G) {
  idx <- which(pull(Qtab, state) == s0 & pull(Qtab, action) == a0)
  Q_old <- pull(Qtab, Q)[___]
  N_old <- pull(Qtab, N)[___]
  Qtab$Q[idx] <- update_running_avg(Q_old, N_old, G)
  Qtab$N[idx] <- N_old + 1
  return(Qtab)
}
- Now train Q by running 5000 episodes. Each episode updates exactly one row of the Q table.
for (i in 1:5000) {
  ep <- simulation(random_policy, payoffs_simple)
  s0 <- pull(ep, state)[___]
  a0 <- pull(ep, action)[___]
  G <- sum(___)
  Qtab <- update_Q(Qtab, s0, a0, G)
}
- You have now learned estimates of \(Q(s, a)\) purely from experience. Use dplyr functions on Qtab to find the action with the highest Q value for each state. How well did trial and error training do here?
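One dplyr pattern that may help: group_by() followed by slice_max() keeps the row with the largest value within each group. Shown here on a small made-up table (toy_Q is not your learned Qtab):

```r
library(dplyr)
library(tibble)

toy_Q <- tibble(
  state  = c(1, 1, 2, 2),
  action = c("stay", "north", "stay", "north"),
  Q      = c(0.5, 1.2, 3.0, 2.1)
)

toy_Q %>%
  group_by(state) %>%
  slice_max(Q, n = 1, with_ties = FALSE)
# one row per state: the highest-Q action ("north" for state 1, "stay" for state 2)
```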
Download this assignment
Here’s a link to download this assignment. When you’re done, compile to html (File > Render Document), and upload to Canvas (one copy per group).