4  Dynamics

Richard Bellman, the Inventor of Dynamic Programming

4.1 Introduction

In the last chapter, I developed the theory of a Markov Decision Process: a Markov Process (with Markov Chain transition probabilities) where an agent can select among several alternative actions, each of which has its own set of transition probabilities attached.

In this chapter, I’ll use two solution methods to actually solve an MDP for its optimal policy function: a mapping from state to the best action the agent can take. That action would earn the agent the most “points” (or “reward” or “utility” or “money”) today while setting them up to earn the most points later on.

4.2 Lemon Tree Problem

4.2.1 Transition Probabilities

Consider a new version of the lemon tree Markov Chain from the last chapter. The first row/column corresponds to having 0 lemons, the second corresponds to having 1 lemon, and the third and fourth correspond to having 3 and 6 lemons, respectively. Having 6 lemons is absorbing, and it’s impossible to get exactly 2, 4, or 5 lemons. The idea is that the more lemons there are on the tree, the more lemons will grow (up to a maximum of 6).

\[\begin{equation} T_{water} = \begin{bmatrix} p0 & p1 & p2 & 0\\ 0 & p0 & p1 & p2\\ 0 & 0 & p0 & 1 - p0\\ 0 & 0 & 0 & 1\\ \end{bmatrix} \end{equation}\]

So if you water your tree and you had 0 lemons on it, next period you’ll have 0 lemons w.p. (“with probability”) p0, you’ll have 1 lemon w.p. p1, and you’ll have 3 lemons w.p. p2.

If you harvest your lemons, you reset the process, but to simplify the math later on, I’ll make this assumption: after harvesting, you end up with 0 lemons on your tree next period only w.p. p0 (you get 1 or 3 lemons w.p. p1 and p2, respectively). So just to emphasize once again: harvesting can result in 0, 1, or 3 lemons on your tree the very next period.
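One way to write this down in the same style as \(T_{water}\) is a harvest transition matrix whose rows are all identical, because harvesting resets the tree no matter how many lemons were on it:

\[\begin{equation} T_{harvest} = \begin{bmatrix} p0 & p1 & p2 & 0\\ p0 & p1 & p2 & 0\\ p0 & p1 & p2 & 0\\ p0 & p1 & p2 & 0\\ \end{bmatrix} \end{equation}\]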

4.3 Selling Lemons

Suppose also that you can sell your lemons when you harvest them for $1 per lemon. So if you harvest when you have 1 lemon on your tree, you can make $1. And if you harvest when you have 6 lemons on your tree, you can make $6.

The game will continue for an infinite number of periods. But I’ll introduce a discount factor of \(\beta = 0.9\) so that your expected earnings from playing the game converge. The discount factor is a measure of time preference (“patience”), but you can also think of it as simply the probability the game continues from one period to the next. So while the game theoretically continues for an infinite number of periods, you only survive from one period to the next w.p. \(\beta = 0.9\).

The discount factor is related to patience because, since there’s only a 90% chance of surviving from one period to the next, a strategy that gets you points now is preferable to a strategy that gets you the same number of points later.
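To be concrete about what “expected earnings” means here: a dollar earned \(k\) periods from now is worth \(\beta^k\) dollars to you today, so (introducing \(r_t\) as my notation for the money you earn in period \(t\)) the quantity you’re trying to maximize is

\[\begin{equation} E\left[\sum_{t=0}^{\infty} \beta^t r_t\right] \end{equation}\]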

The question is: what’s the optimal policy rule in the lemon tree game? That is, given a number of lemons on your tree, what should you do to maximize the amount of money you can earn?

  • Should you harvest when you have 6 lemons? Yes: you can’t accumulate more than 6 lemons, so watering at that point would be a wasted turn. You should harvest, earn $6 ($1 per lemon), and start over accumulating more lemons.

  • But should you also harvest when you have 3 lemons on your tree? Maybe: you can earn $3 now, but you’d reset your lemons back to 0 and give up being able to sell 6 later on.

Exercise 1: Think about this problem a little and write down your thoughts. Your candidates for the optimal policy might be harvest1, harvest3, and harvest6, which correspond to “harvest as soon as you have at least 1 lemon”, “harvest as soon as you have at least 3 lemons”, and “wait until you have 6 lemons to harvest”. Does the expected return to harvest6 depend positively or negatively on p0, p1, or p2?

Exercise 2: Would the optimal policy ever be something different from harvest1, harvest3, or harvest6? For instance, consider a policy where you water until you get 3 lemons and then water just one more time to try for 6. If you fail to get 6, you harvest on your next turn. Could that policy ever be optimal?

You’ll solve this lemon tree problem two different ways in this chapter: first you’ll simulate the game under different policy rules and see which policy seems to earn you the most money. Then you’ll use a technique from dynamic programming called value iteration to solve the problem with a little more precision. Value iteration works by finding the expected value of being in each state (number of lemons on your tree) and taking a certain action. If that expected value is higher for harvesting than for watering, you know you should harvest whenever you find yourself in that state.

4.4 Simulation

Follow these steps to simulate the game and evaluate the returns to the 3 different policy rules.

Exercise 3: First write a function transition() that takes probabilities p0, p1, and p2 (assume they sum to 1) and returns a probability transition matrix populated with those probabilities. Make sure it looks like \(T_{water}\) from above.
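If you want to check your work against something, here’s one possible sketch of transition() (the byrow = TRUE layout and the ordering of states are just my choices):

transition <- function(p0, p1, p2) {
  # Rows/columns correspond to 0, 1, 3, and 6 lemons, matching T_water above.
  matrix(c(p0, p1, p2,      0,
            0, p0, p1,     p2,
            0,  0, p0, 1 - p0,
            0,  0,  0,      1),
         nrow = 4, byrow = TRUE)
}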

Exercise 4: Next write a function grow_lemons() that takes a current number of lemons and a transition matrix and returns a new number of lemons assuming you’ve chosen to water. The function should use the probability transition matrix from exercise 3 to sample with the correct probabilities. That is, running grow_lemons(0, t) should return 0 w.p. p0, it should return 1 w.p. p1, and it should return 3 w.p. p2.
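Here’s one way grow_lemons() might look. It assumes the transition matrix t built in exercise 3, and the lemon counts c(0, 1, 3, 6) are hard-coded to match the rows of that matrix:

grow_lemons <- function(lemons, t) {
  states <- c(0, 1, 3, 6)
  # Find the row of the transition matrix for the current number of lemons,
  # then sample next period's lemons using that row's probabilities.
  row <- match(lemons, states)
  sample(states, size = 1, prob = t[row, ])
}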

Exercise 5: In this exercise, you’ll write functions in R that implement the three policies we’re thinking about as candidates for the optimal policy: harvest1(), harvest3(), and harvest6(). They should take a number of lemons and return TRUE if you should harvest according to that policy rule and FALSE if not. So harvest3(1) should return FALSE because harvest3 means to wait until you have 3 lemons before you harvest. harvest3(3) should return TRUE, and harvest3(6) should also return TRUE.
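One compact way to write these three policies (there are certainly others):

harvest1 <- function(lemons) lemons >= 1
harvest3 <- function(lemons) lemons >= 3
harvest6 <- function(lemons) lemons >= 6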

If you’ve followed the steps above, you should now be able to run this code to simulate a game played using each candidate policy.

sim <- function(policy, t) {
  # This function takes a policy (harvest1, harvest3, or harvest6)
  # and a transition probability matrix t and it simulates playing
  # a game using that policy.
  
  bank <- 0 # Initialize your "bank" at 0 to count the money you make
  lemons <- sample(c(0, 1, 3, 6), size = 1) # and initialize the number of lemons
  
  while (sample(c(FALSE, TRUE), size = 1, prob = c(.1, .9))) { # Game continues w.p. the discount factor, 0.9
    if (policy(lemons)) { # If you harvest:
      bank <- bank + lemons # Add to your bank account
      lemons <- grow_lemons(0, t) # Begin next period with 0 lemons wp p0, 1 wp p1, and 3 wp p2.
    } else { # If you grow:
      lemons <- grow_lemons(lemons, t) # Your lemons increase
    }
  }
  # When the game ends, return the amount of money you've earned.
  return(bank)
}

sim(harvest1, t)
sim(harvest3, t)
sim(harvest6, t)
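Note that these calls assume a transition matrix t already exists in your environment. For example, using transition() from exercise 3 with the first set of probabilities from exercise 6:

t <- transition(0.8, 0.1, 0.1)
sim(harvest3, t)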

Exercise 6: Finally use map() (or a for loop storing your results) to run your simulation 1,000 times for each of the policy rules. Which policy earns you the most money on average when p0 = 0.8, p1 = 0.1, and p2 = 0.1? What about when p0 = 0.3, p1 = 0.5, and p2 = 0.2?
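Here’s one possible sketch using purrr’s map_dbl(); replicate() or a for loop would work just as well. The average over the 1,000 runs is the number to compare across policies:

library(purrr)

t <- transition(0.8, 0.1, 0.1)
avg_harvest1 <- mean(map_dbl(1:1000, ~ sim(harvest1, t))) # 1,000 simulated games per policy
avg_harvest3 <- mean(map_dbl(1:1000, ~ sim(harvest3, t)))
avg_harvest6 <- mean(map_dbl(1:1000, ~ sim(harvest6, t)))
c(avg_harvest1, avg_harvest3, avg_harvest6)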

The problem with the simulation method is that it’s not very efficient. It’s sort of a “brute-force” method. It works well when one policy is clearly much better than the others, but for certain transition probabilities where the returns to the optimal policy and the second-best policy are close, the simulation method may give you the wrong answer some of the time.

That’s why we’ll learn about a second solution method that’s much more precise: value iteration via dynamic programming.

4.5 Dynamic Programming

The first thing to know about dynamic programming is that the name is basically meaningless. Here’s an excerpt from Richard Bellman’s autobiography about that:

I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, ‘Where did the name, dynamic programming, come from?’ The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, ‘programming.’ I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying—I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities. – Bellman (1984)

So instead of dynamic programming, you may want to think of these concepts as mathematical research on multistage decision processes. To add to the confusion, the term dynamic programming has come to have a similar but distinct meaning in the field of computer science. In my experience at least, if you talk about dynamic programming with a computer scientist, they’re thinking of ways to use memoization (caching solutions to subproblems, often in a recursive algorithm). I mention this just to explain why a Google search for dynamic programming will yield really varied results.

4.5.1 The Bellman (the value function)

The dynamic programming approach works by finding the expected future value of being in each state, taking a certain action, and then behaving optimally thereafter. If the expected value of being in a certain state and playing option A is higher than the expected value of playing any other option, then playing option A is the optimal policy rule for that state.

It turns out that you can solve for the expected value function by guessing and then updating your guess a few thousand times: the updating step is a contraction mapping, so your guess will always eventually converge to the true value function.
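In equation form (this is my notation, not something you’ll need for the code below): write \(V(s)\) for the value of being in state \(s\), \(r(s, a)\) for the money you earn from taking action \(a\) in state \(s\), and \(T_a(s' \mid s)\) for the probability of moving to state \(s'\). The value function you’re looking for satisfies

\[\begin{equation} V(s) = \max_{a \in \{\text{grow},\, \text{harvest}\}} \left[ r(s, a) + \beta \sum_{s'} T_a(s' \mid s) V(s') \right] \end{equation}\]

and each round of guessing and updating just plugs your current guess for \(V\) into the right-hand side to produce a new guess on the left-hand side.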

Here’s how it works!

4.6 Value Iteration Algorithm

There are just 4 parts to this algorithm: value, value_harvest, value_grow, and future_value. They are all 4x1 matrices and I’ll explain each of them in turn.

4.6.1 value

First, value will be a 4x1 matrix that tells you how valuable it is to be in each state (there are 4 states: 0 lemons, 1, 3, and 6) assuming you behave optimally. To calculate value, we’ll take the elementwise maximum (pmax instead of max) of value_grow and value_harvest: we consider taking either action in each state and select whichever one gets us the higher value.

value <- pmax(value_grow, value_harvest)

4.6.2 value_harvest

value_harvest is a 4x1 matrix that tells you how much money you can expect to make if you were in each of the four states and you chose to harvest your lemons, and then you behaved optimally after that. For example, if you were in the state with 6 lemons and you harvested, you’d make $6 and then in the next period, you’d start over and you would have 0 lemons w.p. p0, 1 lemon w.p. p1, and 3 lemons w.p. p2. I’ll call the value of starting the process over future_value[1] = p0 * value[1] + p1 * value[2] + p2 * value[3].

So value_harvest for the state with 6 lemons will be 6 + beta * future_value[1]. We multiply the value of starting over by \(\beta = 0.9\) because we only get that value if we survive into that period, which happens w.p. \(\beta\).

value_harvest <- c(0, 1, 3, 6) + beta * future_value[1]

4.6.3 value_grow

value_grow is another 4x1 matrix. It tells you how much money you can expect to make if you were in each of the four states and you chose to grow (water) your lemons, and then you behaved optimally after that. Because you haven’t harvested, you don’t get any points in the current turn, but you set yourself up to earn more points later on because you’re growing more lemons. So you just get the (discounted by \(\beta\)) future value of being in each state.

value_grow <- beta * future_value

4.6.4 future_value

future_value is the final piece. It’s another 4x1 matrix and, as I mentioned in the part on value_harvest, it’s the amount you can expect to make into the future after leaving each state and growing your lemons. So future_value[1] is the amount you can expect to make after leaving the state with 0 lemons: it’s p0 * value[1] + p1 * value[2] + p2 * value[3], or in words, it’s the expected value of being in the states with 0, 1 and 3 lemons (weighted by the probabilities you’ll end up in each of those states).

If you let t be the transition probability matrix, this is the equation you’ll use to calculate future_value:

future_value <- t %*% value

Exercise 7: I’ve told you that future_value[1] = p0 * value[1] + p1 * value[2] + p2 * value[3]. What will future_value[2], future_value[3], and future_value[4] be?

4.7 Putting it all together

Exercise 8: In the while loop below, I work with 2 versions of value: value and value_next. I do this so that I can update it and then compare the two to check for convergence. Your job is to fill in the blanks to finish the while loop according to the rules above.

while (max(abs(value_next - value)) > .001) { # Continues until value and value_next converge
  value <- value_next
  future_value <- t %*% ___
  value_grow <- beta * ___
  value_harvest <- c(0, 1, 3, 6) + ___
  value_next <- pmax(___, ___)
}

Exercise 9: Take p0 = 0.8, p1 = 0.1, and p2 = 0.1. Start with guesses for value and value_next like matrix(1:4, nrow = 4) and matrix(2:5, nrow = 4). Both matrices need to exist, and they must not be equal, or the while loop above won’t start doing the value iteration. Now use the while loop to show that value converges to c(4, 5.46, 7, 10) for these transition probabilities.

Now you know how to use value iteration to find the value function! But how does the value function help you understand the optimal policy (whether you should grow or harvest in each state)?

Here’s the answer: you can use value to calculate future_value, value_harvest, and value_grow one last time. Recall that value_harvest is the value of being in each state, harvesting the lemons, and behaving optimally from then on out. And value_grow is the value of being in each state, growing the lemons, and then behaving optimally. So in each state where value_harvest > value_grow, you should harvest instead of grow the lemons.
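In code, assuming value is the converged value function from exercise 9, t is the matching transition matrix, and beta is the discount factor of 0.9, that last step might look like this:

beta <- 0.9 # discount factor (if you haven't already defined it)
future_value <- t %*% value
value_grow <- beta * future_value
value_harvest <- c(0, 1, 3, 6) + beta * future_value[1]
policy <- ifelse(value_harvest > value_grow, "harvest", "grow")
policy # One entry per state: 0, 1, 3, and 6 lemons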

Exercise 10: Continuing from exercise 9, find the optimal policy when p0 = 0.8, p1 = 0.1, and p2 = 0.1.

Exercise 11: Now find the optimal policy when p0 = 0.3, p1 = 0.5, and p2 = 0.2. Do your answers to these questions match your simulation solution (exercise 6 of this chapter)?

4.8 References

Bellman, Richard. 1984. Eye of the Hurricane: An Autobiography. World Scientific.