4.3 Dynamics Part 2
Lemon Tree Problem
Transition Probabilities
Consider a new version of the lemon tree Markov Chain. The first row/column corresponds to having 0 lemons, the second corresponds to having 1 lemon, and the third and fourth correspond to having 3 and 6 lemons, respectively. Having 6 lemons is absorbing, and it’s impossible to get exactly 2, 4, or 5 lemons. Basically, what’s going on is that the more lemons there are on the tree, the more lemons will grow (up to a maximum of 6).
\[\begin{equation} T_{water} = \begin{bmatrix} p0 & p1 & p2 & 0\\ 0 & p0 & p1 & p2\\ 0 & 0 & p0 & 1 - p0\\ 0 & 0 & 0 & 1\\ \end{bmatrix} \end{equation}\]
So if you water your tree and you had 0 lemons on it, next period you’ll have 0 lemons w.p. (“with probability”) p0, you’ll have 1 lemon w.p. p1, and you’ll have 3 lemons w.p. p2.
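For concreteness, here’s a sketch of how you might type this matrix into R. The values p0 = 0.8, p1 = 0.1, and p2 = 0.1 are just example numbers for illustration:

```r
# Example probabilities (assumptions for illustration only)
p0 <- 0.8; p1 <- 0.1; p2 <- 0.1

# T_water, filled in row by row
T_water <- matrix(c(p0, p1, p2,      0,
                    0,  p0, p1,     p2,
                    0,  0,  p0, 1 - p0,
                    0,  0,  0,       1),
                  nrow = 4, byrow = TRUE)

rowSums(T_water)  # each row should sum to 1
```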
If you harvest your lemons, you reset the process, but I’ll make one assumption to simplify our math later on: after harvesting, you end up with 0 lemons on your tree only w.p. p0 (you get 1 or 3 lemons w.p. p1 and p2, respectively). So just to emphasize once again: harvesting can result in 0, 1, or 3 lemons on your tree the very next period.
Selling Lemons
Suppose also that you can sell your lemons when you harvest them for $1 per lemon. So if you harvest when you have 1 lemon on your tree, you can make $1. And if you harvest when you have 6 lemons on your tree, you can make $6.
The game will continue for an infinite number of periods. But I’ll introduce a discount factor of \(\beta = 0.9\) so that your expected earnings from playing the game converge. The discount factor is a measure of time preference (“patience”), but you can also think of it as simply the probability the game continues from one period to the next. So while the game theoretically continues for an infinite number of periods, you only survive from one period to the next w.p. \(\beta = 0.9\).
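As a quick numerical illustration of how discounting works (the payoff and the timing here are made up):

```r
beta <- 0.9
beta^5 * 6  # a $6 payoff 5 periods from now is worth about $3.54 today
```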
The question is: what’s the optimal policy rule in the lemon tree game? That is, given a number of lemons on your tree, what should you do to maximize the amount of money you can earn?
Should you harvest when you have 6 lemons? Yes: you can’t accumulate more than 6 lemons, so watering at that point would be a wasted turn. You should harvest, earn $6 ($1 per lemon), and start over accumulating more lemons.
But should you also harvest when you have 3 lemons on your tree? Maybe: you can earn $3 now, but you’d reset your lemons and give up being able to sell 6 later on.
Exercise 1: Think about this problem a little and write down your thoughts. Your candidates for the optimal policy might be these: harvest1, harvest3, and harvest6, which correspond to “harvest as soon as you have at least 1 lemon”, “harvest as soon as you have at least 3 lemons”, and “wait until you have 6 lemons to harvest”. Does the expected return to harvest6 depend positively or negatively on p0, p1, or p2?
Exercise 2: Would the optimal policy ever be something different from harvest1, harvest3, or harvest6? For instance, consider a policy where you water until you get 3 lemons and then water just one more time to try for 6. If you fail to get 6, you harvest on your next turn. Could that policy ever be optimal?
Exercise 3: Write a function transition() that takes probabilities p0, p1, and p2 (assume they sum to 1) and returns a probability transition matrix populated with those probabilities. Make sure it looks like \(T_{water}\) from above. Here’s a starter:

```r
transition <- function(p0, p1, p2) {
  # matrix(___)
}
# t <- transition(.8, .1, .1)
```
The Value Function
The value function is the expected value going into the future of being in each state and playing a certain policy. It turns out that you can solve for the value function by guessing and then updating your guess a few thousand times: the update step is a contraction mapping, so your guesses will always eventually converge to the true value function.
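To see why this kind of guess-and-update scheme converges, here’s a minimal one-dimensional sketch (the numbers are just for illustration): repeatedly applying the update v ← 1 + 0.9v squeezes any starting guess toward the fixed point v = 1 / (1 − 0.9) = 10.

```r
# A scalar contraction: v <- 1 + 0.9 * v has fixed point v = 10
v <- 0                          # any starting guess works
for (i in 1:100) v <- 1 + 0.9 * v
v                               # approximately 10
```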
Value Iteration Algorithm
There are just 4 parts to this algorithm: value, value_harvest, value_grow, and future_value. They are all 4x1 matrices, and I’ll explain each of them in turn.
value
First, value will be a 4x1 matrix that tells you how valuable it is to be in each state (there are 4 states: 0 lemons, 1, 3, and 6), assuming you behave optimally thereafter. To calculate value, we’ll take the elementwise maximum (pmax instead of max) of value_grow and value_harvest: we consider taking either action in each state, and we select the one that gets us the higher value.

```r
value <- pmax(value_grow, value_harvest)
```
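If the distinction between pmax and max is unfamiliar, here’s a quick illustration (the numbers are arbitrary):

```r
max(c(1, 5), c(3, 2))   # 5: a single overall maximum
pmax(c(1, 5), c(3, 2))  # 3 5: elementwise ("parallel") maxima
```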
value_harvest
value_harvest is a 4x1 matrix that tells you how much money you can expect to make if you were in each of the four states and you chose to harvest your lemons, and then you behaved optimally after that. For example, if you were in the state with 6 lemons and you harvested, you’d make $6, and then in the next period you’d start over: you’d have 0 lemons w.p. p0, 1 lemon w.p. p1, and 3 lemons w.p. p2. I’ll call the value of starting over the process future_value[1], where future_value[1] = p0 * value[1] + p1 * value[2] + p2 * value[3].

So value_harvest for the state with 6 lemons will be 6 + beta * future_value[1]. We multiply the value of starting over by the discount factor \(\beta = 0.9\) because we only get that value if we survive into that period, which happens w.p. \(\beta\).

```r
value_harvest <- c(0, 1, 3, 6) + beta * future_value[1]
```
value_grow
value_grow is another 4x1 matrix. It tells you how much money you can expect to make if you were in each of the four states and you chose to grow (water) your lemons, and then you behaved optimally after that. Because you haven’t harvested, you don’t get any payoff in the current turn, but you set yourself up to earn more payoffs later on because you’re growing more lemons. So you just get the (discounted by \(\beta\)) future value of being in each state.

```r
value_grow <- beta * future_value
```
future_value
future_value is the final piece. It’s another 4x1 matrix and, as I mentioned in the part on value_harvest, it’s the amount you can expect to make into the future after leaving each state and growing your lemons. So future_value[1] is the amount you can expect to make after leaving the state with 0 lemons: it’s p0 * value[1] + p1 * value[2] + p2 * value[3], or in words, it’s the expected value of being in the states with 0, 1, and 3 lemons (weighted by the probabilities you’ll end up in each of those states).

If you let t be the transition probability matrix, this is the equation you’ll use to calculate future_value:

```r
future_value <- t %*% value
```
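To convince yourself that t %*% value takes exactly these probability-weighted averages row by row, here’s a small self-contained check. It uses the same example probabilities as before (p0 = .8, p1 = .1, p2 = .1), and the value vector is made up purely for illustration:

```r
# The example T_water from before
t <- matrix(c(.8, .1, .1,  0,
               0, .8, .1, .1,
               0,  0, .8, .2,
               0,  0,  0,  1),
            nrow = 4, byrow = TRUE)

# A made-up value vector, for illustration only
value <- matrix(c(10, 12, 15, 20), nrow = 4)

t %*% value  # first entry: .8*10 + .1*12 + .1*15 + 0*20 = 10.7
```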
Exercise 4: I’ve told you that future_value[1] = p0 * value[1] + p1 * value[2] + p2 * value[3]. What will future_value[2], future_value[3], and future_value[4] be?
Putting it all together
Exercise 5: In the while loop below, I work with 2 versions of value: value and value_update. I do this so that I can update it and then compare the two to check for convergence. Your job is to fill in the blanks to finish the while loop according to the rules above.

```r
while (max(abs(value_update - value)) > .001) { # Continues until value and value_update converge
  value         <- value_update
  future_value  <- t %*% ___
  value_grow    <- beta * ___
  value_harvest <- c(0, 1, 3, 6) + ___
  value_update  <- pmax(___, ___)
}
```
Exercise 6: Take p0 = 0.8, p1 = 0.1, and p2 = 0.1. Start with guesses for value and value_update like matrix(rep(0, 4), nrow = 4) and matrix(rep(1, 4), nrow = 4). The two matrices need to exist, and they need to not be equal, before the while loop above will begin doing the value iteration. Now use the while loop to find value.
Now you know how to use value iteration to find the value function! But how does the value function help you understand the optimal policy (whether you should grow or harvest in each state)?
Here’s the answer: recall that value_harvest is the value of being in each state, harvesting the lemons, and behaving optimally from then on out. And value_grow is the value of being in each state, growing the lemons, and then behaving optimally. So in each state where value_harvest > value_grow, you should harvest instead of grow the lemons.
Exercise 7: Continuing from exercise 6, find the optimal policy when p0 = 0.8, p1 = 0.1, and p2 = 0.1.
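If you want to check your answers to exercises 5 through 7, here’s one possible end-to-end sketch. It follows the rules above exactly; the probabilities and starting guesses are the ones given in exercises 6 and 7.

```r
beta <- 0.9

# Exercise 3: the transition matrix T_water
transition <- function(p0, p1, p2) {
  matrix(c(p0, p1, p2,      0,
           0,  p0, p1,     p2,
           0,  0,  p0, 1 - p0,
           0,  0,  0,       1),
         nrow = 4, byrow = TRUE)
}
t <- transition(.8, .1, .1)  # note: this name masks base R's t() within the script

# Exercise 6: starting guesses (they just need to exist and differ)
value        <- matrix(rep(0, 4), nrow = 4)
value_update <- matrix(rep(1, 4), nrow = 4)

# Exercise 5: value iteration
while (max(abs(value_update - value)) > .001) {
  value         <- value_update
  future_value  <- t %*% value
  value_grow    <- beta * future_value
  value_harvest <- c(0, 1, 3, 6) + beta * future_value[1]
  value_update  <- pmax(value_grow, value_harvest)
}

value                       # the value function
value_harvest > value_grow  # Exercise 7: TRUE means "harvest" in that state
```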