The optimal action at time t is therefore given by the value of

a_t that maximizes the Q-function.

This is sometimes referred to as a greedy policy.

It just picks a current action that maximizes the Q-function without

worrying how this action will impact other actions in the future.
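As an illustrative sketch (not the course's actual implementation), a greedy policy simply picks the action with the highest Q-value at the current state, over some candidate set of actions. The Q-function and the grid of candidate hedge positions below are made up for this example:

```python
import numpy as np

# Hypothetical Q-function for illustration: concave quadratic in the action a,
# with a state-dependent linear coefficient (all names here are made up).
def q_function(state, a):
    c2, c1, c0 = -2.0, 1.0 * state, 0.5   # c2 < 0, so Q is concave in a
    return c2 * a**2 + c1 * a + c0

def greedy_action(state, action_grid):
    """Greedy policy: take the action maximizing Q at the current state,
    without considering its effect on future actions."""
    q_values = np.array([q_function(state, a) for a in action_grid])
    return action_grid[int(np.argmax(q_values))]

actions = np.linspace(-1.0, 1.0, 201)    # candidate hedge positions
best = greedy_action(state=1.0, action_grid=actions)
```

Because the policy is purely myopic, each time step is a separate one-dimensional maximization.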

Now, if we substitute the explicit form of the reward function

r into the Bellman optimality equation, we get this equation.

What this equation shows is that the Q-function is quadratic in the action a_t.

Therefore, it's easy to maximize with respect to a_t.
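Since a concave quadratic has a closed-form maximizer, the maximization over a_t can be done analytically rather than by grid search. A minimal sketch with made-up coefficient values:

```python
def argmax_quadratic(c2, c1):
    """Maximizer of Q(a) = c2*a**2 + c1*a + c0 with c2 < 0: a* = -c1 / (2*c2).
    (The constant c0 does not affect the location of the maximum.)"""
    assert c2 < 0, "Q must be concave in the action for a finite maximum"
    return -c1 / (2.0 * c2)

a_star = argmax_quadratic(c2=-3.0, c1=1.5)
```

In the model discussed here, c2 and c1 would be state-dependent quantities computed from the data; the values above are purely illustrative.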

We can also check what happens with these formulas if we

take the limit of the risk-aversion parameter lambda going to zero.

In this limit, we get the first equation shown here.

But now we can replace the optimal Q-function with minus

the portfolio value whose expectation is exactly the mean

option price C hat, as we discussed earlier.

Therefore, by using this and flipping the overall sign,

the first equation can be written as the second one.

But the second equation is something that we already saw.

It's a recursive relation for the mean option price that has the right

Black-Scholes limit when the time steps delta t are very small.
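The claim that a backward recursion for the option price recovers the Black-Scholes price for small time steps can be checked numerically. The sketch below uses a standard Cox-Ross-Rubinstein binomial recursion as a stand-in (not the QLBS formulas themselves) and compares it with the closed-form Black-Scholes call price; all parameter values are illustrative:

```python
import math

def crr_call(S0, K, r, sigma, T, n_steps):
    """Backward recursion C_t = e^{-r*dt} * E[C_{t+1}] on a binomial tree."""
    dt = T / n_steps
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r * dt) - d) / (u - d)   # risk-neutral up-probability
    disc = math.exp(-r * dt)
    # Terminal call payoffs at the tree's final nodes.
    values = [max(S0 * u**j * d**(n_steps - j) - K, 0.0)
              for j in range(n_steps + 1)]
    # Step backward through the tree, discounting expected values.
    for _ in range(n_steps):
        values = [disc * (p * values[j + 1] + (1 - p) * values[j])
                  for j in range(len(values) - 1)]
    return values[0]

def bs_call(S0, K, r, sigma, T):
    """Closed-form Black-Scholes price of a European call."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return S0 * N(d1) - K * math.exp(-r * T) * N(d2)

approx = crr_call(100, 100, 0.05, 0.2, 1.0, 500)
exact = bs_call(100, 100, 0.05, 0.2, 1.0)
```

As n_steps grows (so delta t shrinks), the recursion's price converges to the Black-Scholes value, mirroring the limit described above.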