Previously, we learned about

how optimal policies achieve the goal of

reinforcement learning to obtain as

much reward as possible in the long run.

In this video, we'll describe

the related notion of an optimal value function,

and introduce an associated set of Bellman equations.

By the end of this video,

you will be able to derive

the Bellman optimality equation

for the state-value function,

derive the Bellman optimality equation

for the action-value function,

and understand how the Bellman optimality equations

relate to the previously introduced Bellman equations.

Recall that for two policies Pi_1 and Pi_2,

Pi_1 is considered as good as or

better than Pi_2 if and only if

the value under Pi_1 is greater than or

equal to the value under Pi_2 for all states.

An optimal policy is one which is as

good as or better than every other policy.

The value function for the optimal policy

thus has the greatest value possible in every state.

We can express this mathematically,

by writing that v Pi star of S is

equal to the maximum value over all policies.

This holds for every state in our state-space.
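In the usual notation, where p subscript Pi star denotes the value function of an optimal policy, this definition reads:

```latex
v_{\pi_*}(s) = \max_{\pi} v_\pi(s), \quad \text{for all } s \in \mathcal{S}
```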

Taking a maximum over policies might not be intuitive.

So let's take a moment to break down what it means.

Imagine we were to consider

every possible policy and

compute each of their values for the state

S. The value of an optimal policy

is defined to be the largest of all the computed values.

We could repeat this for every state and the value of

an optimal policy would always be the largest.

All optimal policies have

this same optimal state-value function,

which we denote by v star.

Optimal policies also share

the same optimal action-value function,

which is again the maximum

possible for every state-action pair.

We denote this shared action value function by q star.
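In symbols, the optimal action-value function is defined the same way, with the maximum taken over all policies for each state-action pair:

```latex
q_*(s, a) = \max_{\pi} q_\pi(s, a), \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}
```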

Recall the Bellman equation for the state value function.

This equation holds for the value function of

any policy including an optimal policy.

By substituting the optimal policy Pi star

into this Bellman equation,

we get the Bellman equation for v star.
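In standard notation, that substituted equation reads:

```latex
v_{\pi_*}(s) = \sum_{a} \pi_*(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma\, v_{\pi_*}(s') \big]
```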

So far, we haven't done anything special.

This is simply the Bellman equation we introduced

previously for the specific case of an optimal policy.

However, because this is an optimal policy,

we can rewrite the equation in a special form,

which doesn't reference the policy itself.

Remember there always exists

an optimal deterministic policy,

one that selects an optimal action in every state.

Such a deterministic optimal policy

will assign probability 1 to

an action that achieves the highest value,

and probability 0 to all other actions.
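Under the usual convention (with ties among maximizing actions broken arbitrarily), such a deterministic optimal policy can be written as:

```latex
\pi_*(a \mid s) =
\begin{cases}
1 & \text{if } a = \operatorname{argmax}_{a'} q_*(s, a') \\
0 & \text{otherwise}
\end{cases}
```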

We can express this another way by replacing

the sum over Pi star with a max over a.

Notice that Pi star no longer appears in the equation.

We have derived a relationship that

applies directly to v star itself.

We call this special form,

the Bellman optimality equation for v star.
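In standard notation, the Bellman optimality equation for v star is:

```latex
v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma\, v_*(s') \big]
```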

We can make the same replacement in

the Bellman equation for the action-value function.

Here the optimal policy appears in the inner sum.
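For reference, at this point the equation still contains Pi star in the inner sum over next actions:

```latex
q_{\pi_*}(s, a) = \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \sum_{a'} \pi_*(a' \mid s')\, q_{\pi_*}(s', a') \Big]
```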

Once again, we replace the sum over

Pi star with a max over a.

This gives us the Bellman optimality equation for q star.
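Written out, the Bellman optimality equation for q star is:

```latex
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \max_{a'} q_*(s', a') \Big]
```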

In an earlier lecture,

we discussed how the Bellman equations form

a linear system of equations that

can be solved by standard methods.
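As a concrete sketch of that idea, here is a tiny made-up example (the MDP numbers are illustrative, not from the lecture): for a fixed policy, the Bellman equation v = r + gamma * P v is linear in v, so it can be rearranged and solved directly with standard linear algebra.

```python
import numpy as np

# A tiny, made-up 2-state MDP with a fixed policy already folded in:
# P[s, s'] is the transition probability under the policy,
# r[s] is the expected one-step reward from state s.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
gamma = 0.9

# The Bellman equation v = r + gamma * P @ v is linear in v,
# so rearrange to (I - gamma * P) v = r and solve directly.
v = np.linalg.solve(np.eye(2) - gamma * P, r)

# Sanity check: v satisfies the Bellman equation for this policy.
assert np.allclose(v, r + gamma * P @ v)
```

No such direct solve is available once a max over actions enters the equation, which is the point made next.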

The Bellman optimality equation gives us

a similar system of equations for the optimal value.

One natural question is,

can we solve this system in

a similar way to find the optimal state-value function?

Unfortunately, the answer is no.

Taking the maximum over

actions is not a linear operation.

So standard techniques from linear algebra

for solving linear systems won't apply.
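A quick illustration of why the max breaks linearity (a toy numerical check, not from the lecture): the max of a sum is generally not the sum of the maxes.

```python
# Two small value vectors, indexed by action.
u = [1.0, -2.0]
w = [-3.0, 4.0]

# Max of the elementwise sum: max(1 - 3, -2 + 4) = max(-2, 2) = 2.
lhs = max(a + b for a, b in zip(u, w))

# Sum of the separate maxes: 1 + 4 = 5.
rhs = max(u) + max(w)

# A linear operator would make these equal; max does not.
assert lhs != rhs
```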

In this course, we will not form and solve

systems of equations in the usual way.

Instead, we will use other techniques based on

the Bellman equations to compute

value functions and policies.

You might be wondering why we can't simply use Pi star in

the ordinary Bellman equation to get

a system of linear equations for v star.

The answer is simple.

We don't know Pi star.

If we did, then we would have already

achieved the fundamental goal of reinforcement learning.

If we can manage to solve

the Bellman optimality equation for v star,

we can use the result to obtain Pi star fairly easily.

We will see how this can be done in the next video.

So we introduced optimal value functions

and derived the associated Bellman optimality equations.

The Bellman optimality equations relate

the value of a state or state-action pair,

to its possible successors under any optimal policy.

See you next time, where we'll learn how to find

the optimal policy from the optimal value function.