Many problems involve some amount of delayed reward.

A store manager could lower their prices and sell off

their entire inventory to maximize short term gain.

But they would do better in the long run by

maintaining inventory to sell when the demand is high.

In reinforcement learning, reward

captures the notion of short-term gain.

The objective however, is to learn

a policy that achieves the most reward in the long run.

Value functions formalize what this means.

By the end of this video, you'll be able to:

describe the roles of

the state value and action value functions

in reinforcement learning,

describe the relationship between

value functions and policies,

and create examples of value functions for a given MDP.

Roughly speaking, a state value function

is the future award

an agent can expect to

receive starting from a particular state.

More precisely, the state value function

is the expected return from a given state.

The agent's behavior will also determine

how much total reward it can expect.

So a value function is

defined with respect to a given policy.

The subscript Pi indicates

the value function is contingent

on the agent selecting actions according to Pi.

Likewise, a subscript Pi on the expectation

indicates that the expectation is

computed with respect to the policy Pi.
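
As a concrete reference, the definition being described here can be written as

v_\pi(s) = \mathbb{E}_\pi[ G_t \mid S_t = s ],

where G_t denotes the return from time step t and S_t the state at time t (this notation is the usual convention, assumed here since the slide itself is not reproduced in the transcript).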

We can also define an action value function.

An action value describes what happens

when the agent first selects a particular action.

More formally, the action value of a state and action is

the expected return if the agent first selects action

A in state S and then follows policy Pi.
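
Under the same assumed notation, with A_t the action at time t, this is

q_\pi(s, a) = \mathbb{E}_\pi[ G_t \mid S_t = s, A_t = a ].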

Value functions are crucial in reinforcement learning;

they allow an agent to query the quality of

its current situation instead of

waiting to observe the long-term outcome.

The benefit is twofold.

First, the return is not

immediately available and second,

the return may be random due to stochasticity in

both the policy and environment dynamics.

The value function summarizes

all the possible futures by averaging over returns.

Ultimately, we care most about learning a good policy.

Value functions enable us to judge

the quality of different policies.

For example, consider an agent playing the game of chess.

Chess is an episodic MDP:

the state is given by

the positions of all the pieces on the board,

the actions are the legal moves,

and termination occurs when the game

ends in either a win, loss, or draw.

We could define the reward as plus one for

winning and zero for all the other moves.

This reward does not tell us much about

how well the agent is playing during the match;

we'll have to wait until the end of

the game to see any non-zero reward.

The value function tells us much more.

The state value is equal to

the expected sum of future rewards.

Since the only possible non-zero reward

is plus one for winning,

the state value is simply the probability of

winning if we follow the current policy Pi.
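
To spell out that step: assuming the episode is undiscounted and the only non-zero reward is plus one on a win, the return is one when the agent wins and zero otherwise, so

v_\pi(s) = \mathbb{E}_\pi[ G_t \mid S_t = s ] = 1 \cdot \Pr(\text{win} \mid S_t = s) + 0 \cdot \Pr(\text{no win} \mid S_t = s) = \Pr(\text{win} \mid S_t = s),

with the probabilities taken under the policy Pi.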

In this two-player game,

the opponent's move is part of the state transition.

For example, the environment moves both the agent's piece,

circled in blue, and

the opponent's piece, circled in red.

This puts the board into a new state, S prime.

Note, the value of state S prime is

lower than the value of state S. This means

we are less likely to win the game from

this new state assuming we continue following policy Pi.

An action value function would allow

us to assess the probability of winning

for each possible move given

we follow the policy Pi for the rest of the game.

To build some intuition,

let's look at a simple continuing MDP.

The states are defined by the locations on the grid;

the actions move the agent up,

down, left, or right.

The agent cannot move off the grid

and bumping generates a reward of minus one.

Most other actions yield no reward.

There are two special states, however;

these special states are labeled A and B.

Every action in state A yields a reward of plus 10,

and every action in state B yields plus five.

Every action in states A and B transitions the agent

to states A prime and B prime respectively.

Remember, we must specify the policy before we

can figure out what the value function is.

Let's look at the uniform random policy.

Since this is a continuing task,

we need to specify Gamma,

let's go with 0.9.

Later, we will learn several ways to

compute and estimate the value function,

but this time we'll be nice to you and compute it for you.
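
As a rough illustration of how such a computation could go, here is a minimal sketch of iterative policy evaluation for this gridworld under the uniform random policy, with Gamma equal to 0.9. The five-by-five layout and the exact positions of A, A prime, B, and B prime are assumptions borrowed from the standard version of this example; they are not spelled out in the transcript.

import numpy as np

SIZE = 5
GAMMA = 0.9
# Assumed layout: A at (0, 1) jumps to A' at (4, 1); B at (0, 3) jumps to B' at (2, 3).
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Return (next_state, reward) for taking `action` in `state`."""
    if state == A:
        return A_PRIME, 10.0   # every action in A pays +10 and moves to A'
    if state == B:
        return B_PRIME, 5.0    # every action in B pays +5 and moves to B'
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < SIZE and 0 <= c < SIZE:
        return (r, c), 0.0     # ordinary move, no reward
    return state, -1.0         # bumping into the wall: stay put, reward -1

# Evaluate the uniform random policy by sweeping the Bellman expectation update
# v(s) <- sum_a pi(a|s) * (reward + gamma * v(s')) until the values stop changing.
v = np.zeros((SIZE, SIZE))
while True:
    v_new = np.zeros_like(v)
    for r in range(SIZE):
        for c in range(SIZE):
            for action in ACTIONS:
                (nr, nc), reward = step((r, c), action)
                v_new[r, c] += 0.25 * (reward + GAMMA * v[nr, nc])
    if np.abs(v_new - v).max() < 1e-4:
        v = v_new
        break
    v = v_new

print(np.round(v, 1))

Under the assumed layout, this reproduces the pattern described next: negative values near the bottom of the grid, a value below plus 10 at A, and a value slightly above plus five at B.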

On the right, we have written the value of each state.

First, notice the negative values near the bottom;

these values are low

because the agent is likely to bump into

the wall before reaching the distant states A and B.

Remember, A and B are

the only sources of positive reward in this MDP.

State A has the highest value.

Notice that the value is less than 10 even though

every action from state A generates

a reward of plus 10. Why?

Because every transition from A moves the agent

closer to the lower wall, and near the lower wall,

the random policy is likely

to bump and get negative reward.

On the other hand, the value of

state B is slightly greater than five.

The transition from B moves the agent to the middle.

In the middle, the agent is unlikely to

bump and is close to the high-valued states A and B.

It's really quite amazing how the value function

compactly summarizes all these possibilities.

In this video, we introduced the definitions

of state and action value functions.

Soon, we will discuss how

value functions can be computed.

For now, you should understand

that a state value function

refers to the expected return from

a given state under a specific policy,

and an action value function

refers to the expected return from

a given state after selecting

a particular action and then following a given policy.