In the previous video,

we discussed episodic problems.

In many problems however,

the agent-environment interaction continues without end.

Today, we will see how such problems can

be formulated as continuing tasks.

In this video, you will learn to

differentiate between episodic and continuing tasks,

formulate returns for continuing tasks using discounting,

and describe how returns at

successive time steps are related to each other.

Let's look at the differences between

episodic and continuing tasks.

As we discussed earlier,

episodic tasks break up into episodes.

Every episode in an episodic task

must end in a terminal state.

The next episode begins

independently of how the last episode ended.

The return at time step t is

the sum of rewards until termination.

In contrast, continuing tasks

cannot be broken up into independent episodes.

The interaction goes on continually.

There are no terminal states.

To make this more concrete,

consider a smart thermostat

which regulates the temperature of a building.

This can be formulated as a continuing task

since the thermostat never

stops interacting with the environment.

The state could be the current temperature

along with details of

the situation like the time of

day and the number of people in the building.

There are just two actions,

turn on the heater or turn it off.

The reward is minus one every time someone has to

manually adjust the temperature, and zero otherwise.

To avoid negative reward,

the thermostat would learn to

anticipate the user's preferences.
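The thermostat's reward signal described above could be sketched like this. The function name and boolean argument are hypothetical, chosen just to illustrate the idea; they are not part of any real thermostat API:

```python
# Hypothetical sketch of the thermostat's reward signal:
# -1 whenever a person manually adjusts the temperature, 0 otherwise.
def thermostat_reward(manual_adjustment: bool) -> int:
    """Return -1 if the user had to intervene, else 0."""
    return -1 if manual_adjustment else 0
```

A thermostat that maximizes this reward is driven to anticipate preferences, since any manual adjustment costs it a point.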

So how can we formulate the return for continuing tasks?

We can try to sum up

all the future rewards as we did for episodic tasks.

But now, we're summing over an infinite sequence.

This return might not be finite.

So how can we modify this sum so that it is always finite?

One solution is to discount

future rewards by a factor

Gamma called the discount rate.

Gamma is at least zero,

but less than one.

The return formulation can then be

modified to include discounting.
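As a minimal sketch, the discounted return over a finite list of rewards could be computed like this, truncating the infinite sum (the function name is illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum over k of gamma**k * R_{t+k+1}
    for a finite list of future rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Example: rewards [1, 1, 1] with gamma = 0.5
# give 1 + 0.5 + 0.25 = 1.75
```

With gamma equal to zero, only the first reward in the list survives, matching the shortsighted case discussed later.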

The effect of discounting on the return is simple:

immediate rewards contribute more to the sum.

Rewards far into the future contribute

less because they are multiplied by

Gamma raised to

successively larger powers. Intuitively,

this choice makes sense.

A dollar today is worth more

to you than a dollar in a year.

We can concisely write this sum as this expression,

which is guaranteed to be finite. Let's see why.

Assume R_max is the maximum reward

our agent can receive at any time step.

We can now upper bound the return

G_t by replacing every reward with R_max.

Since R_max is just a

constant, we can pull it out of the summation.

Note that the second factor is

just a geometric series and

the geometric series evaluates to

one divided by one minus Gamma.

R_max times one divided by one minus Gamma is

finite and is an upper bound on G_t.

So we know G_t is finite.
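A quick numerical check of this bound, assuming constant rewards equal to R_max (a sketch for illustration, not part of the lecture):

```python
def truncated_return(r_max, gamma, n_steps):
    """Partial sum of r_max * gamma**k for k = 0 .. n_steps - 1."""
    return sum(r_max * gamma**k for k in range(n_steps))

r_max, gamma = 1.0, 0.9
bound = r_max / (1 - gamma)                   # 1 / (1 - 0.9) = 10.0
partial = truncated_return(r_max, gamma, 1000)
# The partial sum approaches, but never exceeds, the bound.
```

After 1000 terms the partial sum is already within a tiny fraction of 10, illustrating that the infinite sum converges to R_max / (1 - Gamma).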

Now, let's look at the effect of

the discount factor on the behavior of the agent.

We can look at the two extreme cases

when Gamma equals zero and when Gamma approaches one.

When Gamma equals zero, the return is

just the reward at the next time step.

So the agent is shortsighted and only

cares about immediate expected reward.

On the other hand,

when Gamma approaches one,

the immediate and future rewards are

weighted nearly equally in the return.

The agent in this case is more farsighted.
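The two extremes can be illustrated with a short sketch. The reward sequence here is made up for the example: a single large reward arriving three steps in the future.

```python
def discounted_return(rewards, gamma):
    """Truncated discounted return: sum of gamma**k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 10]  # a large reward arrives three steps later

# Shortsighted: gamma = 0 sees only the immediate reward.
short = discounted_return(rewards, 0.0)   # 0

# Farsighted: gamma close to 1 weights the delayed reward nearly fully.
far = discounted_return(rewards, 0.99)    # 0.99**3 * 10, about 9.7
```

The shortsighted agent values this future reward at zero, while the farsighted agent values it at nearly its full amount.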

Finally, let's discuss

a simple but important property of the return.

It can be written recursively.

Let's factor out Gamma starting

from the second term in our sum.

Amazingly, the sequence in

parentheses is the return at the next time step.

So we can just replace it with G_t plus 1.

Now, we have a recursive equation with G_t on

the left and G_t plus 1 on the right.

This simple equation is more powerful than it seems.
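We can check the recursion G_t = R_{t+1} + Gamma * G_{t+1} numerically on a finite reward sequence (a sketch; the rewards are arbitrary example values):

```python
def discounted_return(rewards, gamma):
    """Truncated discounted return: sum of gamma**k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 2.0, 3.0, 4.0]
gamma = 0.9

g_t = discounted_return(rewards, gamma)        # return from time t
g_t1 = discounted_return(rewards[1:], gamma)   # return from time t + 1

# The recursion holds: G_t equals R_{t+1} + gamma * G_{t+1}.
assert abs(g_t - (rewards[0] + gamma * g_t1)) < 1e-9
```

This identity relates the return at one step to the return at the next, and it is exactly what later algorithms exploit.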

In future videos, we'll exploit

this equation to design learning algorithms.

To recap, we learned about continuing tasks where

the agent-environment interaction goes on indefinitely.

Discounting is used to ensure returns are finite and

we saw that returns can be defined

recursively. See you next time.