For the next few minutes, I want you to get distracted

from all those banner ad placements and

small channel banners and consider

a more intuitive notion of how those exploration strategies work.

And as a result,

we'll finally find out how the Epsilon-Greedy strategy

sucks not only at the theoretical level, but also in practice.

Imagine you are solving this labyrinth,

the one on the slide,

and your initial state is here,

the bottom left corner,

while the terminal state at which you get positive reward is at the top right.

And let's say you're solving this Markov decision process,

with discounted reward setting,

with gamma equals 0.99.

The only transition which actually gives you a reward is exiting this labyrinth.

So, getting out from this top right corner,

going upwards from the top right corner,

gives you a reward, say plus 100.

Let's also say that you could get minus one if you bump into walls, but for now,

let's stick with this single non-zero reward; all other actions are worth zero.

Of course, the optimal policy is to follow the trajectory of the

shortest path between the bottom left and top right corners,

but let's imagine how hard it is to learn this trajectory by trial and error.
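Just to make that concrete: with gamma equal to 0.99, the discounted return of reaching the exit reward after T steps is 0.99 to the power of (T minus 1), times 100, so shorter paths score strictly higher. A tiny sketch (the path lengths 14 and 50 are made-up illustrative numbers, not properties of the labyrinth on the slide):

```python
gamma = 0.99

def discounted_return(T, reward=100.0):
    """Discounted return if the only non-zero reward arrives on step T."""
    return gamma ** (T - 1) * reward

# A short path beats a meandering one under discounting:
print(discounted_return(14))  # shortest-path-like trajectory
print(discounted_return(50))  # a wandering trajectory, worth less
```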

Imagine you employ Q-learning here,

and using an Epsilon-Greedy strategy with maximal exploration,

say an epsilon of one.

So you're always taking a random action,

you want to explore as fast as possible,

and you expect it to converge to a strategy that follows the optimal path.
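As a reminder of what that strategy actually does, here is a minimal sketch of Epsilon-Greedy action selection (the function name and the list-of-Q-values interface are just illustrative assumptions):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon take a uniformly random action,
    otherwise take the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# epsilon = 1 means every single action is random, as in our labyrinth setup.
action = epsilon_greedy_action([0.0, 0.0, 0.0, 0.0], epsilon=1.0)
```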

The question is, how long does it take to learn this?

How long does it take to converge to some policy,

which gets at least non-zero reward?

Yes. Again, too bad.

It's a hell of a lot.

It's going to be incomparably larger than the optimal path.

Okay, the average path of a random walk is going to be,

say, off the top of our heads,

50 steps long, and you have four actions.

It is going to be roughly four to the power of 50,

which is a terrible,

terrible, terrible number if you want to actually do this many explorations.

And this happens because if you're exploring randomly,

you have a high probability of repeating yourself,

you have a high probability of going backward,

of visiting the same corner twice, thrice and so on.
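You can see this blow-up even without any walls. The toy simulation below (the grid size, step limit, and seeds are arbitrary assumptions) runs a uniform random walk on an open n-by-n grid from the bottom-left to the top-right corner, which is a much easier setting than a real labyrinth, and the walk still takes far more steps than the shortest path:

```python
import random

def random_walk_steps(n=8, max_steps=100_000, seed=0):
    """Number of steps a uniform random walk needs to go from the
    bottom-left to the top-right corner of an open n-by-n grid."""
    rng = random.Random(seed)
    x = y = 0
    for t in range(1, max_steps + 1):
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        # bumping into the outer wall leaves the position unchanged
        x = min(max(x + dx, 0), n - 1)
        y = min(max(y + dy, 0), n - 1)
        if (x, y) == (n - 1, n - 1):
            return t
    return max_steps

shortest = 2 * (8 - 1)  # 14 moves if you never repeat yourself
walks = [random_walk_steps(seed=s) for s in range(20)]
print(shortest, sum(walks) / len(walks))
```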

And then if you land on some trajectory that gets you to the end,

your Q-learning algorithm will need a lot of

repeated experience of getting there to actually learn the full Q-function,

unless, of course, you use experience replay,

which kind of simplifies the problem.
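The reason so much repeated experience is needed is that each tabular Q-learning update only propagates reward one step backward along the trajectory. A minimal sketch on a hypothetical three-state chain (the state and action indices are made up for illustration):

```python
def q_learning_update(Q, s, a, r, s_next, done, alpha=0.5, gamma=0.99):
    """One tabular Q-learning step: move Q[s][a] toward the one-step target."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# Chain 0 -> 1 -> 2 -> exit; only the exit transition is rewarded.
Q = [[0.0, 0.0] for _ in range(3)]
q_learning_update(Q, 1, 1, 0.0, 2, done=False)      # state 1 learns nothing yet
q_learning_update(Q, 2, 1, 100.0, None, done=True)  # exit: Q[2][1] becomes 50.0
# Only on the NEXT pass through state 1 will the reward reach Q[1][1].
```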

But, nevertheless, in this simple labyrinth case, with an Epsilon-Greedy policy,

the idea of exploring randomly brings you a lot

of repeated exploration that you don't really need.

So, you can go right, then go left,

then go right again, because you just rolled

this epsilon again and the random action happened to be right.

Now, this is, of course, a labyrinth.

It has little to do with any practical case.

Let's get a little bit more practical, although not entirely.

Let's get to some video games in Atari.

There's this Atari game called Montezuma's Revenge.

It's known for its tremendous difficulty and unforgiving setup,

in which your character, this red guy,

has to walk over all those ladders and not get caught by this apparently aimless skull,

and get the key and then move it somewhere, whatever.

It has to perform some huge sequence of actions.

And to get the reward, it has to get all those actions done right.

So, it won't get any reward even if it goes half the way to the key.

It's kind of reasonable. So, what do humans do?

Okay, I guess any human, well,

any human who has ever played a video game, would eventually

understand what it is they have to explore and go for the key right away.

But not the DQN or any other algorithm we've been using so far.

In fact, the DQN score at this game is exactly zero and this is not that good.

Again, it never reached the key at all.

This is bad because humans can easily solve it, but machines cannot.

This means that we're missing something in our reinforcement learning framework.

Well, it turns out there are a lot more practical things that humans

are much better at than reinforcement learning algorithms.

For example, you guys,

if you are going through this specialization one course at a time,

you've probably learned about variational autoencoders

in the span of one session,

one lifetime, and you've learned them to the point where you can actually reproduce them,

code them, and make them run, hopefully.

Or if not autoencoders,

you probably learned to write or read and

do all those complex things with only a little experience available.

And, of course, you had to explore some stuff to do so.

Of course, when you were a child you probably explored all kinds of stuff,

but at this point in time you still do some exploration.

For example, you've decided to take on this course,

which means that you're willing to explore the new kind of math.

Basically, let's imagine that you were a Q-learning algorithm.

In order to learn variational autoencoders,

in order to learn how to write the paper about variational autoencoders,

you would have had to generate, okay,

random sequences of symbols with an Epsilon-Greedy policy,

until one of them just happened to be the exact text of

the variational autoencoder code or paper or whatever.

Which is not the way you want to learn it.

The trick here is that, unlike Q-learning,

unlike any other algorithm we've learned,

humans don't explore with an Epsilon-Greedy policy.

Neither do they explore with Boltzmann exploration or anything like that.

We don't use those heuristics;

we have some other heuristics.

Let us see how humans explore.

Imagine you're an exploring human, I guess,

a human that has decided to explore and has two prospective options

to improve their knowledge about physics and get some cool stuff done in this area.

You have two options.

The first one is, you've just read that the day before yesterday

some cool physicists announced that there's some new particle

they've found at their Large Hadron Collider, and no one knows anything about it.

So, it's a complete black box for now.

They don't even know how it appears,

but they have it in the data.

Alternatively, you could try to solve another problem.

The problem is whether you can make yourself fly if you pull your hair up real strong.

So, that's still technically somewhat related to physics.

Which of those are you going to prefer if you had,

say, a year free for exploration?

Hopefully, the first one, of course.

The thing is, humans are not Epsilon-Greedy,

and they are not indifferent to what they explore.

In this case, well,

there are a lot of different explanations for why you would pick the first one,

and we're only going to cover some of them.