Now, another big problem with DQN is that it tries to approximate a set of values that are actually very interrelated.
Here's the point. Let's watch this video, and while it plays,
I want you to take a closer look at how
the Q values change in the bottom-left part of the screen.
I want you to find the segments where they are
more or less equal, and the segments where they differ as much as possible.
There will be a quiz afterwards to check that you've made it through the video.
So, what you might have noticed is that most of the time,
especially if the ball in Breakout is on the opposite side of the game field,
the Q values for all actions are more or less the same.
This is because in such a state,
one action, even a stupid one, won't change anything.
Even if you make one bad move while the ball is on the other side of the field,
you'll still have plenty of time to adjust, go the right way, and fix the issue.
Therefore, all Q values are more or less the same.
There is a common part that all of those action values share.
This is the state value, by the definition we introduced previously.
Now, there are also situations where the Q values differ a lot.
These are the cases where one action can make it or break it.
Maybe the ball is approaching your paddle, and if you move, say, to the right,
you'll end up in just the right position to catch it so that it bounces off.
If you don't, you'll just miss the ball and lose a life.
So, there are these rarer cases where the Q values are highly different.
The problem is that we treat them as
more or less independent predictions when we train our network.
So, let's try to introduce some of
this intuition into how we train the Q-network and see if it helps.
This brings us to another architecture.
It's called the dueling deep Q network.
The first thing we have to do is decompose
Q(s,a), the action value function.
This time, we rewrite it as the sum of the state value function
V(s), which only depends on the state,
and a new term, the capital A(s,a).
The capital A here is the advantage function,
and the intuition is that the advantage tells you how
much your action value differs from the state value.
For example, suppose you are in a state with two actions:
the first brings you a return of,
say, plus 100, and the second, plus 1.
Say, you are in a room and you have two doorways.
The first one leads you to a large cake,
and the second, to a small cookie.
After you take each of those actions,
the other opportunity gets lost.
Say, a door closes on the option you have not picked.
Now, in this case,
both action values are positive because you get positive reward.
Say, plus 100 for the cake and plus 1 for the cookie.
The advantages, on the contrary, are going to differ:
the advantage of the suboptimal action is going to be negative.
This is because if you take this suboptimal action,
then you get the action value of plus 1 minus the state value of plus 100,
which is minus 99.
Basically, it tells you that you have just lost
99 potential units of reward in this case.
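To make the arithmetic concrete, here is the same two-door example written out in a few lines of Python; the dictionary keys and the numbers are just the illustrative values from above, not anything from a real environment.

```python
# Action values from the two-door example (illustrative numbers only).
q_values = {"cake_door": 100.0, "cookie_door": 1.0}

# Under the optimal-policy convention, V*(s) is the best achievable action value.
state_value = max(q_values.values())  # 100.0

# Advantage of each action: A(s, a) = Q(s, a) - V(s).
advantages = {action: q - state_value for action, q in q_values.items()}
print(advantages)  # {'cake_door': 0.0, 'cookie_door': -99.0}
```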
Now, the definition here assumes that the value function we use is V*,
the value under the optimal policy.
But you can substitute any other definition of the value function, so
long as it is consistent with the Q-function and you understand what you're doing.
The way we're going to introduce this intuition
into a neural network is as follows.
We take the usual DQN,
which simply tries to predict all the
Q values independently in its final layer,
and we modify it using our new decomposition.
Now, the network will have one head
which only predicts the state value function, which is just one number per state,
and another head that predicts the set of all the advantages.
To predict those advantages, we actually have to constrain them
in a way that satisfies the common sense of reinforcement learning.
In the case of V*,
the maximum possible advantage is zero, because you can never
get an action value that is larger than
the maximum over all possible action values from that state;
that maximum is, by definition, the state value under the optimal policy.
Now, you might think you would train those two halves
separately and then just add them up to get your action values.
In fact, you do the opposite:
you train them together.
You add them up and minimize the same temporal difference error
we used in the usual DQN,
the mean squared error between the Q value and the improved Q value.
This basically nudges the neural network to approach the problem the right way.
By right, I mean that it should have some separate neurons that
only solve the problem of how good it is to be in this state,
and other neurons that only say
whether a particular action is better than another one.
Now, this is basically the whole idea behind dueling DQN.
The only freedom left is that you may define
those advantages and value functions differently.
The option we just covered is the maximization:
you impose the constraint that the maximum advantage is zero.
You can also require, for example,
that the average advantage be zero by subtracting the mean instead.
The value then roughly corresponds to the expected action value
under the current policy rather than the optimal state value.
This loses a bit of the exact interpretation, but in practice
it is a heuristic that proves to be slightly better on many problems.
So, here's how dueling DQN works.
We simply introduce those two intermediate heads and then add them up,
and by introducing them separately,
we hint to the network that its output should be the sum of
two important quantities that may not be that interdependent.
So, this is the dueling DQN.
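As a rough sketch, here is what such a dueling head might look like in PyTorch. The class name, layer sizes, and feature_dim are made up for illustration; the point is only the split into a value stream and an advantage stream, and the subtraction that enforces the constraint.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Sketch of a dueling head: V(s) and A(s, a) streams recombined into Q(s, a)."""

    def __init__(self, feature_dim: int, n_actions: int):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)               # one number per state
        self.advantage = nn.Linear(feature_dim, n_actions)   # one number per action

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                   # shape: [batch, 1]
        a = self.advantage(features)               # shape: [batch, n_actions]
        # Max-subtraction variant: enforces max_a A(s, a) = 0, matching the V* view.
        a = a - a.max(dim=1, keepdim=True).values
        # Mean-subtraction variant (often slightly better in practice):
        # a = a - a.mean(dim=1, keepdim=True)
        return v + a                               # Q(s, a) = V(s) + A(s, a)
```

The shared convolutional body and the TD loss stay exactly as in the usual DQN; only the last fully connected part is replaced by a head like this.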
Now, the final trick for this video tackles
another issue that we have not yet improved on since basic Q-learning:
the issue of exploration.
The problem with DQN is that,
while it is so elaborate, with neural networks and all that stuff,
it still explores in a rather shallow way.
The problem is that if you use,
for example, an epsilon-greedy policy,
then the probability of taking one particular
suboptimal action is epsilon divided by the number of actions,
while the probability of, say,
taking five or ten suboptimal actions in a row is going to be near zero,
because it is roughly epsilon to the power of that number of actions, five or ten.
If epsilon is 0.1,
you can do the math and see how small it gets.
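For a concrete feel of how quickly that probability collapses, here is the back-of-the-envelope calculation, treating epsilon itself as the rough per-step chance of an exploratory move:

```python
epsilon = 0.1  # typical exploration rate

# Rough probability of k exploratory (and thus likely suboptimal) steps in a row.
for k in (1, 5, 10):
    print(k, epsilon ** k)  # prints roughly 0.1, 1e-05, 1e-10
```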
The problem is that sometimes you actually need to make those bold steps,
a few similarly suboptimal actions in a row,
to discover something new: not a small deviation from the current policy,
but a completely new policy,
one that approaches the entire decision process differently.
The epsilon-greedy strategy is very unlikely to discover this.
It is very prone to converging to a local optimum.
There is one possible way to solve this.
One approach that fits well with
the deep Q-network architecture is the so-called bootstrapped DQN.
The idea here is that you train a set of K,
say five or ten, Q-value predictors
that all share the same main body,
so they all have the same convolutional layers.
The way they are trained is that at the beginning of every episode,
you pick one of them at random:
you basically throw a die and pick one of those K heads.
Then you follow its actions and you
train the weights of that head and the weights of the shared body,
and you keep doing this for the entire episode.
This way, the chosen head is going to get slightly
better, but the shared features are going to change as well.
Then at the beginning of the next episode,
you pick another head:
you re-throw the die, see what comes up,
and follow the policy of this new head.
Since those heads are not all trained directly on the same experiences,
they're not guaranteed to be the same, so
the network has some room to find different strategies.
So, until they all converge to one policy,
you can expect them to differ,
and this difference is going to be systematic.
They won't just take a random suboptimal action, say,
one time out of 100;
they'll be fundamentally different in the way
they prioritize some actions over others.
Now, you simply repeat this process over all the episodes.
So, you start an episode, pick a head,
train this head, train the body,
then pick another head, and so on.
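Here is a minimal sketch of that setup in PyTorch. BootstrappedQNet, the tiny linear body, and the input sizes are made-up placeholders standing in for the real convolutional network; the point is only the shared-body / K-heads structure and the per-episode head selection.

```python
import random
import torch
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    """Sketch of bootstrapped DQN: one shared body, K separate Q-value heads."""

    def __init__(self, state_dim: int, n_actions: int,
                 k_heads: int = 10, feature_dim: int = 256):
        super().__init__()
        # Shared body: all the heavy feature learning happens here.
        self.body = nn.Sequential(nn.Linear(state_dim, feature_dim), nn.ReLU())
        # K lightweight heads, each predicting Q values for every action.
        self.heads = nn.ModuleList(
            [nn.Linear(feature_dim, n_actions) for _ in range(k_heads)]
        )

    def forward(self, state: torch.Tensor, head: int) -> torch.Tensor:
        return self.heads[head](self.body(state))


net = BootstrappedQNet(state_dim=4, n_actions=4)
# At the start of each episode, "throw a die" and commit to one head.
active_head = random.randrange(len(net.heads))
q_values = net(torch.randn(1, 4), head=active_head)
action = int(q_values.argmax(dim=1))
# The TD loss computed on these Q values would update both this head and the shared body.
```

Each head acts greedily with respect to its own Q values, so the diversity between heads, rather than epsilon noise, is what provides the exploration.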
For DQN, this process is, in fact,
much cheaper than training,
say, K separate agents from scratch,
because most of the heavy lifting,
the feature learning, is done in the shared body.
Since this feature part is trained on every iteration,
because all the heads are connected to it,
you can expect the overall progress to be
almost as fast as with the usual DQN.
Maybe even faster, because better exploration usually means a better policy in less time.
Now, this whole approach goes under the name of deep exploration strategies,
because each head is able to take a long sequence of
correlated actions that are different from those of the other heads.
But otherwise, it's still more or less a heuristic that somehow works.
We'll link to a more detailed explanation of
this approach in the reading section, as usual.
Of course, you can expect to find dozens of other such architectures,
and you may even come up with your own new DQN flavor yourself, because as of 2017,
they are still getting published;
the latest ones I know of are from this very year's ICML conference.
The idea of such an architecture is usually that the authors spot some problem,
some issue, some way you can improve,
and then propose a fix for that particular issue.
This shows us two things:
first, that this field of learning develops really rapidly,
and second, that the DQN architecture can be varied in all the ways you can imagine.
Of course, we'll cover an alternative solution right next week,
but until then, you should get a little more acquainted with the DQN.