The one we're going to study now comes from the fact that the first option is unknown.
There is this new particle that no one knows anything about.
Well, the second one, it turns out, is something most people have
tried at some point in their childhood.
They might not remember it
directly, or they may have read about it or seen it happen to someone else.
Either way, they probably know the outcome of
the second research opportunity well enough not to try it at home anytime soon.
So, we want to prioritize actions that are uncertain,
whose outcomes are not yet that well known to us.
And to actually implement this in any practical algorithm,
we won't just require the Q-values themselves.
We want a probability distribution over the Q-value,
in the Bayesian sense.
So basically our belief, expressed as a probability,
that the Q-value is going to turn out to be this or that number.
In the plot at the bottom of the slide,
you can see three probability distributions for actions in one particular state;
each represents our belief about
a particular action's Q-value in that state.
Again, for the third time, we emphasize that this is a Bayesian probability.
It means that the variance, the breadth of
this curve, doesn't represent the randomness in the action itself,
but only our belief.
It means that if the green action is
actually deterministic, but we have never tried it yet so we have no idea,
or we've only tried it a few times,
then its distribution is going to be quite broad anyway.
Meanwhile, the orange action can actually be very noisy.
It can have a very wide range of returns, plus or minus 10,
while we are dead sure that its expected return is going to be, what is it,
0.5 or 0.6 or whatever.
So it's our belief,
expressed as a probability distribution,
the same way you did in the Bayesian methods course at the beginning of the specialization.
And our challenge here is to pick an action not just given the action values,
but given the beliefs we have about them.
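As a small illustration (not taken from the lecture itself), one simple way to represent such beliefs is an independent Gaussian per action; the `q_mean` and `q_std` arrays below are made-up numbers that roughly match the slide's description.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical Bayesian beliefs about Q(s, a) for three actions in one state
# (made-up numbers, only for illustration).
q_mean = np.array([0.6, -0.5, 0.0])   # orange, blue, green: current estimates
q_std  = np.array([0.1,  0.2, 1.0])   # green is broad: we've barely tried it

# How plausible is a Q-value of 0.5 under each belief?
print(norm(q_mean, q_std).pdf(0.5))
```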
So now we're done with this.
We have this one state and three actions,
and the beliefs of
the Q values of those actions are represented with those distributions.
So these our actual beliefs of what the Q value is going to turn out.
Now I want you to tell me,
which of those three actions are even eligible?
Which of them does it make sense to pick,
regardless of what method we use?
Well, it turns out we can rule one out.
The thing is, the blue action is a poor pick regardless of how we choose,
because the orange one dominates it in expectation.
So if we want to exploit, it would always be
the orange one, not the blue one; it's just better.
And if we want to explore, the blue action is dominated by the green one.
So, the green one has some chance of being better than the orange one,
if we believe these belief distributions, sorry for the tautology.
But the blue one won't be any better.
So, it's either the green one
or the orange one,
depending on how you prefer to balance exploration and exploitation.
So now let's get to some algorithms that actually
decide the probability of picking the green action versus the orange one,
and hopefully never pick the blue one.
Let's begin with Thompson sampling.
This algorithm is actually a more general one,
but we are only going to study its simpler form for now.
Thompson sampling suggests that you take one sample
from each of those distributions.
If they are normal distributions, you can just sample from a Gaussian,
and if they're empirical distributions, histograms,
you can just take a numpy random sample or whatever.
And then you'll get three points,
one sampled Q-value for each action.
Thompson sampling wants you to pick the action
whose sampled Q-value is the largest.
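As a minimal sketch of this step, assuming the same hypothetical Gaussian beliefs as before (these arrays and names are assumptions, not from the lecture):

```python
import numpy as np

def thompson_sample_action(q_mean, q_std, rng=None):
    """Draw one sample from each action's Gaussian belief and pick the argmax."""
    rng = np.random.default_rng() if rng is None else rng
    sampled_q = rng.normal(q_mean, q_std)   # one sampled Q-value per action
    return int(np.argmax(sampled_q))

# Hypothetical beliefs: orange, blue, green (same made-up numbers as above).
q_mean = np.array([0.6, -0.5, 0.0])
q_std  = np.array([0.1,  0.2, 1.0])
action = thompson_sample_action(q_mean, q_std)
```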
The question to you is:
on average, given those distributions,
which action is going to be picked the most,
which the least, and which in between?
What are the probabilities of taking each action?
Of course, there's more than one possible way to interpret those histograms,
but in general, you can more or less say that
the blue one is going to be picked with probability close to zero,
because regardless of what you sample from it,
with probability close to one the sample from the orange one will be larger.
So, it no longer needs to be explored.
A sample from the orange one is between, say,
roughly 0.3 and 0.8 or 0.9,
while a sample from the green one can be anywhere between
minus whatever and plus one and something.
This actually means that
at some points you will pick the green one,
as it is better for you to explore it,
and at other points you'll pick the orange one.
Of course, you can also sample with some temperature or use
importance sampling to skew this towards more exploration or more exploitation.
Flatten everything, and this time
you're going to explore more.
Or you can sample proportionally to the square of each probability, and it will exploit more.
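Under the Gaussian-belief assumption, one simple way to illustrate this flatten-versus-sharpen idea (an assumption for illustration, not the lecture's exact recipe) is to rescale each belief's standard deviation by a temperature before sampling:

```python
import numpy as np

def tempered_thompson_action(q_mean, q_std, temperature=1.0, rng=None):
    """Tempered Thompson step: temperature > 1 widens the beliefs (more
    exploration), temperature < 1 sharpens them (more exploitation)."""
    rng = np.random.default_rng() if rng is None else rng
    sampled_q = rng.normal(q_mean, q_std * temperature)
    return int(np.argmax(sampled_q))
```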