[MUSIC]

There are, as it turns out, multiple ways to do so.

The first one is specific to this UCB.

First, in general, you don't need to get the whole distribution.

You only need to know it's percentile.

And you might have to compromise a little bit in the way that you don't

know the exact value, but you only know it for some inequality.

Turns out, in statistics, there is more than one way you can compute that.

There are inequalities like the counting inequality,

also known as Chernoff inequality.

Or or there's going to be like five more of them.

That all compute something like the previnsi of some particular value being

more than some particular threshold.

The case that this particular rate regardless of the distribution.

Or what as in the housing case regardless of a distribution shape

as long as the possible value of this action q value is between 0 and 1.

In this case the formula on the slide is a simplified version of the Hoeffding

inequality here.

You can easily find the full version if you just Wikipedia it.

But the essence here is that the probability of your action value, q-value,

being larger than the average q-value you've measured so far by more than t.

Which is a value of say plus 5.

Would decay as exponent to the power of -2 times number of samples,

number of times you've measured this q-value, times t squared.

So if your t is equal to 5, and say your q-value,

your average q-value is 10, and t is equal to 5.

The [INAUDIBLE] of your q-value being larger than 15 is

exponent of -2 times n times 25.

And n might be some 1,000.

Which is kind of really, really small number but they are non 01 nevertheless.

What you want to do with this inequality is you want to solve it not for

the probability, but for t, for this offset here.

You want to find such an offset, such a difference between expected value and

your action value, q-value.

Such that the probability of being above this offset is greater,

okay lesser or equal than 5%.

Or another chance probability of being within this offset, so less or equal,

is at least 95%.

This would be some estimate of the percentile wanted,

all by it is an inequality so it might be a less efficient one.

If you solved this thing for t, plugging in some probablilty that you want,

say you want to be 99% sure that you are within this range.

Then your solution is going to look something like this.

Of course there can be a coefficient here between the square root.

Which depends on the probability, the certainty you want to get.

And if you want to be 90% it would be less than 1 here, and

if you want to be 9.99999% sure it will be, say, maybe 10 times, 100 times more.

But the general idea looks like this.

You take the amount of times you've picked this action in a state.

And you take the amount of times you've ever been to the state and

picked any action, regardless of which one.

And then you plug them in this square root thing here.

So you take the logarithm over time, over a number of times,

you've been to the state.

Divided by times you've picked this action in the state,

multiplied by two square root, you know all that stuff.

And this is how far you get from the expected value.

So the final and if priority of action is a sum of this exponent value and

the addition from the UCB, from the upper confidence bound.