0:00

In this video, I'm gonna describe the Bayesian Approach to fitting models, using

a simple coin-tossing example. If you already know about the Bayesian

Approach, you can skip this video. The main idea behind the Bayesian Approach

is that, instead of looking for the most likely setting of the parameters of the model, we

should consider all possible settings of the parameters and try to figure out for

each of those possible settings, how probable it is, given the data we

observed. The Bayesian framework assumes that we

always have a prior distribution for everything.

That is, for any event that you might care to mention, I have to have some prior

probability that that event might happen. The prior might be very vague.

1:06

It can disagree with the prior. And in the limit, if we get enough data,

however unlikely the prior is, the data can overwhelm it.

And in the end, with enough data, the truth will out.

That is, even if your prior's wrong, you'll end up with the right hypothesis.

But that may take an awful lot of data if you thought that things were very unlikely

under your prior. So let's start with a coin tossing

example. Suppose you don't know anything about

coins except that they can be tossed and when you toss a coin you get either a head

or a tail. And we're also going to assume you know

that each time you do that it's an independent event.

2:20

The frequentist answer, which is also called maximum likelihood, is to pick the

value of p that makes the observations most probable.

And that value of p is 0.53. It's not obvious that's true, so let's derive

that. So the probability of a particular

sequence that contains 53 heads and 47 tails could be written out by writing down

p every time you toss a head. And 1-p every time you toss a tail.

And then if we collect all the p's together, and all the 1-p's together, we

get p^53 times (1-p)^47. If we now ask, how does the probability of

observing that data depend on p, we can differentiate with respect to p, and we

get the expression shown here, and if we then set that derivative to zero,

we discover that the probability of the data is maximized by setting p to be 0.53.
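
The derivation just described can be written out as follows. This is a reconstruction from the spoken description, not necessarily the exact expression on the slide; D is used here as shorthand for the observed sequence of 53 heads and 47 tails.

```latex
\begin{align*}
P(D \mid p) &= p^{53}(1-p)^{47} \\
\frac{dP(D \mid p)}{dp} &= 53\,p^{52}(1-p)^{47} - 47\,p^{53}(1-p)^{46} \\
&= \left(\frac{53}{p} - \frac{47}{1-p}\right) p^{53}(1-p)^{47} \\
0 &= \frac{53}{p} - \frac{47}{1-p}
  \;\Longrightarrow\; 53(1-p) = 47p
  \;\Longrightarrow\; p = \frac{53}{100} = 0.53
\end{align*}
```

The factor p^53 (1-p)^47 is positive for any p strictly between zero and one, so the derivative can only vanish there when the bracketed term is zero.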

So that's maximum likelihood. But there are some problems with using

maximum likelihood to decide on the parameters of a model.

4:10

We don't know much. We don't have much data, and so we're

unsure about what the value of P is. So what we really ought to do is refuse to

give a single answer and instead give a whole probability distribution across

possible answers. An answer like 0.5 is fairly likely.

An answer like one is maybe still pretty unlikely if we have some prior belief that

coins come down heads half the time. So, now I'm going to go through an example

where we start with some prior distribution over parameter values, and

we'll pick a prior distribution that's easy to work with.

Not one that necessarily fits what we really believe about coins.

And then we'll show how that prior distribution gets modified by data if we

adopt the Bayesian Approach. So, we're gonna start with a prior

distribution that says all the different values of p are equally likely.

We believe that coins come biased to various extents.

And any amount of bias is equally likely. So that some coins come down heads half

the time. Other coins come down heads all the time.

And those two kinds of coins are equally likely.

We now observe a coin coming down heads. So what we do now, is for each possible

value of p, we take its prior probability and we multiply by the probability that we

would have observed a head, given that that value of p is the correct one.

So, for example, if we take the value of P = one which says coins come down heads

every time then the probability of observing a head would be one.

There would be no alternative. And if we take the value of P to be zero

the probability of observing a head would be zero.

And if we take it to be 0.5, the probability of observing a head is 0.5.

So we take that red line, that's our prior, and we multiply each point on that

by the probability of observing a head according to that hypothesis.

And now we get the sloping line; that's an unnormalized posterior.

It's unnormalized because the area under that line doesn't add up to one.

And of course for a probability distribution, the probabilities of all the

alternative events have to add to one. So the last thing we do is re-normalize

it. We scale everything up so the area under

the curve is one. And now, if we started with the uniform prior

distribution over p, we end up with this triangular posterior distribution over p

having observed one head. Now let's do it again.

And this time let's suppose we get a tail. So, the prior distribution that we start

with now is the posterior distribution we had after observing our one head.

And now, the green line shows the probability that we will get a tail

according to each of those hypotheses that correspond to a value of P.

So, for example, if P is one, the probability that we would observe a tail is

zero. So we have to multiply our prior by our

likelihood term. And now we get a curve like that.

Then we have to re-normalize to make the area be one.
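
The multiply-then-renormalize step for one head followed by one tail can be sketched numerically. This is a minimal illustration, assuming the values of p are approximated by a fine grid and the area is computed with the trapezoid rule; the names `normalize` and `update` are made up for this sketch and do not come from the lecture.

```python
import numpy as np

# Candidate values of p on a fine grid over [0, 1] (an approximation).
p = np.linspace(0.0, 1.0, 1001)
dp = p[1] - p[0]

def normalize(density):
    """Rescale so the area under the curve (trapezoid rule) is one."""
    area = np.sum((density[:-1] + density[1:]) / 2.0) * dp
    return density / area

def update(prior, saw_head):
    """Multiply the prior by the likelihood of one toss, then renormalize."""
    likelihood = p if saw_head else 1.0 - p
    return normalize(prior * likelihood)

prior = normalize(np.ones_like(p))                     # uniform prior
after_head = update(prior, saw_head=True)              # triangular posterior
after_head_tail = update(after_head, saw_head=False)   # proportional to p(1-p)

print(p[np.argmax(after_head)])        # where the triangular posterior peaks
print(p[np.argmax(after_head_tail)])   # where the head-then-tail posterior peaks
```

After the head the density rises linearly from zero at p = 0, and after the tail it is zero at both ends of the grid with its highest point in the middle.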

And that's now a posterior distribution, after having observed one head and one

tail. And notice it's a pretty sensible

distribution. After observing one of each, we know that

P can't be either zero or one, and it also seems very sensible that the most likely

thing is now in the middle. If we do this again another 98 times, and

keep applying the same strategy of multiplying the posterior we had after the

last toss by the likelihood of observing that event, given the various different

settings of the parameter p, and if we suppose we get 53 heads and 47

tails in all, then we'll end up with a curve that looks

like this. It'll have its peak at 0.53,

because we started with the uniform prior. And it'll be fairly sharply peaked around 0.53.
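
The same update applied over all 100 tosses can be sketched as below. This is an illustration only: the grid approximation, the helper `area`, and the ordering of the tosses (all heads first) are assumptions of the sketch. The ordering does not change the final curve, since the posterior is just a product of the same 100 likelihood terms.

```python
import numpy as np

# Grid of candidate values for p (an approximation to the continuous range).
p = np.linspace(0.0, 1.0, 10001)
dp = p[1] - p[0]

def area(density):
    """Area under a curve on the grid, using the trapezoid rule."""
    return np.sum((density[:-1] + density[1:]) / 2.0) * dp

# Hypothetical toss order: 53 heads followed by 47 tails.
tosses = [True] * 53 + [False] * 47

posterior = np.ones_like(p)  # uniform prior
for saw_head in tosses:
    # Multiply by the likelihood of this toss, then renormalize.
    posterior = posterior * (p if saw_head else 1.0 - p)
    posterior = posterior / area(posterior)

peak = p[np.argmax(posterior)]
mean = area(p * posterior)
std = np.sqrt(area((p - mean) ** 2 * posterior))
print(f"peak={peak:.3f} mean={mean:.3f} std={std:.3f}")
```

Renormalizing after every toss, rather than once at the end, keeps the numbers in a comfortable range instead of letting the product shrink toward zero.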

But it'll allow other values: 0.49, for example, is a perfectly reasonable value under this

curve. Not quite as likely as 0.53, but very

reasonable. Whereas a value of 0.25 is extremely

unlikely under this curve. So we can summarize all that with Bayes'

theorem. The term in the middle of this equation is

the joint probability of a set of parameters W, and some data D.

And for supervised learning, the data is gonna consist of the target values.

So we assume we are given inputs and the data values consist of the target values

that are associated with those inputs. That's what we observe.

9:44

Bayes' theorem says that the posterior probability of a particular value of W,

given the data D, is just the prior probability of that particular value of W

times the probability given that particular value of W, that you would have

observed the data you observed. And that has to be normalized by P of D.

The probability of the data which is simply the integral over all possible

values of W, of P of W times P of D given W. The bottom line needs to be the sum of the

top line over all possible values of W in order for this to be a probability

distribution that adds to one. Because P of D is integrated over

all possible values of w, it's not affected by picking a particular value of

w on the left-hand side. So when we're looking for the best value

of w, for example, we can ignore p of d. It doesn't depend on w.

The other two terms on the right-hand side, however, do depend on w.
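
Written out, the relationships described in words above are as follows. The notation is a reconstruction from the spoken description, with W for the parameters and D for the data.

```latex
\begin{align*}
p(W, D) &= p(W)\,p(D \mid W) = p(D)\,p(W \mid D) \\
p(W \mid D) &= \frac{p(W)\,p(D \mid W)}{p(D)},
\qquad
p(D) = \int p(W)\,p(D \mid W)\,dW
\end{align*}
```

Since p(D) is the same for every value of W, the posterior is proportional to p(W) p(D | W), which is why p(D) can be ignored when searching for the best W.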