0:00

[MUSIC]

In the previous video, we discussed that it

can be nice to build a Markov chain to sample from sum distribution.

But how to build such Markov chain?

A simplest method for that is called Gibbs sampling.

Involves as follows.

So say you have a distribution, a three-dimensional one,

which we know up to normalization cost.

So we can compute this probability or

density in the continuous case at any given point, but up to normalization.

So the Gibbs sampling works as follows.

First of all, we start with some initial point, for example, all 0s.

And it doesn't matter where we start.

Well, it does matter for some convergence properties, and

stuff like that, but it doesn't matter for the sake of correctness.

And then, as we're starting to build our next iteration point,

we're going to build this next iteration point one dimension at a time.

So the first dimension of the next point will be a sample from

conditional distribution, on x1, given the varies of x2 and

x3, which we had from our initialization.

And you may ask, how can we sample from this conditional distribution?

It's also a hard problem, right?

But note that the conditional distribution is proportional to the joint distribution.

So we know it, again, up to normalization constant.

And sampling from one-dimensional distributions,

it's usually much easier of it, than something from my multidimensional one.

So we assume that we can do it.

In some particular practical cases we can invent a really efficient scheme for

analytically sampling from these kind of one-dimensional distributions.

In others, we have to use general purpose schemes, like rejection sampling.

But anyway, one-dimensional sampling is easy.

So let's go to the next sub-step.

On the next sub-step,

we're going to generate the second dimension of our first duration point.

And it's going to be generated from the conditional distribution on x2,

given x1 being the dimension we have just generated,

and x3 being the initialization point dimension.

And finally, we will generate the third dimension in the same fashion.

So we'll generate it from the conditional distribution,

given the stuff we have just generated.

And in the general case, this was the iteration one.

And these three sub-steps constitutes into just one

kind of big step of our Markov chain.

So it's a way we transition from x0 to x1.

And if you write it down for

general iteration number k, it will look like this.

So the point from duration k + 1 is generated one-dimension at a time,

by using some conditional distributions, which are conditioned on the stuff

we have from the previous duration, and from the current duration.

And note that this schema's really not parallel.

So each step depends on the previous steps to be executed.

And this means that no matter how many computers we have,

if we have a 1 million-dimensional space,

we'll have to sample these dimensions one at a time, which is kind of a downside.

But we'll discuss later how can we avoid it.

Anyway, this procedure gives you a way to move from x0 to x1, and to x2, and etc.

And let's prove that this indeed defines a Markov chain,

which converged to the distribution p.

So if we sample like this long enough,

we'll get samples from the desired distribution.

And we are going to do it on the blackboard.

4:12

So let's prove that the Gibbs sampling over the three sub-steps,

considered as one big step, indeed provides you a Markov chain

that converged to the desired distribution p.

So what we want to prove is that p of the new point,

x prime, y prime, and z prime, equals, so

we want to prove that it equals, to the one step for

the Gibbs sampling, starting from the distribution p,

and then doing one step of the sampling.

So here we'll have marginalizing out the previous step, x, y, and z.

And the transition probability, q, probability to go from x,

y, z, to x prime, y prime, z prime.

5:17

Let's look at the right hand side of this expression.

So we have sum with respect to the previous

point of the transition probability.

So how probable is it, under the Gibbs sampling, to move from x,

y, z, to x prime, y prime, z prime?

Well, we have to first of all move from x to x prime,

so that will be least conditional

probability, y = y, and z = z.

Then we have to assume that we move to x prime, and

then what is the probability that we move from y to y prime?

So it'll be y prime, given that we have already

moved to x prime, and started with z.

6:27

So this is the transition probability q, this overall thing, and

times the starting probability, p (x, y, z).

So times p (x, y, z).

And we want to prove that this thing equals to the p (x prime,

y prime, z prime), because in this case,

we will prove that this p is stationary for our Markov chain.

And then, it means that the Markov chain converged to this distribution, p.

So first of all, it's not that this part, p of z prime,

it doesn't depend on x, y or z at all.

So we can move it outside the sum.

It'll be p (z prime | x prime,

y prime) times sum,

7:32

And then, times the probability of p (x, y, z).

And note here that, actually, the only part that depends on x is this one.

So it's p (x, y, z), which basically means that we can push the summation aside.

We can write here the sum with respect to only y and z, and

write the summations with respect to x here.

So it's summation with respect to all values of x,

from 1 to the cardinality of x.

And also, by the way, note that if our variables x, y, and

z would be continuous, we will have integrations instead of sums.

But the overall logic of the proof will be exactly the same.

Okay, so we have this expression.

9:21

Let's start with this term, p (y prime | x prime, z),

times the product of these two joints, which is the joint distribution.

And note that the only part that depends on y is this joint distribution.

So we can move this summation again, with respect to y, inside and right down here,

summation only with respect to z, and here the summation with respect to y.

So it will be p (x prime, y, z),

which is a product of these two terms, okay?

And again, note that when we marginalize out y here,

we end of with p (x prime, z).

11:39

So we're done here.

We have just proven that if you start with distribution p on the step x,

y, z, and then make one step of this Gibbs sampling procedure,

one coordinates at a time, we'll end up with the same distribution,

p, p (x prime, y prime, z prime).

Which basically means that this distribution beam's stationary for

the Gibbs sampling.

And that's what will converge to this distribution from any starting point.

And basically, it means that if we use Gibbs sampling, it, indeed,

will eventually start to produce us samples from the distribution p.

[MUSIC]