All right. Now that we know what mean field is and we've derived the formulas,

let's see an example.

It is called an Ising model.

This model is widely used in physics.

So we have a model,

that is a two dimensional lattice.

Its elements are random variables that can take the value -1 or +1.

We also need a function, neighbors, that returns the set of neighboring elements.

For example, this node here will have three neighbors.

We define the joint probability over all these variables in the following way.

It would be proportional to an exponent of 1/2 times J,

that is a parameter of the model,

times the sum over all edges,

and the product of the two random variables.

If the neighboring values have the same sign,

it will contribute one to the total sum.

And if the product is -1,

it will contribute -1 to the total sum.

And we also have another term, the sum over all nodes of the lattice of bi times yi.

This is called an external field.

So we'll denote this function, the exponent of these terms, as phi of y,

and we'll see what we can do with this model.
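
As a minimal sketch of the model just described, here is how phi of y could be evaluated in log form. It assumes a square lattice stored as a list of lists without wrap-around at the borders; the names `neighbors` and `log_phi` and the storage layout are illustrative choices, not something fixed by the lecture.

```python
def neighbors(i, j, n):
    """Return the lattice neighbors of node (i, j) on an n x n grid (no wrap-around)."""
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, c) for a, c in candidates if 0 <= a < n and 0 <= c < n]


def log_phi(y, b, J):
    """Log of the unnormalized probability: (J / 2) * pair sum + sum_i b_i * y_i."""
    n = len(y)
    pair_term = 0.0
    field_term = 0.0
    for i in range(n):
        for j in range(n):
            # Every edge is visited from both of its endpoints,
            # which is why the pair sum is multiplied by 1/2 below.
            pair_term += sum(y[i][j] * y[a][c] for a, c in neighbors(i, j, n))
            field_term += b[i][j] * y[i][j]
    return 0.5 * J * pair_term + field_term


# A 2 x 2 lattice with all spins aligned and no external field (assumed toy values).
y_aligned = [[1, 1], [1, 1]]
b_zero = [[0.0, 0.0], [0.0, 0.0]]
value = log_phi(y_aligned, b_zero, J=1.0)
```

On this toy lattice there are four edges, each contributing one, so the value is J times four. A node on the border of a 3 by 3 lattice, such as `neighbors(0, 1, 3)`, gets three neighbors.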

But first of all,

let's interpret it somehow.

If J is greater than zero,

then the values yi will tend to have the same sign.

This is the case for ferromagnets.

And yi's can be interpreted as spins of atoms.

If J is less than zero,

the neighboring spins will tend to anti-align,

and this is shown on the right.

So, we have defined the distribution up to a normalization constant.

But to compute it, we'll have to sum up over all possible states.

And there are two to the power of N squared terms for an N by N lattice, which is infeasible for large lattices.
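
To make the cost concrete, here is a rough sketch that computes the normalization constant by brute force on a tiny lattice. The flattened state layout and the names `log_phi` and `partition_function` are assumptions made for this illustration only.

```python
import itertools
import math


def log_phi(state, n, b, J):
    """Unnormalized log-probability of a flattened n x n spin configuration."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            s = state[i * n + j]
            total += b[i * n + j] * s
            # Count each edge once (right and down neighbors); this equals
            # one half of J times the sum in which every edge appears twice.
            if j + 1 < n:
                total += J * s * state[i * n + j + 1]
            if i + 1 < n:
                total += J * s * state[(i + 1) * n + j]
    return total


def partition_function(n, b, J):
    """Sum exp(log_phi) over all 2 ** (n * n) configurations (tiny n only)."""
    states = itertools.product((-1, 1), repeat=n * n)
    return sum(math.exp(log_phi(s, n, b, J)) for s in states)


n = 3
num_states = 2 ** (n * n)  # number of terms in the sum for a 3 x 3 lattice
Z = partition_function(n, b=[0.0] * (n * n), J=0.0)
```

With no interaction and no field every term equals one, so the sum is just the number of states. The number of states grows as two to the power of N squared, which is why this loop is hopeless beyond very small lattices.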

So let's try to apply the mean field approximation to compute p of y approximately.

So we'll try to approximate it by a product of

terms, where each term is a distribution over a single variable.

So here's an example.

We have four nodes here,

and the central node i.

The external field, parameterized by B,

is shown here as the yellow and the green sides.

On the yellow side,

there is a strong negative field,

and on the green side there is a strong positive field.

So in the positive field,

the corresponding nodes would tend to have a positive sign.

In the negative field, the nodes would tend to have a negative sign.

So actually in this case,

the left node would say something like,

I feel the negative field on me.

And the other three nodes will say something like,

they feel the positive field.

And so they have some information and

they will try to contribute it to the current node i.

And this would be done using mean field formula.

So here is our joint distribution.

And here's a small picture of our model.

So we're interested now in deriving the update formula for the mean field approximation.

So what we'd like to do,

is to find the q of Yk.

We'll do this using mean field approximation.

So, the idea is that the neighboring points already know some information,

about the external fields B.

For example, this says that there is external field of the sign plus,

this also plus, here minus and minus.

So they have some information and they want to propagate it to our node.

And we'll see how it is done.

So the formula that we derived in the previous video looks as follows.

We know that the logarithm of qk,

let me write down the index k here,

equals the expectation over all variables except for the k-th,

which we write down as q minus k,

of the logarithm of the actual distribution that we are trying to approximate.

So it will be p of y, plus some constant.

Notice here that we didn't write down the full distribution,

since we do not know the normalization constant.

However, here we can take it out into the constant.
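
Written out, the update formula from the previous video reads as follows. The symbol Z for the unknown normalization constant is notation assumed here; since log p(y) = log phi(y) - log Z and log Z does not depend on y, it is absorbed into the constant.

```latex
\log q_k(y_k)
  = \mathbb{E}_{q_{-k}} \log p(y) + \text{const}
  = \mathbb{E}_{q_{-k}} \log \phi(y) + \text{const},
\qquad p(y) = \frac{\phi(y)}{Z}
```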

So now we can omit the terms that do not depend on Yk.

And if we write it down carefully,

we'll get the following formula.

So we have the expectation

over all variables except for the k-th

of the logarithm of all of these terms.

The logarithm cancels the exponent, so only what was under the exponent is left.

So let me write it down.

It is J times the sum over j that are

neighbors of k of yk times yj.

I omitted the one half here since in the original formula

each edge is counted twice,

and here we only want to write it down once.

Plus bk times yk.

So this is what was under the exponent, plus some constant.

All right.

We can take the expectation and move it under the summation.

We'll get J times the sum over j that are

neighboring points of the current node.

We take the expectation over all variables except for the

k-th, so yk is actually a constant with respect to the integration.

We can write it down here and take the expectation of yj.

And this term is simply a constant with respect to the integration,

so we'll have just bk times yk, plus some constant.

Let's denote the expected value of yj as mu j.

It's just the mean value of the j-th node.

And actually, the information that the node obtained from the other nodes or

from the external field in this point is contained in the value of mu J.

So this equals,

after we group the terms corresponding to yk, the following function.

This would be yk times, in brackets, J times the sum of mu j over the neighbors

plus bk. Since we don't want to write this down multiple times,

let's say that the thing in brackets equals some constant M,

and finally we add the constant.
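
The steps of this derivation can be collected in one place. The notation N(k) for the set of neighbors of node k is assumed here for compactness; everything else follows the symbols used in the lecture.

```latex
\begin{aligned}
\log q_k(y_k)
  &= \mathbb{E}_{q_{-k}}\Big[ J \sum_{j \in N(k)} y_k y_j + b_k y_k \Big] + \text{const} \\
  &= J \sum_{j \in N(k)} y_k \, \mathbb{E}[y_j] + b_k y_k + \text{const} \\
  &= y_k \Big( J \sum_{j \in N(k)} \mu_j + b_k \Big) + \text{const}
   = y_k M + \text{const}
\end{aligned}
```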

So now we want to estimate the distribution qk, but for now we know it only up to a constant.

Let's take the exponent of both parts,

and also remember that the integral of qk should be equal to one.

In this case, it means that Q of plus one,

plus Q of minus one should be equal to one.

We can plug in this formula here.

We'll have,

for yk equal to plus one, the exponent of M times the exponent of the constant,

let's write it down as C,

plus again the same constant C

times e to the power of minus M, and this should be equal to one.

So C should be equal to one over e to the power of M

plus e to the power of minus M. This is the value of the constant.

And finally, we can compute the probabilities.

So the probability that yk equals plus one

equals e to the power of M times this constant C, that is,

e to the power of M over e to the power of M plus e to the power of minus M. What is this function?

What do you think? We can multiply the numerator and the denominator by e to the power of minus M,

and we'll have one over one plus e to the power of minus 2M,

which is actually the sigmoid function of 2M.
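
As a quick numerical sanity check of this identity, the short sketch below compares the normalized form with the sigmoid of 2M for a few arbitrary values of M. The function names `q_plus` and `sigmoid` are illustrative.

```python
import math


def q_plus(M):
    """q(y_k = +1) written as e^M / (e^M + e^-M)."""
    return math.exp(M) / (math.exp(M) + math.exp(-M))


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


# Arbitrary test values of M, chosen only for illustration.
checks = [math.isclose(q_plus(M), sigmoid(2.0 * M)) for M in (-2.0, -0.5, 0.0, 0.3, 1.7)]
```

All entries of `checks` should be true, and at M equal to zero both forms give exactly one half, which matches the case of no information about the node.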

All right, so now we can update the distributions,

however, we need to do one more thing.

Notice here that we used the value of μj.

For the other nodes to be able to use these constants μj,

we need to update the mean for our own node as well.

We need to compute μk, which is the expectation of yk.

It is simply q at plus one

minus q at minus one.

We can again plug in the formulas.

This would be e to the power of M minus e to

the power of minus M, over the normalization constant.

As you may notice,

this actually equals the hyperbolic tangent of M.
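
The same kind of check works for the mean. This small sketch computes the mean as q at plus one minus q at minus one and compares it with the hyperbolic tangent of M; `mean_spin` is a hypothetical helper name.

```python
import math


def mean_spin(M):
    """Expectation of y_k under q_k: (+1) * q(+1) + (-1) * q(-1)."""
    norm = math.exp(M) + math.exp(-M)
    q_pos = math.exp(M) / norm
    q_neg = math.exp(-M) / norm
    return q_pos - q_neg


# Arbitrary test values of M, chosen only for illustration.
matches = [math.isclose(mean_spin(M), math.tanh(M), abs_tol=1e-12) for M in (-3.0, -0.2, 0.0, 0.8, 2.5)]
```

Since q at plus one is the sigmoid of 2M, the mean can equivalently be written as two times that sigmoid minus one.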

Here's our final formula.

Let's go over it again and see how it works.

We iterate over our nodes:

we select some node.

We compute the probabilities Q,

and then update the value of μk.

And while we update the probabilities Q, we also use the values μj.

Also, note that here it should actually be qk,

since we are estimating the value of μk.
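
Putting the pieces together, the procedure could look roughly like the sketch below. It assumes a square lattice without wrap-around, a fixed number of sweeps in raster order instead of a convergence test, and a start from zero means; these choices and the name `mean_field_ising` are assumptions of this illustration, not details given in the lecture.

```python
import math


def neighbors(i, j, n):
    """Return the lattice neighbors of node (i, j) on an n x n grid (no wrap-around)."""
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, c) for a, c in candidates if 0 <= a < n and 0 <= c < n]


def mean_field_ising(b, J, num_sweeps=50):
    """Iterate mu_k = tanh(J * sum of neighbor means + b_k) over the lattice."""
    n = len(b)
    # Start from q(+1) = q(-1) = 0.5 everywhere, that is, all means equal to zero.
    mu = [[0.0] * n for _ in range(n)]
    for _ in range(num_sweeps):
        for i in range(n):
            for j in range(n):
                M = J * sum(mu[a][c] for a, c in neighbors(i, j, n)) + b[i][j]
                mu[i][j] = math.tanh(M)  # uses the latest means of the neighbors
    # q(+1) = sigmoid(2M) = (1 + tanh(M)) / 2
    q_plus = [[0.5 * (1.0 + m) for m in row] for row in mu]
    return mu, q_plus


# Toy field: strong positive in one corner, strong negative in the opposite one.
n = 3
b = [[0.0] * n for _ in range(n)]
b[0][0] = 5.0
b[2][2] = -5.0
mu, q_plus = mean_field_ising(b, J=0.0)
```

With J set to zero the nodes do not talk to each other, so only the two corner nodes that feel the field move away from probability one half.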

Now that we've derived an update formula,

let's see how this one will work for different values of J.

Here's our setup.

We have two areas,

the white one corresponds to the positive external field,

and the black one corresponds to the negative external field.

If J is 0,

then with probability one on the white area,

the spins would tend to be plus one.

On the black area,

the probability would be one for having minus one.

And everywhere else we'll have the probability 0.5 for each possible state.

It happens because there is no interaction between neighboring points since J is 0.

Now take a negative J.

We'll get a chessboard-like field.

The neighboring points would try to have opposite signs,

so there will be blacks and whites next to each other.

As we go further from the external field,

the interaction gets weaker,

and so when we're really far away from the field,

the probability is actually nearly 0.5,

which will indicate that there can be either plus one or minus one.
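
The chessboard pattern and its fading with distance can be reproduced with a small experiment. This is only a sketch under assumed settings: a 5 by 5 lattice, a field applied to a single corner node, a weak negative J of -0.2, and a fixed number of sweeps; `mean_field_means` is a hypothetical helper that applies the update derived above.

```python
import math


def mean_field_means(b, J, num_sweeps=50):
    """Run the updates mu_k = tanh(J * sum of neighbor means + b_k) on a square lattice."""
    n = len(b)
    mu = [[0.0] * n for _ in range(n)]
    for _ in range(num_sweeps):
        for i in range(n):
            for j in range(n):
                total = 0.0
                for a, c in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if 0 <= a < n and 0 <= c < n:
                        total += mu[a][c]
                mu[i][j] = math.tanh(J * total + b[i][j])
    return mu


n = 5
b = [[0.0] * n for _ in range(n)]
b[0][0] = 5.0  # strong positive field on one corner node only (illustrative value)
mu = mean_field_means(b, J=-0.2)
q_far = 0.5 * (1.0 + mu[n - 1][n - 1])  # q(+1) at the node farthest from the field
```

The signs of the means alternate like the squares of a chessboard, while their magnitudes shrink away from the corner, so `q_far` stays close to one half.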

All right. The final example is a strong positive J.

In this case, we'll get a picture like this,

one part would be white, which means that we'll

have plus one with probability one in the upper left corner,

and everywhere else we'll have minus one with probability one.

But actually, this situation should have been symmetric.

Why didn't we get the opposite picture, where the lower right corner would be

black and everything else would be white?

This is actually a property of the KL divergence.

Here I have a [inaudible] bimodal distribution,

and I try to approximate it by minimizing the KL divergence.

There could be two possible cases.

One, on the left, is that we fit one mode,

and the second one is that we fit something in the middle.

What do you think would happen when we minimize the KL divergence

between the bayesian distribution and [inaudible]?

Let's first of all see what are the properties of those two things.

The second one captures the statistics,

so it would have for example the correct mean.

However, the first one has

the very important property that the mode has high probability.

In the second example,

the probability of the mode is really low.

It seems that the mode is actually impossible, and so,

for many practical cases,

the first fit would be nicer and actually this is the case.

Let's see why it happens. All right.

So here's our KL divergence.

It is an integral of q of z times the log of the ratio of q to p-star.

Let's see what happens if q assigns

non-zero probability to a region where p-star has zero probability.

In this case, the KL divergence would have a value of plus infinity.

And so, minimizing the KL divergence would avoid

giving non-zero probability to the regions that are impossible under p-star.

It is called a zero avoiding property of KL divergence,

and it turns out to be useful in many cases.
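
A discrete toy version shows the effect. In this sketch the integral becomes a sum over five bins, and the target and the two candidate fits are made-up numbers chosen only to mimic the picture: a bimodal p-star with an impossible middle bin, one q sitting on a single mode, and one q placed in the middle.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as equal-length lists."""
    total = 0.0
    for q_i, p_i in zip(q, p):
        if q_i == 0.0:
            continue  # a bin with q = 0 contributes nothing
        if p_i == 0.0:
            return math.inf  # q puts mass where p has none
        total += q_i * math.log(q_i / p_i)
    return total


# Assumed toy numbers: bimodal target with zero probability in the middle bin.
p_star = [0.45, 0.05, 0.0, 0.05, 0.45]
q_one_mode = [0.9, 0.1, 0.0, 0.0, 0.0]  # fits the left mode only
q_middle = [0.0, 0.25, 0.5, 0.25, 0.0]  # has the correct mean, sits in the middle

kl_mode = kl_divergence(q_one_mode, p_star)
kl_middle = kl_divergence(q_middle, p_star)
```

Here `kl_mode` is finite (it equals the natural log of two for these numbers), while `kl_middle` is infinite because the middle fit puts mass on a bin that p-star rules out. Minimizing this divergence therefore prefers the single-mode fit.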