And then, we can put this summation with respect to the values of the

latent variable outside of the whole expression,

so we'll have sum with respect to the objects in the data-set,

sum with respect to the values of latent variables from one to three for example,

the weights from the variational distribution times the difference between the logarithms.

So, the logarithm of the marginal likelihood

minus logarithm of the ratio

of the joint distribution,

xi and ti divided by the variational distribution Q.
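In symbols, the expression described so far can be sketched as follows (assuming the gap being derived is the marginal log-likelihood minus the lower bound, written here as L(θ, q), with three latent values as in the example):

```latex
\log p(X \mid \theta) - \mathcal{L}(\theta, q)
  = \sum_{i=1}^{N} \sum_{c=1}^{3} q(t_i = c)
    \left[ \log p(x_i \mid \theta)
         - \log \frac{p(x_i, t_i = c \mid \theta)}{q(t_i = c)} \right]
```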

And, since the logarithm has the property that

the difference between logarithms is the logarithm of the ratio,

we can rewrite this whole expression like this.

So, it equals to the sum with respect to objects,

sum with respect to the values,

getting weights from the variational distribution Q,

times the logarithm of the ratio.

So, logarithm of the marginal likelihood P of xi,

given parameters theta, divided by this ratio.

So, divided by the joint distribution P of xi and ti,

given parameters, and this thing should be divided by Q,

but we can put this Q in the numerator,

because dividing by a ratio is the same as multiplying by its reciprocal.

So, now to simplify this thing,

we can notice that, by the definition of conditional probability,

this part equals to the probability that ti equals c given the data,

given xi and theta,

times the marginal distribution P

of xi, given theta.

And so, these two terms vanish because they appear both in numerator and denominator.
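The factorization and cancellation just described can be sketched as:

```latex
% By the definition of conditional probability:
%   p(x_i, t_i = c \mid \theta) = p(t_i = c \mid x_i, \theta)\, p(x_i \mid \theta)
\frac{p(x_i \mid \theta)\, q(t_i = c)}{p(x_i, t_i = c \mid \theta)}
  = \frac{p(x_i \mid \theta)\, q(t_i = c)}
         {p(t_i = c \mid x_i, \theta)\, p(x_i \mid \theta)}
  = \frac{q(t_i = c)}{p(t_i = c \mid x_i, \theta)}
```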

And finally, we have an expression like this.

I have sum with respect to the objects,

sum with respect to the values of latent variable,

the weights from the variational distribution Q, times the logarithm of Q,

divided

by the distribution of ti.

So, the probability that ti equals c, given the data-point xi,

and the parameter theta.

So, look closer at this final expression.

This thing exactly equals to the Kullback-Leibler divergence between the two distributions.

So, this is a KL-divergence

between Q of ti,

and the posterior distribution P of ti,

given xi and theta.
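Putting the pieces together, the final expression reads (same notation as above):

```latex
\log p(X \mid \theta) - \mathcal{L}(\theta, q)
  = \sum_{i=1}^{N} \sum_{c} q(t_i = c)
    \log \frac{q(t_i = c)}{p(t_i = c \mid x_i, \theta)}
  = \sum_{i=1}^{N} \mathrm{KL}\bigl( q(t_i) \,\|\, p(t_i \mid x_i, \theta) \bigr)
```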

So, to summarize what we have just derived,

the gap between the marginal log likelihood and

the lower bound equals to the sum of Kullback-Leibler divergences.

So, this thing can be written as the sum with respect to the objects in

the data-set of KL divergences between Q of ti,

and the posterior distribution.

And, we want to maximize this lower bound,

so we want to push this lower bound as high as possible,

maximizing it with respect to Q.

Maximizing this expression with respect to Q,

is the same as minimizing minus this expression, right?

So, it's the same as minimizing this thing.

And, note that the marginal log likelihood doesn't depend on Q at all.

So, we can as well minimize this difference.

So, maximizing the lower bound is the same as minimizing this whole difference,

and minimizing this difference is the same as minimizing this sum of KL-divergences

because this is what this difference is.

So, maximizing the lower bound is the same as minimizing

the sum of the KL-divergences with respect to Q.
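As a numerical sanity check of this identity, here is a minimal sketch. The toy prior, likelihood values, and choice of q below are made up purely for illustration; the point is that the gap between the marginal log-likelihood and the lower bound equals the KL divergence for any q.

```python
import numpy as np

# Toy model: one data point x, latent t taking values in {1, 2, 3}.
prior = np.array([0.5, 0.3, 0.2])        # p(t = c | theta)
lik = np.array([0.10, 0.40, 0.05])       # p(x | t = c, theta)
joint = prior * lik                      # p(x, t = c | theta)
marginal = joint.sum()                   # p(x | theta)
posterior = joint / marginal             # p(t = c | x, theta)

def elbo(q):
    """Lower bound: sum_c q(c) * log( p(x, t=c | theta) / q(c) )."""
    return np.sum(q * np.log(joint / q))

def kl(q, p):
    """Kullback-Leibler divergence KL(q || p)."""
    return np.sum(q * np.log(q / p))

q = np.array([0.2, 0.5, 0.3])            # an arbitrary variational distribution
gap = np.log(marginal) - elbo(q)
print(np.isclose(gap, kl(q, posterior)))  # the gap is exactly the KL divergence
```

Trying other valid distributions for `q` gives the same result: the gap is always the KL divergence, and hence never negative.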

And, recall that the KL-divergence has two main properties.

So, first of all, it is always non-negative,

and second, it equals zero whenever the two distributions coincide.

So, whenever these two distributions are the same, which means that we,

by setting Q to be the posterior distribution,

so Q of ti equals to the posterior,

we will optimize this,

we'll minimize this sum to zero,

so to the global optimum.

This sum cannot ever be lower than zero.

So, whenever we are at zero,

we have found the global optimum,

which means we have maximized the lower bound to the global optimum as well.

So, to solve the problem on the E-step,

we just have to set the variational distribution Q to be

the posterior distribution on the latent variable ti given the data and the parameters.
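One way to see this concretely is a sketch of the E-step for a small one-dimensional Gaussian mixture. All parameters and data points below are made up for illustration; setting Q to the posterior (the "responsibilities") makes the lower bound tight:

```python
import numpy as np

# Illustrative 1-D Gaussian mixture with 3 components (parameters are made up).
weights = np.array([0.5, 0.3, 0.2])
means = np.array([-2.0, 0.0, 3.0])
stds = np.array([1.0, 0.5, 1.5])
x = np.array([-1.7, 0.1, 2.8, 0.4])      # data points x_i

def gauss(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# E-step: set q(t_i = c) = p(t_i = c | x_i, theta), the posterior over t_i.
joint = weights * gauss(x[:, None], means, stds)  # p(x_i, t_i = c | theta)
q = joint / joint.sum(axis=1, keepdims=True)      # posterior responsibilities

# With q set to the posterior, the lower bound equals the log-likelihood:
log_lik = np.log(joint.sum(axis=1)).sum()
elbo = np.sum(q * np.log(joint / q))
print(np.isclose(elbo, log_lik))                  # the gap is zero
```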

So, to summarize, the gap between the log likelihood and the lower bound we

have equals the sum of Kullback-Leibler divergences between the distribution Q

and the posterior distribution P of ti,

the distribution of the latent variable given the data we have and the parameters we have.

Which basically means that,

if you want to maximize this lower bound,

it's the same as minimizing minus the lower bound,

and since log likelihood doesn't depend on Q,

it's the same as minimizing this difference,

the left hand side of the expression,

and finally it's same as minimizing this sum of Kullback-Leibler divergences.

And as we know, Kullback-Leibler divergences are non-negative,

and they equal to zero whenever the distributions coincide,

whenever they are the same,

which means that we can minimize this thing to the optimal value by

just setting Q to be the posterior distribution of ti given the data.

So, this is our optimal solution to the E-step.

So, just set Q to be the posterior with the current values of the parameters,

and it minimizes the gap to be zero,

because the KL divergence is now zero,

and so the lower bound becomes accurate at the current point.

The gap is zero,

so the value of the lower bound equals to the value of the log likelihood.