0:00

We described the expectation maximization algorithm and gave the formulas for how to use it, but we didn't talk about how it might perform in practice, so let's talk about that a little bit.

First, let's look at how EM behaves as a function of the number of iterations. Over here on the x-axis we see the number of iterations, going from zero to 50, and on the y-axis we have the log-likelihood per training instance, that is, averaged over all the training instances, which is the objective that the algorithm is trying to optimize. Remember, EM is a maximum likelihood optimization algorithm. So what we're showing here is effectively how the objective improves, and the first observation is something we effectively stated as a theorem: the log-likelihood increases consistently, monotonically, over the iterations t. So that's good; that shows our theorem is correct.
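The monotone improvement can be seen in a small sketch. This is an illustrative two-component 1-D Gaussian mixture of my own (not the model behind the lecture's figure), tracking the average training log-likelihood at each iteration:

```python
import numpy as np

def em_gmm_1d(x, n_iters=50):
    """Sketch: EM for a two-component 1-D Gaussian mixture, recording the
    average training log-likelihood (the objective EM optimizes) per iteration."""
    # Crude initialization: means at the data's extremes, shared unit variance.
    pi, mu, var = 0.5, np.array([x.min(), x.max()]), np.array([1.0, 1.0])
    avg_ll = []
    for _ in range(n_iters):
        # E-step: responsibilities of each component for each point.
        w = np.array([pi, 1 - pi])
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        avg_ll.append(np.log(dens.sum(axis=1)).mean())
        # M-step: weighted maximum-likelihood updates.
        nk = resp.sum(axis=0)
        pi = nk[0] / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return avg_ll
```

By the EM guarantee, each recorded value is at least as large as the previous one, up to floating-point noise.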

But let's understand what happens as the algorithm continues to run. We see that in the earlier iterations, up to about, I don't know, iteration eight or ten in this example, the log-likelihood exhibits a very sharp improvement. But then it asymptotes, and so we might think that when we get to iteration ten or so, right around here, there's no point continuing to run, because there's no more increase: we've achieved convergence and we might as well stop.

It turns out that's not quite right. What we see here, for the same model, is a graph where the x-axis is again iterations and the y-axis is the value of different parameters in the model. Over here we see the same iteration 10, and we see that although the log-likelihood no longer continues to improve after iteration 10, as we can see on the left-hand graph, various parameter values actually exhibit extremely sharp changes after iteration ten: specifically, the blue parameter goes down significantly, and the black parameter goes up significantly. And so we see that convergence in likelihood space, which is what we have on the left, is not the same as convergence in parameter space, which means that if we're trying to gauge convergence, we probably ought to look at convergence in parameter space. However, it's also important to realize that carrying EM through to convergence is not necessarily the best idea.
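One way to make this distinction concrete is to compute both gaps side by side. This is a small illustrative helper of my own, not anything from the lecture; the point is that the two tests can disagree:

```python
import numpy as np

def convergence_gaps(ll_old, ll_new, theta_old, theta_new):
    """Sketch: compare a likelihood-space convergence test with a
    parameter-space one. The log-likelihood can flatten out while
    individual parameters are still moving sharply, so the
    parameter-space gap is the stricter check."""
    ll_gap = abs(ll_new - ll_old)
    param_gap = float(np.max(np.abs(np.asarray(theta_new) - np.asarray(theta_old))))
    return ll_gap, param_gap
```

A run can show a tiny `ll_gap` while `param_gap` is still large, which is exactly the situation in the two graphs.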

So what we see here, on the left, is the same graph of the training log-likelihood per instance, but on the right-hand side we look at the log-likelihood per instance on test instances, that is, instances that the algorithm was not trained on. And what we see is that while the log-likelihood of test instances also increases sharply up to about iteration ten, after that it starts to decrease gradually. So, in fact, as the algorithm continues to modify its parameters and slightly improve the training log-likelihood, the test log-likelihood in this particular case actually goes down. So this is an example of overfitting; in this case it's numerical overfitting to the specific data instances that we have, or to the noise in those data instances.

In order to avoid that, there are two main strategies that people adopt. The first is to use a validation set, or cross-validation, to determine the time T at which we should stop doing iterations. So we basically measure the log-likelihood on the held-out set and say, okay, at this point we can see that it's starting to go down, and so we should just stop doing iterations. This is potentially a computationally costly procedure, however, because evaluating the held-out log-likelihood means that we need to run inference on the held-out instances, and that can be expensive.
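The early-stopping idea can be sketched as a wrapper around the EM loop. The interface here is a hypothetical one of my own choosing: `em_step` performs one EM iteration, and `val_loglik` runs inference on the held-out data and returns its average log-likelihood.

```python
def em_with_early_stopping(em_step, val_loglik, theta, max_iters=100):
    # Hypothetical interface (an assumption, not a fixed API):
    #   em_step(theta)    -> parameters after one EM iteration
    #   val_loglik(theta) -> average log-likelihood on held-out instances
    best_theta, best_ll = theta, val_loglik(theta)
    for _ in range(max_iters):
        theta = em_step(theta)
        ll = val_loglik(theta)
        if ll < best_ll:  # held-out likelihood started to drop: stop here
            break
        best_theta, best_ll = theta, ll
    return best_theta
```

The cost caveat from above shows up in `val_loglik`: each call requires inference over the held-out instances.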

So a cheaper strategy is to use some kind of MAP estimation rather than maximum likelihood estimation, with the same benefits that we saw when doing estimation in the complete data case: basically, a prior smooths out the estimates and makes them less susceptible to noise in the training data. So that's one aspect of the behavior of the expectation maximization algorithm. And by the way, the same phenomena will also happen if we use gradient ascent, so this is really orthogonal in many ways to the specifics of the optimization algorithm, except for the very sharp increase at the beginning, which is very characteristic of EM.
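For a multinomial parameter, the MAP smoothing amounts to adding Dirichlet pseudo-counts in the M-step. This is a minimal sketch; the symmetric prior and its strength `alpha` are my own illustrative assumptions:

```python
import numpy as np

def map_mstep_multinomial(expected_counts, alpha=1.0):
    """Sketch: M-step for a multinomial parameter under a symmetric
    Dirichlet prior. Adding alpha pseudo-counts to the expected
    sufficient statistics from the E-step smooths the estimate toward
    uniform, making it less susceptible to noise than the raw ML estimate."""
    counts = np.asarray(expected_counts, dtype=float) + alpha
    return counts / counts.sum()
```

Note that, unlike the ML estimate, the smoothed estimate never assigns zero probability to an outcome that happened not to appear in the expected counts.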

A very different problem that we already mentioned before is the fact that both EM and gradient ascent effectively achieve only a local optimum of the log-likelihood function. So what we see over here, on the x-axis, is the number of samples M that we're training on. This is not iterations anymore; this is the number of samples. And the y-axis is the number of distinct local optima that the algorithm found over 25 runs, that is, 25 applications of the EM algorithm initialized at different random starting points.

What we see here is that we have three lines. The blue line represents 25% missing data. The red line is 50% missing data, so 50% of the entries were omitted at random. And the black line is when we have a hidden variable; in that case the fraction of missing data is actually perhaps not as large, but there is one variable that we never get to observe in any of the instances.

What we see, in both the blue line for the 25% and the red line for the 50%, is that initially we have a very large number of local optima; in fact, initially, pretty much every run of EM finds a distinct local optimum. But then, as we have more and more training instances, the algorithm becomes better able to identify a smaller number of optima that come to the forefront, and the other ones become less dominant or disappear. Now, the more missing data we have, as in the red line, the slower this process of reducing the number of local optima. And when we have hidden variables, which is the black line over here, the local optima never go away.

Now, you might say, fine, we have local optima, but maybe we don't care. Maybe these local optima are all about the same, and it doesn't really matter which one we land at.

Unfortunately, that doesn't turn out to be the case. Here we have, on the x-axis, the log-likelihood value that might be achieved on the training set by a run of expectation maximization, and on the y-axis, the percent of runs that achieve that particular log-likelihood value. And we can see that there is huge variation in the log-likelihood values that are achieved by different runs of the expectation maximization algorithm. Which brings us to the point that initialization is absolutely critical to the performance of an algorithm such as expectation maximization. And there are different strategies that people adopt in order to try and get to a better local optimum.

First, and perhaps easiest to implement in some sense, is the use of multiple random restarts. So we randomly sample parameters, for example, initialize EM at these different points, run EM to convergence, and then pick the run of EM that has the highest training log-likelihood, which is, after all, the objective that we're trying to optimize.
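The random-restart strategy above can be sketched as a short loop. The interface is hypothetical: `init_params(rng)` samples a random starting point, and `run_em(theta0)` runs EM to convergence, returning the final parameters and their training log-likelihood.

```python
import numpy as np

def em_random_restarts(run_em, init_params, n_restarts=25, seed=0):
    # Hypothetical interface (an assumption, not a fixed API):
    #   init_params(rng) -> a randomly sampled starting point
    #   run_em(theta0)   -> (theta_converged, training log-likelihood)
    rng = np.random.default_rng(seed)
    best_theta, best_ll = None, -np.inf
    for _ in range(n_restarts):
        theta, ll = run_em(init_params(rng))
        if ll > best_ll:  # keep the run with the highest training log-likelihood
            best_theta, best_ll = theta, ll
    return best_theta, best_ll
```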

In some cases, we might have some amount of prior knowledge that we can use to initialize the algorithm. For example, we might have some prior knowledge about the parameters, or, if we're initializing with the E-step, we might have some prior knowledge about the assignment of instances to clusters, from some other source, say demographic data for people, or something like that.

And finally, a common and quite effective initialization strategy is to use a simpler algorithm rather than full-scale EM, because it turns out that while EM is very powerful, that also makes it more susceptible to getting stuck in certain local optima. For example, in the context of missing or hidden variables, we might use a simpler clustering algorithm that you might have heard about in a different context, such as k-means or [inaudible] clustering, which is a simpler algorithm, not as powerful, but also somewhat less susceptible to local optima. And so we might initialize from that and then run EM to get a better output than these simpler algorithms might give rise to.
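A k-means initialization for a mixture model can be sketched as follows. The farthest-point seeding and the conversion of hard assignments into initial mixture weights are my own illustrative choices, not prescribed by the lecture:

```python
import numpy as np

def kmeans_init(X, k, n_iters=20, seed=0):
    """Sketch: a few rounds of k-means (Lloyd's algorithm), whose hard
    cluster assignments can then seed the parameters for a full EM run."""
    rng = np.random.default_rng(seed)
    # Farthest-point seeding: start from one random point, then repeatedly
    # add the point farthest from the centers chosen so far.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min(((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iters):
        # Hard assignment: each point goes to its nearest center.
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # Turn the hard assignments into initial mixture parameters.
    weights = np.bincount(labels, minlength=k) / len(X)
    return centers, weights, labels
```

The returned centers and weights would serve as the starting point for EM, which then refines them with soft assignments.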

So, to summarize, we've seen that convergence of the likelihood function is not the same as convergence of the parameters, and so we have to decide which one we actually care about. We've also seen that running the algorithm to convergence, all the way to convergence, can lead to numerical overfitting, and we talked about strategies to avoid this problem. We talked about local optima and the fact that they're unavoidable, and, furthermore, that they increase with the amount of missing data: the more missing data we have, the more local optima we have. They decrease with the total amount of data, but not if we have hidden variables, at which point we're stuck with them no matter how much data we have. We also showed that local optima are not a fake problem; that is, which local optimum we land in can actually be a significant factor in the performance the algorithm achieves in terms of log-likelihood, and, therefore, smart initialization often turns out to be critical for the performance of algorithms that learn with latent variables and with missing data.
