So, I just showed you that empirically the likelihood will converge, but theoretically it can also be proved that the EM algorithm will converge to a local maximum. So here's just an illustration of what happens; a detailed explanation would require some inequalities that we haven't really covered yet (a generic sketch is given at the end of this lecture). Here, what you see on the x-axis is the parameter value, and on the y-axis we see the likelihood function. So this curve is the original likelihood function, and this is the one that we hope to maximize: we hope to find the parameter value at this point that maximizes it. But in the case of the mixture model we cannot easily find an analytic solution to the problem, so we have to resort to numerical algorithms, and the EM algorithm is such an algorithm. It's a hill-climbing algorithm. That means you start with some random guess, let's say you start from here, that's your starting point, and then you try to improve it by moving to another point where you have a higher likelihood. So that's the idea of hill climbing.

In the EM algorithm, the way we achieve this is to do two things. First, we fit a lower bound of the likelihood function. So this is the lower bound, see here. And once we have fit the lower bound, we can then maximize the lower bound. And of course, the reason why this works is that the lower bound is much easier to optimize. So our current guess is here, and by maximizing the lower bound, we move this point to the top, to here, right? And when we then map back to the original likelihood function, we find this point. Because it's a lower bound, we are guaranteed to improve the guess: we improved the lower bound, so the original likelihood curve, which sits above this lower bound, will definitely be improved as well. So, in our example, the current guess is the parameter value given by the current iteration, and the next guess is the re-estimated parameter value. From this illustration you can see the next guess is always better than the current guess, unless it has reached a local maximum, where it will be stuck there and the two will be equal.

So, the E-step is basically to compute this lower bound. We don't directly compute the likelihood function; instead we compute the expected values of the hidden variables, and these are basically a part of this lower bound, so this helps determine the lower bound. The M-step, on the other hand, is to maximize the lower bound; it allows us to move the parameters to a new point. And that's why the EM algorithm is guaranteed to converge to a local maximum.

Now, as you can imagine, when we have many local maxima, we also have to repeat the EM algorithm multiple times in order to figure out which one is the actual global maximum, and this in general is a difficult problem in numerical optimization. So here, for example, had we started from here, then we would gradually climb up to this top. That's not optimal; we'd like to climb all the way up to here, and the only way to climb up to here is to start from somewhere here or here. So, in the EM algorithm, we generally have to start from different points, or have some other way to determine a good initial starting point.

To summarize, in this lecture we introduced the EM algorithm.
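Here is the generic sketch promised above: the standard lower bound used to prove that EM never decreases the likelihood. The notation (X for the observed data, Z for the hidden variables, θ for the parameters, q for an auxiliary distribution) is a common textbook formulation assumed here for illustration; it is not notation stated in the lecture itself.

```latex
% Jensen's inequality gives a lower bound on the log-likelihood for any
% distribution q(Z) over the hidden variables:
\log p(X \mid \theta)
  = \log \sum_{Z} q(Z)\,\frac{p(X, Z \mid \theta)}{q(Z)}
  \;\ge\; \sum_{Z} q(Z)\,\log \frac{p(X, Z \mid \theta)}{q(Z)}

% E-step: set q(Z) = p(Z \mid X, \theta^{(n)}), which makes the bound tight
% (equal to the log-likelihood) at the current guess \theta^{(n)}.
% M-step: \theta^{(n+1)} = \arg\max_{\theta}
%   \sum_{Z} p(Z \mid X, \theta^{(n)}) \, \log p(X, Z \mid \theta)

% Because the bound touches the true log-likelihood at \theta^{(n)} and is
% then maximized over \theta, the log-likelihood at \theta^{(n+1)} can only
% increase or stay the same.
```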
The EM algorithm is a general algorithm for computing maximum likelihood estimates of all kinds of models, so not just for our simple mixture model. It's a hill-climbing algorithm, so it can only converge to a local maximum, and the result will depend on the initial point. The general idea is that we have two steps to improve the estimate of the parameters. In the E-step we roughly augment the data by predicting the values of useful hidden variables that we can use to simplify the estimation; in our case, the hidden variable is which distribution has been used to generate the word. In the M-step we then exploit the augmented data, which makes it easier to estimate the distribution, to improve the estimate of the parameters. Here "improve" is guaranteed in terms of the likelihood function. Note that the parameter values will not necessarily converge to stable values even though the likelihood function is ensured to increase; some additional conditions have to be satisfied for the parameters themselves to converge. Now, here data augmentation is done probabilistically. That means we're not going to say exactly what the value of a hidden variable is; instead, we're going to have a probability distribution over the possible values of these hidden variables. So this splits the counts of events probabilistically, and in our case we'll split the word counts between the two distributions.
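As a concrete illustration of splitting word counts between two distributions, here is a minimal Python sketch of EM for a simple two-component word mixture of the kind described above. The specific setup (a known, fixed background distribution p_B, a fixed mixing weight lam, and all function and variable names) is assumed for illustration and is not taken from the lecture.

```python
# Minimal EM sketch for a two-component word mixture:
# a known background distribution p_B(w) and an unknown topic distribution
# p_T(w), mixed with a fixed weight lam for the topic component.

def em_two_component(counts, p_B, lam=0.5, n_iter=50):
    """counts: dict word -> count in the document.
    p_B: dict word -> background probability (assumed known and fixed).
    Returns the estimated topic distribution p_T: dict word -> probability."""
    vocab = list(counts)
    # Initialize the topic distribution as uniform over the vocabulary.
    p_T = {w: 1.0 / len(vocab) for w in vocab}

    for _ in range(n_iter):
        # E-step: for each word, compute the probability that it was generated
        # by the topic distribution (the hidden variable), i.e. decide how to
        # split its count probabilistically between the two distributions.
        z_topic = {}
        for w in vocab:
            topic_part = lam * p_T[w]
            background_part = (1.0 - lam) * p_B[w]
            z_topic[w] = topic_part / (topic_part + background_part)

        # M-step: re-estimate p_T from the fractional counts assigned to the
        # topic distribution; this maximizes the lower bound.
        assigned = {w: counts[w] * z_topic[w] for w in vocab}
        total = sum(assigned.values())
        p_T = {w: assigned[w] / total for w in vocab}

    return p_T


if __name__ == "__main__":
    # Tiny made-up example: a common word dominates the background,
    # so its count is mostly assigned to the background distribution.
    counts = {"the": 10, "text": 4, "mining": 3}
    p_B = {"the": 0.8, "text": 0.1, "mining": 0.1}
    print(em_two_component(counts, p_B))
```

The E-step computes, for each word, the probability that it came from the topic distribution rather than the background; the M-step renormalizes the fractional counts assigned to the topic distribution to get the new parameter estimates.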