To better understand the EM algorithm, let's consider one example, and that example is going to be a mixture of two Gaussian distributions, so a two-component Gaussian mixture. For that distribution, the density of the mixture is going to depend on five parameters. First, we have the weight Omega that corresponds to the first component, and that component has a Gaussian density: 1 over the square root of 2 Pi times Sigma, times exp of minus one-half times x minus Mu_1 divided by Sigma, squared. Then, because we have only two components, the weight associated with the second component has to be just 1 minus Omega, and the second component is again 1 over the square root of 2 Pi times Sigma, times exp of minus one-half times x minus Mu_2 divided by Sigma, squared. So as you can see, this is a two-component mixture, and each one of the components is Gaussian. The main feature is that we allow the locations of the two components to be different, but we keep the variances of those two components the same. So this is a particular case of a mixture of two normals where we're forcing the variances to be the same.

If we think about it in terms of the notation that we have been using before, what we're doing here is essentially saying that our g of x given Delta_k takes the form 1 over the square root of 2 Pi times Sigma, times exp of minus one-half times x minus Mu_k divided by Sigma, squared, and this Delta_k corresponds to Mu_k and Sigma together. So in this case, Delta_k is a vector with two components that correspond to those two parameters of the Gaussian distribution.

Once we write the model in this way and we recognize what our g function is in this case, it's very easy to write the weights for the EM algorithm. Those are the V_i,1 at iteration s. Remember that those are essentially an application of Bayes theorem: if I knew what the parameters are for each one of the components, what is the probability that the observation was generated from component one? In this case, that is simply Omega hat s, times 1 over the square root of 2 Pi times Sigma hat s, times exp of minus one-half times X_i (because now we're talking about observation i) minus Mu_1 hat s, divided by Sigma hat s, squared. So this is just the kernel g evaluated at the current maximum likelihood estimates of the parameters. This has to be divided by, first of all, that same quantity, Omega hat s times 1 over the square root of 2 Pi times Sigma hat s, times exp of minus one-half times X_i minus Mu_1 hat s divided by Sigma hat s, squared, plus the kernel for the second component multiplied by its associated prior probability. That is 1 minus Omega hat s, times 1 over the square root of 2 Pi times Sigma hat s, times exp of minus one-half times X_i minus Mu_2 hat s divided by Sigma hat s, squared. So this is the form of the weight associated with the i-th observation and the first component. Of course we also need V_i,2 at iteration s, but because there are only two components, we know that is just 1 minus V_i,1 at iteration s. So once we have the first calculation, it's very easy to get the weight associated with the second component for that same observation.
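Before moving on, it may help to collect the pieces so far in symbols. This is just the model and the E-step weight described above written compactly, with x_1, ..., x_n denoting the observations:

\[
f(x \mid \omega, \mu_1, \mu_2, \sigma) = \omega \, \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2}\left( \frac{x-\mu_1}{\sigma} \right)^2 \right\} + (1-\omega)\, \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2}\left( \frac{x-\mu_2}{\sigma} \right)^2 \right\}
\]

\[
v_{i,1}^{(s)} = \frac{ \hat{\omega}^{(s)} \, \frac{1}{\sqrt{2\pi}\,\hat{\sigma}^{(s)}} \exp\left\{ -\frac{1}{2}\left( \frac{x_i-\hat{\mu}_1^{(s)}}{\hat{\sigma}^{(s)}} \right)^2 \right\} }{ \hat{\omega}^{(s)} \, \frac{1}{\sqrt{2\pi}\,\hat{\sigma}^{(s)}} \exp\left\{ -\frac{1}{2}\left( \frac{x_i-\hat{\mu}_1^{(s)}}{\hat{\sigma}^{(s)}} \right)^2 \right\} + (1-\hat{\omega}^{(s)}) \, \frac{1}{\sqrt{2\pi}\,\hat{\sigma}^{(s)}} \exp\left\{ -\frac{1}{2}\left( \frac{x_i-\hat{\mu}_2^{(s)}}{\hat{\sigma}^{(s)}} \right)^2 \right\} }, \qquad v_{i,2}^{(s)} = 1 - v_{i,1}^{(s)}
\]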
The next thing that we need to do is write the Q function for this particular model, which is the function we're going to optimize. Remember that the Q function is a function of the parameters, in this case Omega, Mu_1, Mu_2, and Sigma, given the current estimates of those parameters, that is, given Omega hat s, Mu_1 hat s, Mu_2 hat s, and Sigma hat s. If you remember, what we're doing here is taking the complete-data log-likelihood and taking its expectation with respect to the membership indicators; that expectation is computed using the Vs that we just wrote down. That gives us a function that tells us where the new values of the parameters should be given the previous values, and that's why the algorithm is iterative. Just using the expressions that we have derived before, we know that Q is going to be a sum over the observations, because it comes from a log-likelihood, and it has two pieces. The first piece has to do with component one, so it's going to be V_i,1 of s multiplying the corresponding log term: the logarithm of Omega plus the logarithm of the Gaussian density, which is minus one-half log 2 Pi, minus log of Sigma, minus one-half times X_i minus Mu_1 divided by Sigma, squared. That's the first piece. The second piece comes from the second component, so it has V_i,2 of s times the log of 1 minus Omega, minus one-half log 2 Pi, minus log of Sigma, minus one-half times X_i minus Mu_2 divided by Sigma, squared. So this is the function that we need to optimize. The Omega hat, Mu_1 hat, Mu_2 hat, and Sigma hat from iteration s are hidden in the Vs that we just computed; I'm not writing them explicitly, but they are inside those two Vs. That's how we're conditioning on them: we're using them to compute these numbers.

Now what we need to do is maximize this function with respect to Omega, with respect to Sigma, with respect to Mu_1, and with respect to Mu_2. Let's start by computing the derivative of Q with respect to Omega and setting it equal to 0. In this case, the derivative of Q with respect to Omega is very easy to compute. Going through the sum from one to n, I get V_i,1 of s times the derivative of the first piece; the only part of that piece that depends on Omega is the log of Omega, which gives me 1 over Omega. Then, in the second piece, I get V_i,2 of s times the derivative of the log of 1 minus Omega, which is 1 divided by 1 minus Omega, multiplied by minus 1 because of the inner derivative. I need to set this quantity equal to 0. That implies that the sum from one to n of V_i,1 of s divided by Omega has to be the same as the sum from i equals one to n of V_i,2 of s divided by 1 minus Omega; the minus sign disappears because I'm moving that term to the right-hand side. Another way to write this, cross-multiplying, is to say that Omega divided by 1 minus Omega has to be equal to the sum from one to n of V_i,1 of s divided by the sum from i equals one to n of V_i,2 of s. With a little bit of additional algebra, we get that Omega has to be equal to the sum from one to n of V_i,1 of s divided by the sum from one to n of V_i,1 of s plus V_i,2 of s. But remember that V_i,1 of s and V_i,2 of s have to add up to one, because we defined V_i,2 as one minus V_i,1. So in the denominator each one of the terms is equal to one, and I'm summing n of them, which means the denominator is just n. So I get one over n times the sum from one to n of V_i,1 of s, which has a very nice interpretation: it is the average, over all the observations, of the probability of belonging to component one.
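Written out, the Q function and the update for Omega that we just derived are:

\[
Q(\omega, \mu_1, \mu_2, \sigma \mid \hat{\omega}^{(s)}, \hat{\mu}_1^{(s)}, \hat{\mu}_2^{(s)}, \hat{\sigma}^{(s)}) = \sum_{i=1}^{n} \left[ v_{i,1}^{(s)} \left( \log \omega - \frac{1}{2}\log 2\pi - \log \sigma - \frac{1}{2}\left( \frac{x_i-\mu_1}{\sigma} \right)^2 \right) + v_{i,2}^{(s)} \left( \log(1-\omega) - \frac{1}{2}\log 2\pi - \log \sigma - \frac{1}{2}\left( \frac{x_i-\mu_2}{\sigma} \right)^2 \right) \right]
\]

\[
\frac{\partial Q}{\partial \omega} = \sum_{i=1}^{n} \frac{v_{i,1}^{(s)}}{\omega} - \sum_{i=1}^{n} \frac{v_{i,2}^{(s)}}{1-\omega} = 0 \quad\Longrightarrow\quad \hat{\omega}^{(s+1)} = \frac{1}{n} \sum_{i=1}^{n} v_{i,1}^{(s)}
\]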
Let's compute now the derivative with respect to Mu_1. I'm just going to do Mu_1, because the computation for Mu_2 is going to be pretty much identical. Computing the derivative of Q with respect to Mu_1 is very easy. We have the sum from one to n of V_i,1 of s; inside the first piece, the only term that depends on Mu_1 is the quadratic term, and in the second piece there is nothing that depends on Mu_1. So what we get is minus one-half, times the two that comes down from the exponent, times X_i minus Mu_1 divided by Sigma squared, times a minus one from the inner derivative, and we need to set this equal to zero. Now, the two and the one-half cancel out, the two minus signs cancel, and Sigma, we know, has to be greater than zero, so we can move it to the right-hand side, multiplying. So the whole expression simplifies to requiring that the sum from one to n of V_i,1 of s times, and this is the important part, X_i minus Mu_1 be equal to zero. But now this is very easy to solve. If I break it into both sides, the sum from one to n of V_i,1 of s multiplied by Mu_1 has to be equal to the sum from one to n of V_i,1 of s multiplied by X_i. Mu_1 doesn't depend on i, so it can come out of the sum. That leaves me with Mu_1 being equal to the sum from one to n of V_i,1 of s times X_i, divided by the sum from one to n of V_i,1 of s. This expression has a very nice interpretation: you're looking at a weighted average of the observations, where the weights are given by the V_i,1 of s. If you remember, each weight is just the probability that the observation was generated by that component, or rather our current estimate of that probability. So observations that are more likely to come from component one have a heavier weight in the weighted average for the mean of component one. In other words, observations that are, in some sense, closer to our current estimate of the center of the component have a higher weight when computing the next value for that center. The expression for Mu_2 is going to be very similar; the main difference is that instead of using V_i,1 of s, you are going to be using V_i,2 of s. I suggest that you do the calculation and make sure that you get exactly the same expression with that small change.
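Written out, the mean updates we just described are:

\[
\sum_{i=1}^{n} v_{i,1}^{(s)} (x_i - \mu_1) = 0 \quad\Longrightarrow\quad \hat{\mu}_1^{(s+1)} = \frac{\sum_{i=1}^{n} v_{i,1}^{(s)} x_i}{\sum_{i=1}^{n} v_{i,1}^{(s)}}, \qquad \hat{\mu}_2^{(s+1)} = \frac{\sum_{i=1}^{n} v_{i,2}^{(s)} x_i}{\sum_{i=1}^{n} v_{i,2}^{(s)}}
\]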
To complete the algorithm, we still need to work out what the new maximum likelihood estimator for Sigma is. To do that, we need to compute the derivative of Q with respect to Sigma. Here we have to be a little bit careful, because Sigma appears in a number of places in the expression. As always, we have our sum over the observations. In the first piece we have V_i,1 of s; the log of Omega and the minus one-half log 2 Pi terms do not depend on Sigma, so the derivative of minus log Sigma gives minus one over Sigma, and then we have the minus one-half in front of the quadratic term. In that term, Sigma appears squared in the denominator, so we can think about it as Sigma to the minus two; taking the derivative brings down a minus two, gives one over Sigma cubed, and leaves us with X_i minus Mu_1 squared. That is the first piece of the expression. The second piece is very similar: we have V_i,2 of s, and again the first two terms do not depend on Sigma, the third just gives you minus one over Sigma, and then we have minus one-half, times the minus two coming from the Sigma to the minus two, times one over Sigma cubed, times X_i minus Mu_2 squared. We eventually need to set this expression equal to zero, so let's simplify it a little bit by pulling together terms that look the same.

Let's pull together the terms that have one over Sigma in them, and then the other terms. We have the sum from one to n of minus one over Sigma, multiplying V_i,1 of s plus V_i,2 of s. The other terms again look very similar: the minus one-half multiplied by the minus two gives me a plus one, so I have plus one over Sigma cubed, multiplying V_i,1 of s times X_i minus Mu_1 squared plus V_i,2 of s times X_i minus Mu_2 squared. So I'm just reorganizing the terms. Now, this is very nice, because if I want to set this equal to zero, one of the things I can do is cancel the one over Sigma against the one over Sigma cubed, which leaves one over Sigma squared. One term is negative and the other one is positive, so I can move them to opposite sides of the equal sign. I end up with the sum from one to n of V_i,1 of s plus V_i,2 of s being equal to the sum from one to n of one over Sigma squared times V_i,1 of s times X_i minus Mu_1 squared plus V_i,2 of s times X_i minus Mu_2 squared. Now this ends up being very simple: I can move the one over Sigma squared to the other side, and the other thing that I know is that those two weights add up to one by definition, so the left-hand side is just n. At the end we get that Sigma squared has to be equal to one over n times the sum from one to n of V_i,1 of s times X_i minus Mu_1 squared plus V_i,2 of s times X_i minus Mu_2 squared. Again, this has a very nice interpretation: it is, in some sense, a weighted average of the variances that you would get by computing the squared deviations only around mean one, or only around mean two, and the weights are based on how likely each observation is to come from each one of the two components.
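Written out, the variance update is

\[
\left(\hat{\sigma}^{(s+1)}\right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left[ v_{i,1}^{(s)} \left( x_i - \hat{\mu}_1^{(s+1)} \right)^2 + v_{i,2}^{(s)} \left( x_i - \hat{\mu}_2^{(s+1)} \right)^2 \right]
\]

To tie the steps together, here is a minimal sketch of the resulting EM iteration in Python. This code is not part of the lecture; the function name em_two_component, the initialization choices, and the fixed number of iterations are my own, and it assumes NumPy is available.

```python
import numpy as np

def em_two_component(x, n_iter=100):
    """EM for a two-component Gaussian mixture with a common variance.

    Parameters of the model: omega (weight of component 1), mu1, mu2, sigma.
    Returns the estimates after n_iter iterations.
    """
    x = np.asarray(x, dtype=float)
    n = x.size

    # Crude starting values; any reasonable initialization works.
    omega = 0.5
    mu1, mu2 = x.min(), x.max()
    sigma = x.std()

    for _ in range(n_iter):
        # E-step: v1[i] = P(observation i came from component 1 | current estimates),
        # an application of Bayes' theorem with Gaussian kernels.
        k1 = omega * np.exp(-0.5 * ((x - mu1) / sigma) ** 2)
        k2 = (1.0 - omega) * np.exp(-0.5 * ((x - mu2) / sigma) ** 2)
        v1 = k1 / (k1 + k2)          # the common 1/(sqrt(2*pi)*sigma) factor cancels
        v2 = 1.0 - v1

        # M-step: the closed-form updates derived above.
        omega = v1.mean()                          # average membership probability
        mu1 = np.sum(v1 * x) / np.sum(v1)          # weighted averages of the data
        mu2 = np.sum(v2 * x) / np.sum(v2)
        sigma = np.sqrt(np.mean(v1 * (x - mu1) ** 2 + v2 * (x - mu2) ** 2))

    return omega, mu1, mu2, sigma
```

With data simulated from two reasonably separated normals that share a variance, the estimates should settle near the true values; in practice one would also monitor the observed-data log-likelihood and stop when it no longer increases, rather than running a fixed number of iterations.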