0:00

In this video, I'm going to talk about the mixture of experts model, which was developed in the early 1990s. The idea of this model is to train a number of neural nets, each of which specializes in a different part of the data. That is, we assume we have a data set that comes from a number of different regimes, and we train a system in which one neural net specializes in each regime, and a managing neural net looks at the input data and decides which specialist to give it to.

This kind of system doesn't make very efficient use of data, because the data is fractionated over all these different experts, so with small data sets it can't be expected to do very well. But as data sets get bigger, this kind of system may well come into its own, because it can make very good use of extremely large data sets.

In boosting, the weights on the models are not all equal, but after we finish training, each model has the same weight for every test case. We don't make the weights on the individual models depend on which particular case we're dealing with. In mixture of experts, we do. The idea is that we can look at the input data for a particular case, during both training and testing, to help us decide which model to rely on. During training, this allows models to specialize on a subset of the cases. They then don't learn on cases for which they're not picked, so they can ignore stuff they're not good at modeling. This leads to individual models that are very good at some things and very bad at others.
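To make the contrast with boosting concrete, here's a minimal numpy sketch; the three "experts", the fixed weights, and the tiny gating function are all made up for illustration, not taken from the lecture:

```python
import numpy as np

# Three "experts": fixed linear predictors standing in for trained nets.
experts = [lambda x, w=w: w * x for w in (1.0, -1.0, 0.5)]

x = 2.0
predictions = np.array([f(x) for f in experts])

# Boosting-style combination: weights are fixed after training
# and identical for every test case.
fixed_weights = np.array([0.5, 0.3, 0.2])
boosted = fixed_weights @ predictions

# Mixture-of-experts combination: a manager (gating net) looks at
# the input and produces case-dependent weights via a softmax.
def manager(x):
    logits = np.array([0.1 * x, -0.2 * x, 0.3 * x])  # toy gating net
    e = np.exp(logits - logits.max())
    return e / e.sum()

gate = manager(x)        # depends on x, unlike fixed_weights
moe = gate @ predictions
```

The point of the sketch is only the shape of the computation: `fixed_weights` never changes, while `gate` is recomputed per case.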

2:18

To fit it, you just store the training cases. That's really simple. Then, if you have to predict y from x, you simply find the stored value of x that's closest to the test value of x, and predict the same y as for that stored value. The result is that the curve relating the input to the output consists of lots of horizontal lines connected by cliffs. It would clearly make more sense to smooth things out a bit.
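The nearest-neighbour scheme just described can be sketched in a few lines (illustrative data; `fit` and `predict` are names I've made up):

```python
import numpy as np

# Minimal 1-nearest-neighbour regression: "fitting" just stores the
# training cases, and prediction copies the y of the closest stored x.
def fit(xs, ys):
    return np.asarray(xs, float), np.asarray(ys, float)

def predict(model, x):
    xs, ys = model
    return ys[np.argmin(np.abs(xs - x))]

model = fit([0.0, 1.0, 2.0], [5.0, 7.0, 4.0])
# The prediction is piecewise constant: flat around each stored x,
# with a "cliff" halfway between neighbouring training points.
```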

At the other extreme, we have fully global models, like fitting one polynomial to all the data. They're much harder to fit to data, and they may also be unstable: small changes in the data can cause big changes in the fitted model, because each parameter depends on all the data. In between these two ends of the spectrum, we have multiple local models of intermediate complexity.

3:16

This is good if the data set contains several different regimes, and those regimes have different input-output relationships. In financial data, for example, the state of the economy has a big effect on the mapping between inputs and outputs, so you might want different models for different states of the economy. But you might not know in advance how to decide what constitutes the different states of the economy; you're going to have to learn that too.

4:09

But we don't want to cluster the data based on the similarity of the input vectors alone. What we're interested in is the similarity of the input-output mappings. So if you look at the case on the right, there are four data points that are nicely fitted by the red parabola and another four that are nicely fitted by the green parabola. If you partition the data based on the input-output mapping, that is, based on the idea that a parabola will fit the data nicely, then you partition the data where the brown line is.

4:45

If, however, you partitioned the data by just clustering the inputs, you'd partition where the blue line is, and then to the left of that blue line you'd be stuck with a subset of data that can't be modeled nicely by a simple model.
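Here's a small numpy sketch of that idea, with made-up data rather than the lecture's actual figure: each point is assigned to whichever quadratic predicts its output better, regardless of where its input lies.

```python
import numpy as np

# Two regimes with overlapping, interleaved x-ranges (toy data).
xs = np.array([-2.0, -1.2, 1.2, 2.0, -1.5, -0.5, 0.5, 1.5])
red = lambda x: x**2          # first regime's input-output mapping
green = lambda x: 2 - x**2    # second regime's input-output mapping
ys = np.concatenate([red(xs[:4]), green(xs[4:])])

# Partition by input-output mapping: assign each point to the
# curve that predicts its y better, not to the nearer x-cluster.
err_red = (ys - red(xs))**2
err_green = (ys - green(xs))**2
assign = np.where(err_red < err_green, "red", "green")
# Points generated by the red parabola land in the red group even
# though their x-values are interleaved with the green ones.
```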

So I'm going to explain an error function that encourages models to cooperate, and then I'm going to explain an error function that encourages models to specialize, and I'll try to give you a good intuition for why these two different error functions have such very different effects.

If you want to encourage cooperation, what you should do is compare the average of the predictors with the target, and train all the predictors together to reduce the difference between the target and their average. So, using angle brackets to denote the average over all the predictors, the error is the squared difference between the target and the average of what the predictors predict. Training that way will overfit badly.
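As a sketch with toy numbers: under this cooperative error, every predictor receives the same gradient, determined only by where the average sits relative to the target.

```python
import numpy as np

t = 1.0                         # target for one training case
y = np.array([0.5, 2.5, 0.3])   # outputs of the individual predictors

avg = y.mean()                  # avg = 1.1, slightly above the target
E = (t - avg)**2                # cooperative error: target vs. average

# dE/dy_i = -2 (t - avg) / N: identical for every model, and it
# depends only on the ensemble average, not on model i's own output.
grad = np.full_like(y, -2 * (t - avg) / len(y))
# Model 0's output (0.5) is already below the target, yet its gradient
# is positive, pushing it further down because the average is too high.
```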

5:59

So, if you're averaging models during training, and training so that the average works well, you have to consider cases like this. On the right, we have the average of all the models except model i; that's what everybody else is saying when their votes are averaged together. On the left, we have the output of model i. Now, if we'd like the overall average to be closer to the target, what do we have to do to the output of the i-th model? We have to move it away from the target, because that takes the overall average towards the target. You can see that what's happening is that model i is learning to compensate for the errors made by all the other models. But do we really want to move model i in the wrong direction? Intuitively, it seems better to move model i towards the target. So here is an error function that encourages specialization, and it's not very different. To encourage specialization, we compare the output of each model with the target separately. We also use a manager to determine the weight we put on each model, which we can think of as the probability of picking that model if we have to pick one.

7:22

So now our error is the expectation, over all the different models, of the squared error made by each model times the probability of picking that model, where the manager, or gating network, determines that probability by looking at the input for this particular case. What happens when you minimize this error is that most of the experts end up ignoring most of the training cases. Each expert deals with only a small subset of the cases, and it learns to do very well on that small subset.

8:28

So we have an input. Our different experts all look at that input and make their predictions based on it. In addition, we have a manager. The manager might have multiple layers, and its last layer is a softmax layer, so the manager outputs as many probabilities as there are experts. Using the outputs of the manager and the outputs of the experts, we can then compute the value of the error function.

Now look at the derivatives of that error function. The outputs of the manager are determined by the inputs x_i to the softmax group in its final layer, and the error is determined by the outputs of the experts together with the probabilities output by the manager. If we differentiate the error with respect to the output of an expert, we get a signal for training that expert, and that gradient is just the probability of picking that expert times the difference between what the expert says and the target. So if the manager decides there's a very low probability of picking an expert for a particular training case, that expert gets a very small gradient, and the parameters inside it won't be disturbed by that training case. It can save its parameters for modeling the training cases where the manager gives it a big probability.
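Putting this together for a single training case (toy numbers, with each expert reduced to its scalar output for this case):

```python
import numpy as np

t = 1.0                              # target
y = np.array([0.9, 3.0, -2.0])       # each expert's prediction
logits = np.array([2.0, 0.1, -1.0])  # manager's pre-softmax outputs x_i

p = np.exp(logits - logits.max())
p /= p.sum()                         # manager's softmax: picking probabilities

E = np.sum(p * (t - y)**2)           # expected squared error under the manager

# Gradient w.r.t. each expert's output: dE/dy_i = -2 * p_i * (t - y_i).
grad_y = -2 * p * (t - y)
# Expert 2 has the largest error but a low picking probability, so it
# receives a smaller gradient than expert 1 and its parameters are
# left free for the cases it is actually picked on.
```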

10:03

We can also differentiate with respect to the outputs of the gating network. Actually, what we're going to differentiate with respect to is the quantity that goes into the softmax. That's called the logit, x_i. If we take the derivative with respect to x_i, we get the probability that the expert was picked, times the difference between the squared error made by that expert and the average squared error over all the experts, when you use the weighting provided by the manager. What that means is: if expert i makes a lower squared error than the average of the other experts, we'll try to raise the probability of expert i, and if expert i makes a higher squared error than the average, we'll try to lower its probability. That's what causes specialization.

Now, there's actually a better cost function; it's just more complicated. It depends on mixture models, which I haven't explained in this course. Those are well explained in Andrew Ng's course.
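That logit derivative, dE/dx_i = p_i * ((t - y_i)^2 - E), can be sanity-checked numerically (toy numbers; `softmax` and `error` are helper names I've introduced):

```python
import numpy as np

t = 1.0
y = np.array([0.9, 3.0, -2.0])   # expert outputs
x = np.array([2.0, 0.1, -1.0])   # manager logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def error(x):
    return softmax(x) @ (t - y)**2   # E = sum_j p_j (t - y_j)^2

p = softmax(x)
E = error(x)
analytic = p * ((t - y)**2 - E)      # the derivative from the lecture

# Central finite differences should agree with the analytic expression.
eps = 1e-6
numeric = np.array([(error(x + eps * np.eye(3)[i]) -
                     error(x - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
# Expert 0's squared error is below the weighted average, so its logit
# gradient is negative: gradient descent raises its probability.
```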

I did explain, however, the interpretation of maximum likelihood when you're doing regression: the idea that the network is actually making a Gaussian prediction. That is, the network outputs a particular value, say y_1, and we think of it as making bets about what the target value might be, in the form of a Gaussian distribution around y_1 with unit variance. So the red expert makes a Gaussian distribution of predictions around y_1, and the green expert makes one around y_2. The manager then decides probabilities for the two experts, and those probabilities are used to scale down the Gaussians. The probabilities have to add to one, and they're called mixing proportions. Once we scale down the Gaussians, we get a distribution that's no longer a Gaussian: it's the sum of the scaled-down red Gaussian and the scaled-down green Gaussian. That's the predictive distribution of the mixture of experts. What we want to do now is maximize the log probability of the target value under that black curve, and remember, the black curve is just the sum of the red curve and the green curve.

12:45

And that probability is the sum, over all the experts, of the mixing proportion assigned to that expert by the manager, or gating network, times e to the minus one-half of the squared difference between the target and the output of that expert, scaled by the normalization term for a Gaussian with a variance of one. So our cost function is simply the negative log of that probability, and we're going to try to minimize that negative log probability.
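That predictive distribution and its negative log probability can be written out directly (toy expert outputs and mixing proportions):

```python
import numpy as np

# Mixture-of-Gaussians negative log likelihood for one case: each
# expert proposes a unit-variance Gaussian around its output, and the
# manager's mixing proportions scale the Gaussians before summing.
t = 1.0
y = np.array([0.5, 2.0])         # red and green experts' outputs
mix = np.array([0.7, 0.3])       # mixing proportions; must sum to one

norm = 1.0 / np.sqrt(2 * np.pi)  # normalizer for a unit-variance Gaussian
p_t = np.sum(mix * norm * np.exp(-0.5 * (t - y)**2))
cost = -np.log(p_t)              # the negative log probability we minimize
```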
