0:00

In this video, I'm going to describe a new way of combining a very large number of neural network models without having to separately train a very large number of models. This is a method called dropout that's recently been very successful in winning competitions.

For each training case, we randomly omit some of the hidden units, so we end up with a different architecture for each training case. We can think of this as having a different model for every training case. And then the question is, how could we possibly train a model on only one training case, and how could we average all these models together efficiently at test time? The answer is that we use a great deal of weight sharing.

I want to start by describing two different ways of combining the outputs of multiple models.

In a mixture, we combine models by averaging their output probabilities. So if model A assigns probabilities of 0.3, 0.2 and 0.5 to three different answers, and model B assigns probabilities of 0.1, 0.8 and 0.1, the combined model simply assigns the averages of those probabilities.

A different way of combining models is to use a product of the probabilities. Here, we take a geometric mean of the same probabilities. So model A and model B again assign the same probabilities as they did before, but now we multiply each pair of probabilities together and then take the square root. That's the geometric mean, and the geometric means will generally add up to less than one, so we have to divide by the sum of the geometric means to normalize the distribution so that it adds up to one again.
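As a quick sketch of the two combination rules (plain Python; the numbers are just the made-up probabilities from the example above):

```python
import math

# Output distributions of the two models from the example above.
p_a = [0.3, 0.2, 0.5]
p_b = [0.1, 0.8, 0.1]

# Mixture: arithmetic mean of the probabilities.
mixture = [(a + b) / 2 for a, b in zip(p_a, p_b)]

# Product of the probabilities: geometric mean, which adds up to less
# than one, so we renormalize by dividing by its sum.
geometric = [math.sqrt(a * b) for a, b in zip(p_a, p_b)]
total = sum(geometric)
product = [g / total for g in geometric]
```

Because the factors multiply, one model assigning a small probability drags the combined probability down sharply in the product, while the mixture just splits the difference.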

Â 1:54

You'll notice that in a product, a small probability output by one model has veto power over the other models.

Now I want to describe an efficient way to average a large number of neural nets that gives us an alternative to doing the correct Bayesian thing. The alternative probably doesn't work quite as well as doing the correct Bayesian thing, but it's much more practical.

So consider a neural net with one hidden layer, shown on the right. Each time we present a training example to it, we randomly omit each hidden unit with a probability of 0.5. So we've crossed out three of the hidden units here, and we run the example through the net with those hidden units absent. What this means is that we're randomly sampling from 2^h architectures, where h is the number of hidden units. That's a huge number of architectures.
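Sampling one of those architectures amounts to drawing a fresh binary mask over the hidden units for each training case. A minimal sketch (NumPy; the activity values are just random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

h = 10                    # number of hidden units
n_architectures = 2 ** h  # 1,024 possible sub-networks for this layer

# One training case: omit each hidden unit with probability 0.5.
mask = rng.random(h) < 0.5

hidden_activities = rng.random(h)   # stand-in for the real hidden activities
dropped = hidden_activities * mask  # omitted units contribute exactly zero
```

Crucially, the mask selects which units fire, not which weights exist: every sampled architecture reads from the same shared weight matrices.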

Of course, all of these architectures share weights. That is, whenever we use a hidden unit, it's got the same weights as it's got in the other architectures. So we can think of dropout as a form of model averaging. We sample from these 2^h models, and most of the models, in fact, will never be sampled. A model that does get sampled only gets one training example. That's a very extreme form of bagging: the training sets are very different for the different models, but they're also very small.

The sharing of the weights between all the models means that each model is very strongly regularized by the others. And this is a much better regularizer than things like L2 or L1 penalties. Those penalties pull the weights towards zero, whereas by sharing weights with the other models, a model gets regularized by something that's going to tend to pull the weights towards the correct value.

The question still remains what we do at test time. We could sample many of the architectures, maybe a hundred, and take the geometric mean of their output distributions. But that would be a lot of work.
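That sampling scheme could be sketched like this, for a hypothetical single-hidden-layer net with softmax outputs (the ReLU hidden units and all the weights here are illustrative assumptions, not the lecture's actual net; the geometric mean is taken in log space):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical trained net: 4 inputs, 10 hidden units, 3 softmax outputs.
W = rng.standard_normal((4, 10)); b = np.zeros(10)   # input-to-hidden
V = rng.standard_normal((10, 3)); c = np.zeros(3)    # hidden-to-output
x = rng.standard_normal(4)

h = np.maximum(0.0, x @ W + b)   # hidden activities before any dropout

# Sample 100 architectures; averaging their log-probabilities gives
# the geometric mean of their output distributions.
n_samples = 100
log_geo = np.zeros(3)
for _ in range(n_samples):
    mask = rng.random(10) < 0.5
    log_geo += np.log(softmax((h * mask) @ V + c)) / n_samples
p = np.exp(log_geo)
p /= p.sum()   # renormalize so the distribution adds up to one again
```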

There's something much simpler we can do: we use all of the hidden units, but we halve their outgoing weights, so that they have the same expected effect as they did when we were sampling. It turns out that using all of the hidden units with half their outgoing weights exactly computes the geometric mean of the predictions of all 2^h models, provided we're using a softmax output group. If we have more than one hidden layer, we can simply use dropout of 0.5 in every layer.

Â 5:10

We could run lots of stochastic models with dropout and then average across those stochastic models. That would have one advantage over the mean net: it would give us an idea of the uncertainty in the answer.

What about the input layer? Well, we can use the same trick there, too. We use dropout on the inputs, but with a higher probability of keeping an input. This trick is already used in a system called a denoising autoencoder, developed by Pascal Vincent, Hugo Larochelle and Yoshua Bengio at the University of Montreal, and it works very well.

So, how well does dropout work?

Well, the record-breaking object recognition net developed by Alex Krizhevsky would have broken the record even without dropout, but it broke it by a lot more using dropout. In general, if you have a deep neural net and it's overfitting, dropout will typically reduce the number of errors by quite a lot. I think any net that requires early stopping in order to prevent it overfitting would do better by using dropout. It would, of course, take longer to train, and it might need more hidden units. If you've got a deep neural net and it's not overfitting, you should probably be using a bigger one and using dropout, assuming you have enough computational power.

There's another way to think about dropout, which is how I originally arrived at the idea. You'll see it's a bit related to mixtures of experts, and to what goes wrong when all the experts cooperate: what's preventing specialization?

If a hidden unit knows which other hidden units are present, it can co-adapt to the other hidden units on the training data. What that means is, the real signal training a hidden unit is: try to fix up the error that's left over when all the other hidden units have had their say. That's what's being back-propagated to train the weights of each hidden unit. Now, that's going to cause complex co-adaptations between the hidden units, and these are likely to go wrong when there's a change in the data. If you rely on a complex co-adaptation to get things right on the training data, it's quite likely not to work nearly so well on new test data.

It's like the idea that a big, complex conspiracy involving lots of people is almost certain to go wrong, because there are always things you didn't think of, and if there's a large number of people involved, one of them will behave in an unexpected way, and then the others will be doing the wrong thing. If you want conspiracies, it's much better to have lots of little conspiracies. Then, when unexpected things happen, many of the little conspiracies will fail, but some of them will still succeed.

So, by using dropout, we force a hidden unit to work with combinatorially many other sets of hidden units. That makes it much more likely to do something that's individually useful, rather than something that's only useful because of the way particular other hidden units are collaborating with it. But it's also going to tend to do something that's individually useful and different from what the other hidden units do: it needs to be something that's marginally useful, given what its co-workers tend to achieve. And I think this is what gives nets with dropout their very good performance.
