0:00

Very, very deep neural networks are difficult to train because of vanishing and exploding gradient problems.

In this video, you learn about skip connections, which allow you to take the activation from one layer and suddenly feed it to another layer, even much deeper in the neural network. And using that, you're going to build ResNets, which enable you to train very, very deep networks, sometimes even networks of over 100 layers.

Let's take a look.

ResNets are built out of something called a residual block. Let's first describe what that is.

Here are two layers of a neural network, where you start off with some activation a[l], then you go to a[l+1], and the activation two layers later is a[l+2].

So to go through the steps of this computation, you have a[l], and the first thing you do is apply the linear operator to it: z[l+1] = W[l+1] a[l] + b[l+1].

Â 1:08

After that, you apply the ReLU nonlinearity to get a[l+1], governed by the equation a[l+1] = g(z[l+1]).

Then in the next layer, you apply the linear step again: z[l+2] = W[l+2] a[l+1] + b[l+2], which is quite similar to the equation we saw on the left.

Â 1:38

And then finally, you apply another ReLU operation, governed by a[l+2] = g(z[l+2]), where g here is the ReLU nonlinearity. And this gives you a[l+2].

So in other words, for the information from a[l] to flow

Â 2:02

to a[l+2], it needs to go through all of these steps, which I'm going to call the main path of this set of layers.
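
As a rough sketch, the main path described above can be written out in code. This is an illustrative NumPy version, not code from the lecture; the names W1, b1, W2, b2 stand in for W[l+1], b[l+1], W[l+2], b[l+2]:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Main path from a[l] to a[l+2]: linear -> ReLU -> linear -> ReLU.
# W1, b1, W2, b2 stand in for W[l+1], b[l+1], W[l+2], b[l+2].
def main_path(a_l, W1, b1, W2, b2):
    z1 = W1 @ a_l + b1   # z[l+1] = W[l+1] a[l] + b[l+1]
    a1 = relu(z1)        # a[l+1] = g(z[l+1])
    z2 = W2 @ a1 + b2    # z[l+2] = W[l+2] a[l+1] + b[l+2]
    return relu(z2)      # a[l+2] = g(z[l+2])
```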

In a residual net, we're going to make a change to this. I'm going to take a[l] and just fast-forward it, copy it, much further into the neural network, and just add a[l] before applying the ReLU nonlinearity. I'm going to call this the shortcut.

So rather than needing to follow the main path, the information from a[l] can now follow a shortcut to go much deeper into the neural network.

And what that means is that this last equation goes away, and we instead have that the output a[l+2] is the ReLU nonlinearity g applied to z[l+2] as before, but now plus a[l]: a[l+2] = g(z[l+2] + a[l]). The addition of this a[l] is what makes this a residual block.
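
Continuing the illustrative NumPy sketch (same stand-in weight names as before), the only change from the main path is adding a[l] before the final ReLU. This assumes z[l+2] and a[l] have the same dimension so the addition is valid:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Residual block: the same main path, but a[l] is added to z[l+2]
# before the final ReLU. Assumes a_l and z[l+2] have equal shape.
def residual_block(a_l, W1, b1, W2, b2):
    z1 = W1 @ a_l + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a_l)   # a[l+2] = g(z[l+2] + a[l])
```

With all weights and biases zero, the main path contributes nothing and the block reduces to relu(a[l]), which hints at why the shortcut makes these blocks easy to optimize.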

And in pictures, you can also modify the picture on top by drawing the shortcut going into this second layer, because the shortcut is actually added before the ReLU nonlinearity. Each of these nodes applies a linear function and then a ReLU, so a[l] is being injected after the linear part but before the ReLU part.

And sometimes, instead of the term shortcut, you also hear the term skip connection. That refers to a[l] just skipping over a layer, or kind of skipping over almost two layers, in order to pass this information deeper into the neural network.

What the inventors of ResNet, Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, found was that using residual blocks allows you to train much deeper neural networks. And the way you build a ResNet is by taking many of these residual blocks and stacking them together to form a deep network.

So let's look at this network. This is not a residual network; this is called a plain network, in the terminology of the ResNet paper. To turn it into a ResNet, what you do is add all those skip connections, or shortcut connections, like so.

Â 4:36

So every two layers ends up with the additional change that we saw on the previous slide, turning each pair into a residual block. This picture shows five residual blocks stacked together, and this is a residual network.
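
As an illustrative sketch of that stacking, a residual network is just repeated application of the block; the width n and the random weights here are assumptions for the example, and keeping the width constant means the shortcut can be added directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # assumed constant width, so shortcuts can be added directly

def relu(z):
    return np.maximum(0, z)

def residual_block(a, W1, b1, W2, b2):
    # a[l+2] = g(W[l+2] g(W[l+1] a[l] + b[l+1]) + b[l+2] + a[l])
    return relu(W2 @ relu(W1 @ a + b1) + b2 + a)

# Stack five residual blocks, as in the figure.
a = rng.normal(size=n)
for _ in range(5):
    W1 = rng.normal(size=(n, n)) * 0.1
    W2 = rng.normal(size=(n, n)) * 0.1
    a = residual_block(a, W1, np.zeros(n), W2, np.zeros(n))
print(a.shape)  # prints (4,)
```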

And it turns out that if you use a standard optimization algorithm such as gradient descent, or one of the fancier optimization algorithms, to train a plain network, without all the extra shortcut or skip connections I just drew in, then empirically you find that as you increase the number of layers, the training error will tend to decrease for a while, but then it'll tend to go back up.

Â 5:24

In theory, as you make a neural network deeper, it should only do better and better on the training set, so in theory, having a deeper network should only help. But in practice, having a plain network that's very deep means that your optimization algorithm just has a much harder time training, and so, in reality, your training error gets worse if you pick a network that's too deep. But what happens with ResNets is that even as the number of layers gets deeper, you can have the training error keep on going down.

That holds even if you train a network with over 100 layers, and some people are now experimenting with networks of over 1,000 layers, although I don't see that used much in practice yet. But by taking these activations, be it x or these intermediate activations, and allowing them to go much deeper in the neural network, this really helps with the vanishing and exploding gradient problems and allows you to train much deeper neural networks without really an appreciable loss in performance.

And maybe at some point this will plateau, this will flatten out, and going deeper and deeper won't help that much. But ResNets have certainly been effective at helping train very deep networks.

So you've now gotten an overview of how ResNets work. And in fact, in this week's programming exercise, you get to implement these ideas and see them work for yourself. But next, I want to share with you better intuition, or even more intuition, about why ResNets work so well.

Let's go on to the next video.
