0:00

In this video, you'll learn about some of

the classic neural network architectures, starting with LeNet-5,

and then AlexNet, and then VGGNet. Let's take a look.

Here is the LeNet-5 architecture.

You start off with an image which is, say,

32 by 32 by 1.

And the goal of LeNet-5 was to recognize handwritten digits,

so maybe an image of a digit like that.

And LeNet-5 was trained on grayscale images,

which is why it's 32 by 32 by 1.

This neural network architecture is actually quite

similar to the last example you saw last week.

In the first step,

you use a set of six

5 by 5 filters with a stride of one. Because you use

six filters, you end up with a 28 by 28 by 6 volume over there.

And with a stride of one and no padding,

the image dimensions reduce from 32 by 32 down to 28 by 28.
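As an aside, that 32-to-28 shrinkage follows the standard output-size formula for convolutions (and pooling); here's a small sketch (the helper name is my own, not from the lecture):

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer:
    floor((n + 2*padding - f) / stride) + 1."""
    return (n + 2 * padding - f) // stride + 1

# LeNet-5's first layer: 32x32 input, 5x5 filter, stride 1, no padding
print(conv_output_size(32, 5))            # -> 28
# Its pooling layers: filter width 2, stride 2, halving 28 down to 14
print(conv_output_size(28, 2, stride=2))  # -> 14
```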

Then the LeNet neural network applies pooling.

And back then when this paper was written,

people used average pooling much more.

If you're building a modern variant,

you'd probably use max pooling instead.

But in this example,

you average pool, and with a filter width of two and a stride of two,

you wind up reducing the dimensions,

the height and width, by a factor of two,

so we now end up with a 14 by 14 by 6 volume.

I guess the height and width of these volumes aren't entirely drawn to scale.

Technically, if I were drawing these volumes to scale,

the height and width would shrink by a factor of two.

Next, you apply another convolutional layer.

This time you use a set of 16 filters

that are 5 by 5, so you end up with 16 channels in the next volume.

And back when this paper was written in 1998,

people didn't really use padding; you always used valid convolutions,

which is why every time you apply a convolutional layer,

the height and width shrink.

So that's why, here,

you go from 14 by 14 down to 10 by 10.

Then another pooling layer,

so that reduces the height and width by a factor of two,

then you end up with 5 by 5 over here.

And if you multiply all these numbers, 5 by 5 by 16,

this multiplies up to 400.

That's 25 times 16, which is 400.

And the next layer is then a fully connected layer that fully connects each of

these 400 nodes with every one of 120 neurons,

so there's a fully connected layer.

And sometimes, you would draw out explicitly

a layer with 400 nodes; I'm skipping that here.

There's a fully connected layer and then another fully connected layer.

And then the final step is it takes

these 84 features and uses them for one final output.

I guess you could draw one more node here to make a prediction for ŷ.

And ŷ took on 10 possible values

corresponding to recognizing each of the digits from 0 to 9.

A modern version of this neural network

would use a softmax layer with a 10-way classification output.

Although back then, LeNet-5 actually used a different classifier at the output layer,

one that's rarely used today.

So this neural network was small by modern standards;

it had about 60,000 parameters.
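To sanity-check that 60,000 figure, here's a rough count for a modern variant of the layers above, assuming full channel connectivity in the second conv layer and a 10-way output (the 1998 paper differs in these details):

```python
# Rough parameter count for a modern LeNet-5 variant.
def conv_params(f, channels_in, channels_out):
    # each filter is f x f x channels_in, plus one bias per filter
    return channels_out * (f * f * channels_in + 1)

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

total = (conv_params(5, 1, 6)      # conv1: 32x32x1 -> 28x28x6
         + conv_params(5, 6, 16)   # conv2: 14x14x6 -> 10x10x16
         + fc_params(400, 120)     # fc1: 5*5*16 = 400 -> 120
         + fc_params(120, 84)      # fc2: 120 -> 84
         + fc_params(84, 10))      # output: 84 -> 10
print(total)  # -> 61706, i.e. about 60,000 parameters
```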

And today, you often see neural networks with

anywhere from 10 million to 100 million parameters,

and it's not unusual to see networks that are

literally about a thousand times bigger than this network.

But one thing you do see is that as you go deeper in a network,

so as you go from left to right,

the height and width tend to go down.

So you went from 32 by 32, to 28, to 14,

to 10, to 5, whereas the number of channels does increase.

It goes from 1 to 6 to 16 as you go deeper into the layers of the network.

One other pattern you see in this neural network that's still often repeated today is

that you might have one or more conv layers followed by a pooling layer,

and then one or sometimes more than one conv layer followed by a pooling layer,

and then some fully connected layers, and then the outputs.

So this type of arrangement of layers is quite common.

Now finally, this is maybe only for those of you that want to try reading the paper.

There are a couple of other things that were different.

For the rest of this slide,

I'm going to make a few more advanced comments,

only for those of you that want to try to read this classic paper.

And so, everything I'm going to write in red,

you can safely skip on the slide;

it's maybe an interesting historical footnote

that is okay if you don't follow fully.

So it turns out that if you read the original paper, back then,

people used sigmoid and tanh nonlinearities,

and people weren't using ReLU nonlinearities back then.

So if you look at the paper, you see sigmoid and tanh referred to.

And there are also some funny ways in which

this network was wired that look odd by modern standards.

So for example, you've seen how if you have an n_H by n_W by n_C volume

with n_C channels, then you use an f by f by n_C dimensional filter,

where everything looks at every one of these channels.

But back then, computers were much slower.

And so to save on computation as well as some parameters,

the original LeNet-5 had some crazy complicated way

where different filters would look at different channels of the input block.

And so the paper talks about those details,

but a more modern implementation wouldn't have that type of complexity these days.

And then one last thing that was done back then, I guess, but isn't really done right

now is that the original LeNet-5 had a non-linearity after pooling,

and I think it actually uses a sigmoid non-linearity after the pooling layer.

So if you do read this paper,

this is one of the harder ones to read compared to

the ones we'll go over in the next few videos;

the next one might be an easier one to start with.

Most of the ideas on this slide come from sections two and three of the paper,

and later sections of the paper talk about some other ideas.

They talk about something called the graph transformer network,

which isn't widely used today.

So if you do try to read this paper,

I recommend focusing really on section two, which talks about this architecture,

and maybe taking a quick look at section three,

which has a bunch of experiments and results, which is pretty interesting.

The second example of a neural network I want to show you is AlexNet,

named after Alex Krizhevsky, who was the first author of the paper describing this work.

The other authors were Ilya Sutskever and Geoffrey Hinton.

So, AlexNet input starts with 227 by 227 by 3 images.

And if you read the paper,

the paper refers to 224 by 224 by 3 images.

But if you look at the numbers,

I think that the numbers make sense only if it's actually 227 by 227.

And then the first layer applies a set of 96 11 by 11 filters with a stride of four.

And because it uses a large stride of four,

the dimensions shrink to 55 by 55.

So roughly, going down by a factor of four because of the large stride.

And then it applies max pooling with a 3 by 3 filter,

so f equals three, and a stride of two.

So this reduces the volume to 27 by 27 by 96,

and then it performs a 5 by 5 same convolution,

same padding, so you end up with 27 by 27 by 256.

Max pooling again; this reduces the height and width to 13.

And then another same convolution, so same padding.

So it's 13 by 13 by, now, 384 filters.

And then 3 by 3, same convolution again, gives you that.

Then 3 by 3, same convolution, gives you that.

Then max pool brings it down to 6 by 6 by 256.

If you multiply all these numbers, 6 times 6 times 256, that's 9,216.

So we're going to unroll this into 9,216 nodes.

And then finally, it has a few fully connected layers.

And then finally, it uses a softmax to output

which one of 1,000 classes the object could be.

So this neural network actually had a lot of similarities to LeNet,

but it was much bigger.

So whereas the LeNet-5 from the previous slide had about 60,000 parameters,

this AlexNet had about 60 million parameters.
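If you're curious where that 60 million figure comes from, here's a rough count over the layers just described, viewed as a single network (ignoring the original two-GPU split, so exact totals in the paper may differ slightly):

```python
# Rough parameter count for AlexNet as described above.
def conv_params(f, c_in, c_out):
    # each filter is f x f x c_in, plus one bias per filter
    return c_out * (f * f * c_in + 1)

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

total = (conv_params(11, 3, 96)      # 227x227x3 -> 55x55x96, stride 4
         + conv_params(5, 96, 256)   # 27x27x96 -> 27x27x256, same conv
         + conv_params(3, 256, 384)  # 13x13x256 -> 13x13x384, same conv
         + conv_params(3, 384, 384)  # 13x13x384 -> 13x13x384, same conv
         + conv_params(3, 384, 256)  # 13x13x384 -> 13x13x256, same conv
         + fc_params(6 * 6 * 256, 4096)  # unroll 9,216 nodes -> 4096
         + fc_params(4096, 4096)
         + fc_params(4096, 1000))        # softmax over 1,000 classes
print(total)  # about 62 million parameters
```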

And the fact that they could take

pretty similar basic building blocks but

have a lot more hidden units and train on a lot more data,

they trained on the ImageNet dataset,

allowed it to have just remarkable performance.

Another aspect of this architecture that made it much

better than LeNet was using the ReLU activation function.

And then again, just if you read the paper, there are

some more advanced details that you don't really need

to worry about if you don't read the paper. One is that,

when this paper was written,

GPUs were still a little bit slower,

so it had a complicated way of training on two GPUs.

And the basic idea was that

a lot of these layers were actually split across two different GPUs, and there was

a thoughtful way for when the two GPUs would communicate with each other.

The original AlexNet architecture also had another type of layer

called a Local Response Normalization.

And this type of layer isn't really used much,

which is why I didn't talk about it.

But the basic idea of Local Response Normalization is,

if you look at one of these blocks,

one of these volumes that we have on top,

let's say for the sake of argument, this one,

13 by 13 by 256,

what Local Response Normalization

(LRN) does is you look at one position,

so one position in height and width,

and look down across all the channels,

look at all 256 numbers, and normalize them.

And the motivation for this Local Response Normalization was that for

each position in this 13 by 13 image,

maybe you don't want too many neurons with a very high activation.
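A minimal sketch of the simplified idea described here, damping all the channels at one position by their summed squares (the paper's actual LRN only uses a window of neighboring channels; the constants below follow the paper, but the function itself is just illustrative, not the original formulation):

```python
def lrn_at_position(channel_values, k=2.0, alpha=1e-4, beta=0.75):
    """Simplified local response normalization at one (height, width)
    position: the larger the total activation across channels, the
    more each individual activation is scaled down."""
    s = sum(v * v for v in channel_values)
    return [v / (k + alpha * s) ** beta for v in channel_values]

activations = [0.5] * 256             # one position across 256 channels
normalized = lrn_at_position(activations)
```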

But subsequently, many researchers have found that this doesn't help that much, so this is

one of those ideas I guess I'm drawing in red

because it's less important for you to understand.

And in practice, you don't really see

local response normalization used in the networks that are trained today.

So if you are interested in the history of deep learning,

I think even before AlexNet,

deep learning was starting to gain traction in speech recognition and a few other areas,

but it was really this paper that convinced a lot of

the computer vision community to take a serious look at

deep learning, and convinced them that deep learning really works in computer vision.

And then it grew on to have a huge impact not

just in computer vision but beyond computer vision as well.

And if you want to try reading some of these papers

yourself, and you really don't have to for this course,

but if you want to try reading some of these papers,

this one is one of the easier ones to read, so this might be a good one to take a look at.

So whereas AlexNet had a relatively complicated architecture,

there are just a lot of hyperparameters, right?

You have all these numbers

that Alex Krizhevsky and his co-authors had to come up with.

Let me show you a third and final example in this video, called the VGG or VGG-16 network.

And a remarkable thing about the VGG-16 net is that they said,

instead of having so many hyperparameters,

let's use a much simpler network where you focus on just having conv layers

that are just three-by-three filters with a stride of one, and always use same padding.

And make all your max pooling layers two-by-two with a stride of two.

And so, one very nice thing about

the VGG network was that it really simplified these neural network architectures.

So, let's go through the architecture.

So, you start off with an image, and then the first two layers are convolutions,

which are therefore these three-by-three filters.

And the first two layers use 64 filters.

You end up with 224 by 224, because you're using same convolutions, and then with 64 channels.

So because VGG-16 is a relatively deep network,

I'm going to not draw all the volumes here.

So what this little picture denotes is what we would previously have

drawn as this 224 by 224 by 3,

and then a convolution that results in, I guess, a 224

by 224 by 64, which would be drawn as a deeper volume,

and then another layer that results in 224 by 224 by 64.

So this conv 64 times two represents that you're doing two conv layers with 64 filters.

And as I mentioned earlier,

the filters are always three-by-three

with a stride of one, and they are always same convolutions.

So rather than drawing all these volumes,

I'm just going to use text to represent this network.

Next, it then uses a pooling layer,

so the pooling layer will reduce.

I think it goes from 224 by 224 down to what?

Right. Goes to 112 by 112 by 64.

And then it has a couple more conv layers.

So this means it has 128 filters, and because these are same convolutions,

let's see what is the new dimension.

Right? It will be 112 by 112 by 128, and then

a pooling layer, so you can figure out what's the new dimension of that.

And now, three conv layers with

256 filters, then the pooling layer, and then a few more conv layers,

pooling layer, more conv layers, pooling layer.

And then it takes this final 7 by 7 by 512 volume and feeds it into a fully connected layer,

fully connected with 4,096

units, and then a softmax output over one of a thousand classes.
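The whole VGG-16 pattern just described can be traced in a few lines, including the parameter total (a sketch under the layer configuration above; helper names are my own):

```python
# Shape and parameter sketch for VGG-16: all convs are 3x3, stride 1,
# same padding; all max pools are 2x2, stride 2.
def conv_params(c_in, c_out, f=3):
    return c_out * (f * f * c_in + 1)  # +1 for bias per filter

# (number of conv layers, number of filters) for each block
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

size, channels, total = 224, 3, 0
for n_convs, n_filters in blocks:
    for _ in range(n_convs):
        total += conv_params(channels, n_filters)
        channels = n_filters          # same conv keeps height/width
    size //= 2                        # 2x2 max pool, stride 2 halves it

# size is now 7, channels 512: flatten 7*7*512 and add the FC layers
for n_in, n_out in [(size * size * channels, 4096),
                    (4096, 4096), (4096, 1000)]:
    total += n_in * n_out + n_out
print(total)  # about 138 million parameters
```

Note how the height halves at each pool (224, 112, 56, 28, 14, 7) while the filter count doubles per block until 512, which is exactly the pattern discussed below.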

By the way, the 16 in VGG-16

refers to the fact that this has 16 layers that have weights.

And this is a pretty large network;

this network has a total of about 138 million parameters.

And that's pretty large even by modern standards.

But the simplicity of the VGG-16 architecture made it quite appealing.

You can tell this architecture is really quite uniform.

There are a few conv layers followed by a pooling layer,

which reduces the height and width, right?

So the pooling layers reduce the height and width.

You have a few of them here.

But then also, if you look at the number of filters in the conv layers,

here you have 64 filters, and then you double to 128, double to 256, double to 512.

And then I guess the authors thought 512 was big enough and didn't double again here.

But this sort of roughly doubling on every step,

or doubling through every stack of conv layers, was

another simple principle used to design the architecture of this network.

And so I think the relative uniformity of

this architecture made it quite attractive to researchers.

The main downside was that it was

a pretty large network in terms of the number of parameters you had to train.

And if you read the literature,

you sometimes see people talk about the VGG-19,

which is an even bigger version of this network.

You can see the details in the paper cited at

the bottom, by Karen Simonyan and Andrew Zisserman.

But because VGG-16 does almost as well as VGG-19,

a lot of people will use VGG-16.

But the thing I liked most about this was that

it made this pattern clear of how,

as you go deeper, the height and width go down;

it just goes down by a factor of two each time for

the pooling layers, whereas the number of channels increases,

and here roughly goes up by a factor of two every time you have a new set of conv layers.

So by making the rate at which it goes down and that goes up very systematic,

I thought this paper was very attractive from that perspective.

So that's it for the three classic architectures.

If you want, you should really now read some of these papers.

I recommend starting with the AlexNet paper, followed by the VGG net paper, and

then the LeNet paper, which is a bit harder to

read, but it is a good classic once you go over that.

But next, let's go beyond these classic networks and look at some even more advanced,

even more powerful neural network architectures. Let's go on to the next video.
