When designing a layer for a ConvNet, you might have to pick,
do you want a 1 by 1 filter,
or 3 by 3, or 5 by 5,
or do you want a pooling layer?
What the inception network does is it says,
why shouldn't you do them all?
And this makes the network architecture more complicated,
but it also works remarkably well.
Let's see how this works.
Let's say for the sake of example that your input is a
28 by 28 by 192 dimensional volume.
So what the inception network or what an inception layer says is,
instead of choosing what filter size you want in a Conv layer,
or even whether you want a convolutional layer or a pooling layer,
let's do them all. So what if you use a 1 by 1 convolution,
and that will output a 28 by 28 by something.
Let's say 28 by 28 by 64 output,
and you just have a volume there.
But maybe you also want to try a 3 by 3, and that might output a 28 by 28 by 128.
And then what you do is just stack up this second volume next to the first volume.
And to make the dimensions match up,
let's make this a same convolution.
So the output dimension is still 28 by 28,
same as the input dimension in terms of height and width.
So that's 28 by 28 by, in this example, 128.
And maybe you might say well I want to hedge my bets.
Maybe a 5 by 5 filter works better.
So let's do that too and have that output a 28 by 28 by 32.
And again you use the same convolution to keep the dimensions the same.
And maybe you don't want a convolutional layer.
Let's apply pooling, and that has some other output and let's stack that up as well.
And here pooling outputs 28 by 28 by 32.
Now in order to make all the dimensions match,
you actually need to use padding for max pooling.
So this is an unusual form of pooling, because if you want
the input to have a height and width of 28 by 28 and have the output
match the dimensions of everything else, also 28 by 28,
then you need to use same padding as well as a stride of one for pooling.
So this detail might seem a bit funny to you now,
but let's keep going.
And we'll make this all work later.
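To make the padding point concrete, here is a small sketch of the usual output-size arithmetic for a convolution or pooling window. The helper names `output_size` and `same_pad` are made up for this illustration, and the 3 by 3 pooling window is an assumption, since the window size isn't stated here.

```python
def output_size(n, f, pad, stride):
    """Height (or width) of the output for an n-wide input and an f-wide window."""
    return (n + 2 * pad - f) // stride + 1


def same_pad(f):
    """Padding that keeps the size unchanged at stride 1 (odd window sizes only)."""
    return (f - 1) // 2


n = 28  # input height and width from the example

# Every branch uses same padding and a stride of one.
branch_sizes = {
    "1x1 conv": output_size(n, 1, same_pad(1), 1),
    "3x3 conv": output_size(n, 3, same_pad(3), 1),
    "5x5 conv": output_size(n, 5, same_pad(5), 1),
    # Assumed 3x3 pooling window; the transcript does not give the size.
    "3x3 max pool": output_size(n, 3, same_pad(3), 1),
}

# For contrast: a common 2x2 pooling with stride 2 and no padding shrinks the volume.
typical_pool = output_size(n, 2, 0, 2)

print(branch_sizes)   # every branch stays at 28
print(typical_pool)   # 14
```

With same padding and a stride of one, all four branches keep the 28 by 28 height and width, which is what lets you stack them up.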
But with an inception module like this,
you can input some volume and output another.
In this case I guess if you add up all these numbers,
32 plus 32 plus 128 plus 64,
that's equal to 256.
So you will have one inception module input 28 by 28 by 192,
and output 28 by 28 by 256.
And this is the heart of the inception network which is due
to Christian Szegedy, Wei Liu,
Yangqing Jia, Pierre Sermanet,
Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke and Andrew Rabinovich.
And the basic idea is that instead of
you needing to pick one of these filter sizes or pooling you want and committing to that,
you can do them all and just concatenate all the outputs,
and let the network learn whatever parameters it wants to use,
whatever combinations of these filter sizes it wants.
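As a rough sketch of the stacking step, the snippet below only tracks shapes, not actual activations. The helper `concat_channels` is a hypothetical name for this illustration, and shapes are written as (height, width, channels).

```python
def concat_channels(shapes):
    """Shape you get by stacking (height, width, channels) volumes along channels."""
    heights_widths = {(h, w) for h, w, _ in shapes}
    if len(heights_widths) != 1:
        raise ValueError("all branches must share the same height and width")
    h, w = next(iter(heights_widths))
    return (h, w, sum(c for _, _, c in shapes))


# Branch outputs from the example, all 28 by 28 thanks to same padding.
branch_shapes = {
    "1x1 conv": (28, 28, 64),
    "3x3 conv": (28, 28, 128),
    "5x5 conv": (28, 28, 32),
    "max pool": (28, 28, 32),
}

module_shape = concat_channels(list(branch_shapes.values()))
print(module_shape)  # (28, 28, 256)
```

If one branch came out at a different height or width, the helper would raise an error, which is the reason every branch uses same padding.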
Now it turns out that there is a problem
with the inception layer as we've described it here,
which is computational cost.
On the next slide,
let's figure out what's the computational cost of this 5 by 5 filter resulting
in this block over here.
So just focusing on the 5 by 5 part on the previous slide,
we had as input a 28 by 28 by 192 block,
and you implement a 5 by 5 same convolution of 32 filters to output 28 by 28 by 32.
On the previous slide I had drawn this as a thin purple slice.
So I'm just going to draw this as a more normal looking blue block here.
So let's look at the computational cost of outputting this 28 by 28 by 32.
So you have 32 filters because the output has 32 channels,
and each filter is going to be 5 by 5 by 192.
And so the output size is 28 by 28 by 32,
and so you need to compute 28 by 28 by 32 numbers.
And for each of them you need to do this many multiplications, right?
5 by 5 by 192.
So the total number of multiplies you need
is the number of multiplies you need to compute each
of the output values times the number of output values you need to compute.
And if you multiply all of these numbers,
this is equal to about 120 million.
And so, while you can do 120 million multiplies on a modern computer,
this is still a pretty expensive operation.
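The arithmetic just described can be checked with a few lines. This is only a sketch that counts multiplications and ignores additions and bias terms; `conv_multiplies` is a made-up helper name.

```python
def conv_multiplies(out_h, out_w, n_filters, f, in_channels):
    """Multiplications for an f-by-f convolution producing out_h x out_w x n_filters."""
    per_output_value = f * f * in_channels       # 5 by 5 by 192 in the example
    n_output_values = out_h * out_w * n_filters  # 28 by 28 by 32 in the example
    return per_output_value * n_output_values


direct_cost = conv_multiplies(28, 28, 32, 5, 192)
print(f"{direct_cost:,}")  # 120,422,400
```

So "120 million" is a rounded figure; the exact product is a little over that.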
On the next slide you see how using the idea of 1 by 1 convolutions,
which you learnt about in the previous video,
you'll be able to reduce the computational costs by about a factor of 10,
going from about 120 million multiplies to about one tenth of that.
So please remember the number 120 so you can compare it
with what you see on the next slide, 120 million.
Here is an alternative architecture for inputting 28 by 28 by 192,
and outputting 28 by 28 by 32, which is the following.
You are going to input the volume,
use a 1 by 1 convolution to reduce the volume to 16 channels instead of 192 channels,
and then on this much smaller volume,
run your 5 by 5 convolution to give you your final output.
So notice the input and output dimensions are still the same.
You input 28 by 28 by 192 and output 28 by 28 by 32,
same as the previous slide.
But what we've done is we've taken this huge volume we had on the left,
and we've shrunk it to this much smaller intermediate volume,
which only has 16 instead of 192 channels.
Sometimes this is called a bottleneck layer, right?
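To see roughly where the factor of 10 comes from, here is a sketch that counts the multiplications for the two steps of the bottleneck version and compares them with the direct 5 by 5 convolution. It uses the same counting rule as before (multiplications only, no additions or biases), and `conv_multiplies` is again a made-up helper name.

```python
def conv_multiplies(out_h, out_w, n_filters, f, in_channels):
    """Multiplications for an f-by-f convolution producing out_h x out_w x n_filters."""
    return (f * f * in_channels) * (out_h * out_w * n_filters)


# Direct route: 5x5 same convolution, 192 channels in, 32 channels out.
direct_cost = conv_multiplies(28, 28, 32, 5, 192)

# Bottleneck route: 1x1 convolution down to 16 channels, then 5x5 up to 32.
reduce_cost = conv_multiplies(28, 28, 16, 1, 192)
expand_cost = conv_multiplies(28, 28, 32, 5, 16)
bottleneck_cost = reduce_cost + expand_cost

print(f"{reduce_cost:,}")      # 2,408,448
print(f"{expand_cost:,}")      # 10,035,200
print(f"{bottleneck_cost:,}")  # 12,443,648
print(direct_cost / bottleneck_cost)  # a bit under 10
```

Most of the remaining cost is in the 5 by 5 step, but it now runs over 16 input channels instead of 192.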