Why do we need pooling? Pooling is sub-sampling. Pooling filters usually have a size of two-by-two, or sometimes four-by-four, so integer multiples of two, and they reduce each spatial dimension by that factor. For example, if I had an n-by-n image and a two-by-two pooling filter, it becomes roughly n/2 by n/2 after the pooling operation. There are many other kinds, but these two are the most popular: max pooling and average pooling.

Max pooling means this: here is the pooling filter, and let's say the image patch it overlaps has the values 4, 8, 16, 10. The pooling operation looks at those four pixels, picks the one with the maximum value, and assigns it as the representative value, so the representative value here is 16. That's max pooling. What if I do average pooling instead? I just take the average: 4 + 8 + 16 + 10 = 38, divided by 4, is 9.5. That's average pooling. No matter what the filter size is (this is a two-by-two filter, by the way), max pooling just picks the maximum value and assigns it to one pixel, and average pooling takes the average and assigns it to one pixel. That's what the pooling operation is doing.

As an example, here's a pixelated Super Mario image, and I'm going to do max pooling on it. In RGB terms white is 255 and so on, but for simplicity I'm going to reverse the scale: let's say white is zero, this is one, and the value increases as the color gets darker. When I place a max pooling filter like this, what should be the representative value of these four blocks? They are all white, so I get white. What about here? The answer differs between max pooling and average pooling: the average here would be close to white, so I could put a white value or something similar, but because it's max pooling I'm going to choose red, so my representative value here is exactly red. What does the next one look like? By the way, I'm jumping two blocks at a time, so stride equals two; usually the stride is omitted because it defaults to the filter size. The next pixel is red, this is red, this is red, and this is going to be white, this is white. Brown has the highest intensity here, so brown, brown, this color, black, this color, white, white, brown, brown, and so on. You can keep applying max pooling and it gives this; easy. Let's say blue is a stronger color, a higher number, than red, so I have blue, blue, red, red, this one red, blue, blue, blue, red. That takes a while, but now I have a smaller image. Although I drew it at the same scale, each block here corresponds to one small pixel, so if you scale it correctly, the feature map would be this small when the original image is this big. It's half the width and half the height of the original, and it looks like this: a little more sub-sampled. That was max pooling.

Here is another way to shrink the spatial dimension, and it's different from pooling. Pooling gives you a smaller width and height by sub-sampling. There's another way, and oftentimes people prefer using convolutions to shrink the feature map because a convolution can do more than just sub-sampling: you can utilize the stride. Remember that if you use stride equals two, you can shrink the feature map size in half, like this. Here's an exercise: you can pause the video and try to draw on your own what this input image should look like after the convolution operation with the three-by-three filters.
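To make the max-versus-average distinction concrete, here is a minimal NumPy sketch of the pooling operation. The function name pool2d and the 4-by-4 test image are my own illustration, not from the lecture, but the top-left patch holds the example values 4, 8, 16, 10:

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Pooling with a size-by-size filter and stride equal to the filter size."""
    h, w = image.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            patch = image[i:i + size, j:j + size]
            out[i // size, j // size] = patch.max() if mode == "max" else patch.mean()
    return out

# The top-left 2x2 patch holds the values from the example: 4, 8, 16, 10.
img = np.array([[ 4,  8,  1,  2],
                [16, 10,  3,  4],
                [ 5,  6,  7,  8],
                [ 9, 10, 11, 12]], dtype=float)

print(pool2d(img, mode="max"))   # top-left entry is 16  (max pooling)
print(pool2d(img, mode="avg"))   # top-left entry is 9.5 (average pooling)
```

Either way, a 4-by-4 input comes out as 2-by-2: half the width and half the height, exactly as in the Super Mario example.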
Let's say we have only one filter, just one, with stride equal to two, and there is no padding. What should your feature map look like? Pause the video, think about it, and solve it for a few seconds or minutes; then I'll continue. We said there is no padding and the convolution filter is three-by-three, which means my convolution filter looks like this, but it should also have depth equal to three for the RGB channels. This is my filter, and I'm going to calculate the value for the middle pixel. One application of the filter gives one value, one small cube. Then, because the stride is two, I go two steps to the right and compute a new value. I have two here, then three here, so the width of my output feature map is going to be three. What about the height? 1, 2, 3, so it's also going to be three pixels for the height, three pixels for the width. This is going to be my output. What about the depth? We said there is only one filter, so that means I have only one channel, one value along the depth axis. This is what my feature map looks like. If I had two filters, then I would have another layer stacked on top of the first one, like this. That's a simple calculation. This way you can use a convolution layer to shrink your feature map size.
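Here is a minimal sketch of this strided convolution in code. The choice of PyTorch and the 7-by-7 input size are my assumptions rather than something stated in the lecture, but a 7-by-7 input is consistent with the 3-by-3 output described above, since (7 - 3) / 2 + 1 = 3:

```python
import torch
import torch.nn as nn

# One 3x3 filter over an RGB input (depth 3), stride 2, no padding.
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, stride=2, padding=0)

# Assumed 7x7 RGB image (batch of 1); the lecture does not state the input size,
# but 7x7 gives the 3x3 output worked out above: (7 - 3) / 2 + 1 = 3.
x = torch.randn(1, 3, 7, 7)
y = conv(x)
print(y.shape)   # torch.Size([1, 1, 3, 3]) -> one channel, 3 pixels high, 3 wide

# Two filters would stack a second channel along the depth axis:
conv2 = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3, stride=2, padding=0)
print(conv2(x).shape)  # torch.Size([1, 2, 3, 3])
```

The stride of two is what halves the spatial size here; the number of filters only changes the depth of the output.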