2:43

So why is this notation convenient?

It is convenient because the updates are easy to carry out in this notation.

Because SGD actually

just subtracts these derivatives, scaled by the learning rate, from the weights from the previous step.

And that is how we make our step in gradient descent, right?

And if you replace our coefficients with a matrix and

our gradient step with a matrix as well, then

you come up with a matrix addition or subtraction.

So it can be done efficiently in NumPy as well, without any Python for-loops.

So it's pretty cool to have it in matrix form.
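To make this concrete, here is a minimal NumPy sketch of one SGD step as a single matrix subtraction; the shapes (3 features, 2 neurons) and the learning rate are my own assumed values, not taken from the slides.

```python
import numpy as np

np.random.seed(0)

# Assumed shapes: 3 input features, 2 output neurons.
W = np.random.randn(3, 2)       # current weight matrix
dL_dW = np.random.randn(3, 2)   # gradient of the loss w.r.t. W, same shape
lr = 0.1                        # learning rate (an assumed value)

# One SGD step is just an element-wise matrix subtraction -- no Python loops.
W_new = W - lr * dL_dW
```

The whole update runs as one vectorized operation, which is exactly the point being made in the lecture.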

One more thing is that you can apply the chain rule.

And you can actually write that chain rule in terms of matrices as well.

But let's drill down to that.

Let's try to calculate one element of that matrix dL/dW.

For example, let's try to find the derivative dL/dW_ij.

And to do that,

we need to apply the chain rule, because our loss function is a function of Z_1 and Z_2.

So we need to go through them when we apply the chain rule, and

you can actually see that to get to W_ij,

you will effectively pass only through Z_j;

you will not go through the other neuron, because it doesn't use those weights.

4:07

So effectively, we have something like this:

the derivative dL/dW_ij is actually dL/dZ_j multiplied by X_i.

This is, in turn, thanks to how Z_j is computed.

Now, you can actually see that this notation can

be squeezed into matrix notation.

We can actually come up with a gradient vector of our loss with respect to

our outputs; this is dL/dZ.

And it is called a gradient vector because each element of this vector is

the derivative of the loss with respect to the corresponding coordinate.

And using this vector, you can rewrite our dL/dW in matrix notation.

You can actually see that dL/dW is X transpose multiplied by dL/dZ.

You can pause the video and verify that that is true.
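If you prefer to verify it numerically rather than on paper, here is a small NumPy sketch; the loss L = sum(z) and the shapes are my own assumptions, chosen only so that dL/dZ is easy to write down.

```python
import numpy as np

np.random.seed(1)

# One input row x (1 x 3) and a linear layer z = x @ W, with W of shape (3, 2).
x = np.random.randn(1, 3)
W = np.random.randn(3, 2)

# Assume the loss is L = sum(z); then dL/dZ is simply a matrix of ones.
dL_dz = np.ones((1, 2))

# The matrix rule from the lecture: dL/dW = x^T @ dL/dZ.
dL_dW = x.T @ dL_dz

# Check one element against the scalar rule dL/dW_ij = dL/dZ_j * X_i.
i, j = 2, 1
assert np.isclose(dL_dW[i, j], dL_dz[0, j] * x[0, i])
```

Every element of `dL_dW` matches the scalar chain-rule formula, which is the claim on the slide.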

So we can see that you can do the backward pass for

an MLP efficiently with matrix multiplication.

This is cool because you can do that with NumPy and strip away those Python loops.

And you can also do that matrix multiplication efficiently on GPU.

Now, let's see what happens when you have not one x row, but

a lot of them.

For example,

two of them, because we usually do stochastic gradient descent in mini-batches, right?

So we have a lot of elements.

And let's check that our matrix multiplication paradigm still works here.

You can actually verify that to get the first neuron for

the second sample, for the second row in matrix X,

you have to do just a matrix multiplication.

So you take the second row of matrix X and the first column of matrix W.

You take their dot product, and that is how you get the first neuron for

the second sample; it just works.
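The batched forward pass described above can be sketched in a few lines of NumPy; the batch size of two and three features are the assumed shapes from the example.

```python
import numpy as np

np.random.seed(2)

X = np.random.randn(2, 3)   # two samples (rows), three features each
W = np.random.randn(3, 2)   # three inputs, two neurons

Z = X @ W                   # forward pass for the whole batch at once

# First neuron of the second sample = dot(second row of X, first column of W).
assert np.isclose(Z[1, 0], X[1] @ W[:, 0])
```

The same single matrix multiplication covers every sample in the batch, with no per-sample loop.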

Let's move to the backward pass.

Here, the problem is a little bit more difficult.

6:08

To apply SGD with mini-batches, we actually need to take the loss over

all the elements in our batch, right?

So for an SGD step, we have a loss on our batch.

Let's denote it as L_b.

And we actually need to calculate the derivative of that loss with respect

to our parameters W.

And to do that, you take the derivative of every per-sample loss that we have, and

you just sum them up.

So we just applied the rule here,

that the derivative of the sum is the sum of the derivatives.

Now, let's look at one summand in that sum.

Let's see how we can calculate the derivative of

loss function with respect to Wij.

And we already know how to do that because we have already done that like two

slides before.

And let's use this notation to come up with the rule for

the mini-batch backward pass.

For two samples, you can actually see that to calculate dL_b/dW_ij,

you need to apply that known rule two times.

So you take dL/dZ_1j multiplied by X_1i, plus dL/dZ_2j multiplied by X_2i, and so forth.

So you can actually see that what we got is really similar to dot product, right?

And matrix multiplication is all about dot product.

So maybe we can come up with some matrices that will give us this

result in terms of matrix multiplication.

And you can actually find those matrices.

You can actually see that dL_b/dW can be computed in matrix

notation by just taking X transpose and multiplying it by dL/dZ.

Here dL/dZ is a known thing; it's the derivative

of a scalar with respect to a matrix, and we know how to compute that:

you just do it element-wise.

X transpose is a simple thing as well:

you just replace rows with columns, and here you are.

Now, let's just check that this rule actually works.

Let's check it for W_3,2.

Let's check that if you take the third row of matrix X transpose and

take its dot product with the second column of matrix dL/dZ,

that will yield the formula that we have come up with.

You can pause the video and check that it actually is correct, so it works.
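Here is the same check done numerically in NumPy; the batch of two samples with three features is the assumed setup from the example, and the incoming gradient dL/dZ is filled with random values since its actual entries don't matter for the check.

```python
import numpy as np

np.random.seed(3)

X = np.random.randn(2, 3)        # mini-batch: two samples, three features
dL_dZ = np.random.randn(2, 2)    # incoming gradient, one row per sample

dL_dW = X.T @ dL_dZ              # the matrix rule for the whole batch

# Check element W_{3,2} (zero-based index [2, 1]) against the per-sample sum:
# dL_b/dW_32 = dL/dZ_12 * X_13 + dL/dZ_22 * X_23.
manual = dL_dZ[0, 1] * X[0, 2] + dL_dZ[1, 1] * X[1, 2]
assert np.isclose(dL_dW[2, 1], manual)
```

The matrix product reproduces the two-term sum from the slide, element by element.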

Unfortunately, you also need to calculate the derivative of the loss

function with respect to X, and this is where it gets a little bit tricky.

Let's apply the chain rule; the approach is standard,

so let's try to apply the chain rule element-wise.

So let's take for example, object i and

let's try to calculate the derivative of a loss

on that object with respect to some feature j of that object.

So how to do that?

Let's apply chain rule.

So first, to get to X, we need to go through Z, right?

Because the chain rule is just a path in the graph.

So let's write that out, and

let's notice that dZ_ik/dX_ij is actually W_jk.

This is thanks to the fact

that Z is just a linear combination of the values of the features X.

Then, let's replace w with w transpose and

let's swap the indices of that element.

Then, you can actually see that to calculate the derivative of our batch

loss with respect to X, you need to take dL/dZ and

multiply it by W transpose.

Why does that work?

Because you can notice that to get dL_b/dX, you need to take a sum

of the derivatives of our per-sample losses with respect to matrix X, right?

One term for every instance in our batch.

And you can also notice that each instance actually

gives you only one non-zero row in dL_i/dX.

Because that instance is only dependent on its own features, right?

So that means that effectively, when we are doing a sum of those rows,

we're not really summing anything:

we just take all those rows and stack them into a matrix.

So that's why this matrix notation really works.
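This input-gradient rule can also be verified numerically; as before, the shapes and the random incoming gradient are my own assumptions, used only to exercise the formula dL/dX = dL/dZ @ W^T.

```python
import numpy as np

np.random.seed(4)

X = np.random.randn(2, 3)        # mini-batch of two samples
W = np.random.randn(3, 2)        # weights of the dense layer
dL_dZ = np.random.randn(2, 2)    # incoming gradient from the layer above

dL_dX = dL_dZ @ W.T              # the matrix rule for the input gradient

# Check one element via the chain rule: dL/dX_ij = sum_k dL/dZ_ik * W_jk.
i, j = 1, 0
manual = sum(dL_dZ[i, k] * W[j, k] for k in range(2))
assert np.isclose(dL_dX[i, j], manual)
```

Note that row i of `dL_dX` depends only on row i of `dL_dZ`, which is exactly the "stacked rows, no real summation" observation from the lecture.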

So this is pretty cool.

You can see that you can apply the backward pass and the forward pass for

mini-batches, or just for one instance, pretty efficiently.

You can do that with matrix multiplication, and you can do that with NumPy.

Let's summarize on one slide what we have come up with,

and how to implement that in NumPy.

The forward pass for a dense layer is done pretty easily.

The forward pass just takes all your inputs, that is, the features and the weights,

and computes their dot product.

So it does a matrix multiplication, because that is how we

do the forward pass.

The backward pass is pretty easy as well.

And let me remind you that in the backward pass interface,

we also pass in the incoming gradient.

And that is where it gets pretty neat, because we need that

incoming gradient to calculate dL/dX and dL/dW efficiently.

And we actually write out those formulas: you can take dL/dZ and

multiply it either by W transpose or by

X transpose to get the derivatives with respect to X or W.

So this is implemented pretty efficiently with NumPy as well.
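Putting the forward and backward formulas together, here is a minimal dense-layer sketch in NumPy; the class and method names are my own assumptions and not the course's actual assignment interface.

```python
import numpy as np

class Dense:
    """A minimal dense-layer sketch (names are assumptions, not the course API)."""

    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_in, n_out) * 0.01

    def forward(self, X):
        self.X = X                       # cache the input for the backward pass
        return X @ self.W                # forward pass: one matrix multiplication

    def backward(self, dL_dZ):
        # The incoming gradient dL/dZ lets us avoid any matrix-by-matrix derivative.
        self.dL_dW = self.X.T @ dL_dZ    # gradient w.r.t. the weights
        return dL_dZ @ self.W.T          # gradient w.r.t. the inputs, passed upstream

layer = Dense(3, 2)
X = np.random.randn(2, 3)                # a mini-batch of two samples
Z = layer.forward(X)
dL_dX = layer.backward(np.ones_like(Z))  # pretend dL/dZ is all ones
```

Both passes are single matrix multiplications, which is what makes them fast in NumPy and on GPUs.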

And you can notice one more reason why we use dL/dZ in the backward pass interface.

Because otherwise, we would have to calculate, for example, dZ/dX.

And this is something scary, because Z and X are both matrices,

and it's not clear how to calculate the derivative of a matrix

with respect to another matrix; it's a no-go.

So we should have an incoming gradient, and thanks to the incoming gradient,

we can do this efficiently with matrix multiplication.

To summarize, you can do forward pass for

a dense layer with a matrix multiplication.

You can do a backward pass with matrix multiplication as well.

And this is pretty cool, and this is where GPU comes into play.

Because on GPUs, you can crunch matrices pretty fast.

What's more, it's easy to code with NumPy and, as a matter of fact,

we have an honors assignment for those of you who want to do that with NumPy.

In the next video, we will take a quick look at other matrix derivatives.

They're scary, but you should know about them.
