Hello again. In this video, I'll show you how you can find the weights for your linear layer and for your word embeddings. To find these two matrices of weights and the bias vectors, you'll have to minimize the cost, and to minimize the cost, you will use two techniques.

The first one is backpropagation. Backpropagation is an algorithm that calculates the partial derivatives, or gradients, of the cost with respect to the weights and biases of the neural network. Remember that the cost J_batch is a function of the weights and biases. The "back" in backpropagation comes from the way the formulas for the derivatives are obtained, using the chain rule for derivatives. Without going into too much detail, using the chain rule involves starting from the output layer and then working your way back through the layers, reusing derivatives that you have already calculated. By the way, backpropagation is a prime example of dynamic programming, which you learned about during the first week of this course.

The second technique you will use is gradient descent, which adjusts the weights and biases of the neural network using the gradients in order to minimize the cost. I won't go into the mathematical details of backpropagation or gradient descent, but if you're interested in learning more, check out a course on neural networks to understand how the various formulas are derived.

First, let's use backpropagation to calculate the partial derivatives of the cost function that you'll need to perform gradient descent. You'll need the partial derivatives of the cost function with respect to the parameters of the neural network, which in this case are W1, W2, b1, and b2. The calculations are long-winded, and in practice you'll use machine learning libraries to handle backpropagation for you, so I'll just give you the formulas, which you'll get to implement in this week's assignment. The derivative of the cost function with respect to the weight matrix W1 is as follows. This is the formula for the partial derivative of the cost function with respect to W2, and here's the formula for the bias vector b1.

Let me pause here for a second. In this formula, I've introduced the 1_m vector, which is a row vector with m elements that are all 1s. If you multiply a matrix A that has m columns by the 1_m vector transposed, you get a column vector where each element is equal to the sum of the elements in the corresponding row of A. Having said that, the 1_m vector is introduced mostly for formal purposes, so that the formulas can be written out mathematically. In practice, when you implement this in Python, the easiest way to calculate these sums is to use NumPy's sum function, without using the 1_m vector. In the sum function, you need to specify that you want to sum across the columns with the parameter axis equal to 1, and you will also need to set the keepdims parameter to True, so that the result can be broadcast into a matrix of whatever size is necessary later in the calculations. Finally, here is the formula for b2, which also uses the 1_m vector.

Now, for the purposes of this course, you don't have to understand how these formulas were derived; you can just use them, okay? So now that you have these gradients, you can use gradient descent to update the weight matrices and bias vectors. The calculations include a learning rate alpha, which is a hyperparameter of your model, and here are the formulas to update the weights and biases.
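Since the formulas themselves appear on the slides rather than in this transcript, here is a minimal NumPy sketch of the backpropagation step as it is commonly written for this CBOW setup, assuming a ReLU hidden layer, a softmax output, and batch matrices whose columns are the m training examples. The function and variable names (back_prop, y_hat, h, and so on) are illustrative placeholders rather than the exact notation from the slides, and the last two lines show both the formal 1_m version and the practical np.sum version of the row-wise sums.

```python
import numpy as np

def relu(z):
    # Element-wise ReLU, used for the hidden layer of the CBOW model.
    return np.maximum(0, z)

def back_prop(x, y_hat, y, h, W2, m):
    # x:     V x m matrix of context-word input vectors (one column per example)
    # y_hat: V x m matrix of softmax outputs
    # y:     V x m matrix of one-hot center words
    # h:     N x m matrix of hidden-layer activations
    # m:     batch size

    # Error at the output layer, then propagated back through W2 and the ReLU.
    l1 = y_hat - y
    l2 = relu(np.dot(W2.T, l1))

    # Gradients of the cost with respect to the two weight matrices.
    grad_W1 = np.dot(l2, x.T) / m
    grad_W2 = np.dot(l1, h.T) / m

    # Multiplying by the transposed 1_m vector of ones sums each row;
    # np.sum with axis=1 and keepdims=True produces the same column of sums.
    ones_m = np.ones((m, 1))
    grad_b2 = np.dot(l1, ones_m) / m                  # formal version with 1_m
    grad_b1 = np.sum(l2, axis=1, keepdims=True) / m   # practical NumPy version

    return grad_W1, grad_W2, grad_b1, grad_b2
```

With a vocabulary of size V and a hidden layer of size N, the four gradients come out with the same shapes as W1, W2, b1, and b2, so they can be fed straight into the gradient descent update.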
The idea is to take the original parameters and then subtract alpha times their gradients. Since alpha is chosen to be a small positive number less than 1, the effect of multiplying by alpha is to reduce the amount by which each variable is updated at each step. A smaller alpha allows for more gradual updates to the weights and biases, whereas a larger value allows for faster updates. So, for instance, the new weight matrix W1 would be equal to the original W1 minus alpha times the partial derivative of the cost with respect to W1, which you calculated during the backpropagation step (see the sketch below). You now know everything you need to train a continuous bag of words model. In the next video, you'll learn how to extract word embedding vectors from a trained continuous bag of words model.
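To make that update concrete, here is a small sketch of one gradient descent step, continuing with the gradients from the earlier sketch; the default learning rate of 0.03 is only an illustrative value, not a recommendation from the lecture.

```python
def gradient_descent_step(W1, W2, b1, b2,
                          grad_W1, grad_W2, grad_b1, grad_b2,
                          alpha=0.03):
    # Move each parameter a small step against its gradient to reduce the cost.
    W1 = W1 - alpha * grad_W1
    W2 = W2 - alpha * grad_W2
    b1 = b1 - alpha * grad_b1
    b2 = b2 - alpha * grad_b2
    return W1, W2, b1, b2
```

Repeating the forward pass, backpropagation, and this update over many batches gradually drives the cost down.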