So in order to get an idea how neural networks are trained, we'll use a very simple neural network in Python. First of all, let's recap how to so-called feedforward path for feedforward network is computed. Remember, that you have an input vector X and the label vector Y. Especially in the case of classification, Y is a vector. For regression, Y is often a scalar but for the math it doesn't matter, since we are working with vectors and matrices anyway. So, the task of a neural network now is to compute Y_hat from an input vector X using the weight matrices W1 and W2 such that Y_hat and Y are as close as possible for all vectors in the training set X. Let's actually have a look at the map. So to compute from left to right, the first thing you do is multiply X by W1 to obtain C2. Since we are squashing intermediate layer calculations with activation functions, we have to apply the activation function to C2 to obtain A2. Then we compute the output layer. So note that we are multiplying A2 by W2. This is some sort of stacking of the computations on top of each other. And finally, we compute Y_hat by applying the activation function to C3 and we are done. So our weight matrices W1 and W2 are still randomly initialized. Therefore, Y_hat and Y are different. So now, we have to train this neural network to do a better job and training basically means updating the weights in order to obtain the best Y_hat on a given X with respect to Y for all data points in X. But how do we find the optimal values for the weights? This is the plot, our two weight values and the cost. So the cost basically tells you how could or better your weights are chosen. The lower the cost, the better neural network performance. The cost function is called tree. Note that it depends on W and T. So W is the weight matrix of the neural network. This is the one you want to optimize over. And T stands for a training dataset, which stays always the same. Therefore, tree is dependent on W. This is a two-dimensional example, but in reality, this so-called optimization space is very high-dimensional and impossible to plot. So, how can we now find good values for W? Of course, we can check every possible combination of values for W or can't we? So one way of doing this is called Grid Search. So, in this example, we choose a discrete set of values within W1 and W2 and just apply the cost function tree to all of those combinations and the one combination with the lowest cost is an optimal value set for W. Now, imagine that each individual scalar weight can be assigned to a value between minus 1 and 1. Now, we create a grid over this continuous range with the step size of 0.1. So this basically forms a grid, therefore, the method is called Grid Search. And now imagine, we have only a single scalar weight value to optimize over. So, we have to test 21 individual values. If we add one more weight, we already have to test 411, in three dimensions, we are on 9,261 combinations. In four dimensions, we already have to test and compute 194,481 values. And as we see, this number keeps growing exponentially. With only 100 dimensions, you have to test more combinations than the atoms in the universe. So, this obviously is not a good way for neural network training. So, what if we not strictly follow a grid and just randomly select value combinations for the weights. So it turns out that this method, which is called Monte Carlo, outperforms Grid Search. But still in order to get a good approximation of the cost function, we need to test a lot of values which is not feasible for neural network training. So what if we had a possibility when evaluating a single value to know in which direction we should go in order to test for the next value. We then could climb down the ladder one step by another until we reach the optimal value. This method is called Gradient Descent and with all its variants, it's [inaudible] for neural network training. So how does this work? Let's have a look at our cost function tree. In this case, the so-called quadratic cost, which basically sums the squares of the differences between the prediction of the neural network and the actual real value. This difference is called error. Therefore, this is also called a sum of squared errors. The axis of a cost function as well. The cross entropy, exponential cost, hellinger distance, and many more. Choosing an appropriate cost function depends on your data and neural network topology and is part of the so-called hyperparameter space. An idea selection is often considered as black magic, in other words, trial and error. So if we now have a look, how Y_hat is calculated, we notice that this is dependent on the activation function F and on C3. C3 is calculated by the activation of the previous layer multiplied by the weights of the actual layer. If you replace the two by its definition, you obtain the following form and if you replace C2, it's X multiplied by the weight of layer 1 and we obtain this form. Finally, you put this form into tree and we obtain the cost function tree based on all individual computations of the forward pass of the neural network. So, now, we will use the so-called backpropagation method. This calculates the neural network computations backwards and it does so by back propagating the error Y minus Y_hat. Intuitively, this means that on the reverse computation of the neural network, big weight values end up in back propagating big amounts of the error to the upstream layers, where a small weight value only let small amounts of the error being back propagated. So let's assume, on a given data row, we end up with a certain amount of error indicated by red circle. So, if we now backpropagate this error, we use the weight to determine how much of this error should be back propagated. Since the weight is high, we also increase the error, which then gets back propagated. Now, if the weight is small, the error also gets reduced and the amount of error for that particular back propagation to the upstream layer neuron is less. This is the desired behavior since big weights are contributing more to the overall error than small weights. So, this is nice but what we actually want is not the error or cost on each neuron, but the gradient of each error. If we know the gradient, we know in which direction we have to update each weight in order to go down here to cross the hypersurface. Therefore, we compute the partition derivative with respect to W2 first. I skipped the linear algebra and the calculus here, but it turns out that derivative is A2 transpose times delta 3, which is based under reverse computation, which is also called backpropagation. So note that F_prime is the first derivative of the activation function F. Now, we continue in back propagating the error and taking the first derivative of that function in parallel to obtain delta 2. And finally, we get the partial derivative with respect to W1. And again, the mathematical reason why this is computed using the transposed of X times delta 2 is beyond the scope of this class and due to the Chain rule in calculus. So as you have seen, training the neural network can be quite complicated. But it's okay if you didn't understand all the details, just note the gradient descent does not make any guarantees that it converges just to the global optimum unless the cost function is convex, which is not the case in neural networks. So it might get stuck in local optimas and even has not any guarantees to convert it all. Therefore, a lot of hyperparameter tuning might be necessary. In the next module, we will cover TensorFlow, one of the hottest deep learning frameworks out now.