then through the output layer to get the final output y.

We see here that the output y is a function of the parameters theta.

And remember, that theta comprises the weights and

the biases of our affine transformations inside the network.

Next, we compare our predicted output f of x and theta with the correct output,

f star of x through the loss function.

Remember that the loss function measures how large the error is between the network

output and our true output.

Our goal is to get a small value for the loss function across the entire data set.

We do so by using the loss function as a guide to produce a new set of parameters

theta that are expected to give a lower value of the loss function.

Specifically, we use the gradient of the loss function to modify

the parameters theta.

This optimization procedure is known as gradient descent.

Before describing gradient descent in detail,

let's take another look at the neural network loss function.

Usually, we have thousands of training example pairs,

x and f star of x, available for autonomous driving tasks.

We can compute the loss over all training examples,

as the mean of the losses over the individual training examples.

We can then compute the gradient of the training loss with

respect to the parameters theta which is equal to the mean of the gradient

of the individual losses over every training example.

Here we use the fact that the gradient and the sum are linear operators.

So the gradient of a sum is equal to the sum of the individual gradients.
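As a quick numerical check of this linearity, here is a toy one-parameter model with a squared-error loss (the data and names are illustrative, not from the lecture): the gradient of the mean loss equals the mean of the per-example gradients.

```python
import numpy as np

# Toy model: prediction y = theta * x, per-example loss L_i = (theta*x_i - y_i)^2.
# Gradient of L_i with respect to theta: 2 * (theta*x_i - y_i) * x_i.
theta = 0.5
x = np.array([1.0, 2.0, 3.0])
y_true = np.array([2.0, 4.0, 6.0])

per_example_grads = 2.0 * (theta * x - y_true) * x

# Because the gradient is a linear operator, the gradient of the mean loss
# is just the mean of the individual gradients.
grad_of_mean = np.mean(per_example_grads)
```

A finite-difference check on the mean loss gives the same number, confirming that differentiating the mean and averaging the per-example gradients are interchangeable.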

Using the formulated gradient equation,

we can now describe the batch gradient descent optimization algorithm.

Batch gradient descent is an iterative, first-order optimization method.

Iterative means that it starts from an initial guess of parameters theta and

improves on these parameters iteratively.

First order means that the algorithm only uses the first order derivative to

improve the parameters theta.

Batch gradient descent goes as follows.

First, the parameters theta of the neural network are initialized.

Second, a stopping condition is determined,

which terminates the algorithm and returns a final set of parameters.

Once the iterative procedure begins,

the first thing to be performed by the algorithm is to compute the gradient of

the loss function with respect to the parameters theta, denoted del sub theta.

The gradient can be computed using the equation we derived earlier.

Finally, the parameters theta are updated according to the computed gradient.

Here, epsilon is called the learning rate and controls how much we adjust

the parameters in the direction of the negative gradient at every iteration.
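The full procedure can be sketched in a few lines of NumPy on a toy linear model; the data, learning rate, and iteration count here are illustrative choices, not values from the lecture.

```python
import numpy as np

# Toy regression problem: find theta minimizing the mean squared error
# J(theta) = mean over examples of (x_i . theta - y_i)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_theta = np.array([1.5, -2.0])
y = X @ true_theta

theta = rng.normal(size=2)      # step 1: initialize the parameters theta
learning_rate = 0.1             # epsilon in the lecture's update rule
for _ in range(200):            # step 2: stopping condition (iteration cap)
    # gradient of the mean loss over the WHOLE training set (the "batch")
    grad = 2.0 * X.T @ (X @ theta - y) / len(y)
    # update step: move in the direction of the negative gradient
    theta = theta - learning_rate * grad
```

Each pass through the loop computes one full-batch gradient and takes one step down the loss surface.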

Let's take a look at a visual example of batch gradient descent in the 2D case.

Here, we are trying to find the parameters theta one and

theta two that minimize our function J of theta.

The loss function is shaped like an oblong bowl, shown here with contour lines of equal value.

Gradient descent iteratively finds new parameters theta that take us a step down

the bowl at each iteration.

The first step of the algorithm is to initialize the parameters theta.

Using our initial parameters, we arrive at an initial value for

our loss function denoted by the red dot.

We start gradient descent by computing the gradient of the loss function

at the initial parameter values theta 1 and theta 2.

Using the update step,

we then get the new parameters to arrive at a lower point on our loss function.

We repeat this process until we meet our stopping criterion.

We then take the final set of parameters, theta 1 and

theta 2, as our optimal set that minimizes our loss function.

Two pieces are still missing from the presented algorithm.

How do we initialize the parameters theta, and

how do we decide when to actually stop the algorithm?

The answer to both of these questions is still highly based on heuristics that work

well in practice.

For parameter initialization, we usually initialize the weights using a standard

normal distribution and set the biases to 0.
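This initialization heuristic is simple enough to sketch directly; the layer sizes and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(n_in, n_out):
    """Initialize one affine layer: weights ~ N(0, 1), biases set to zero."""
    W = rng.standard_normal((n_out, n_in))  # weights from a standard normal
    b = np.zeros(n_out)                     # biases start at 0
    return W, b

# Example: a layer mapping 4 inputs to 8 hidden units.
W1, b1 = init_layer(4, 8)
```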

It is worth mentioning that there are other heuristics specific to certain

activation functions that are widely used in the literature.

We provide some of these heuristics in the supplementary material.

Defining the gradient descent's stopping conditions is a bit more complex.

There are three ways to determine when to stop the training algorithm.

Most simply, we can decide to stop when a predefined

maximum number of gradient descent iterations is reached.

Another heuristic is based on how much the parameters theta

changed between iterations.

A small variation means the algorithm is not updating the parameters effectively

anymore, which might mean that a minimum has been reached.

The last widely used stopping criterion is the change in the loss function value

between iterations.

Again, as the changes in the loss function between iterations become small,

the optimization is likely to have converged to a minimum.
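The three stopping heuristics can be combined in a single check; the threshold values and names below are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

def should_stop(iteration, theta, prev_theta, loss, prev_loss,
                max_iters=1000, param_tol=1e-6, loss_tol=1e-8):
    """Return True if any of the three stopping heuristics fires."""
    if iteration >= max_iters:                          # 1) iteration cap reached
        return True
    if np.linalg.norm(theta - prev_theta) < param_tol:  # 2) parameters barely moved
        return True
    if abs(loss - prev_loss) < loss_tol:                # 3) loss barely changed
        return True
    return False
```

In practice one would call this at the end of each gradient descent iteration and break out of the loop when it returns True.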

Choosing one of these stopping conditions is very much a matter of what

works best for the problem at hand.

We will revisit the stopping conditions in the next lesson,

as we study some of the pitfalls of the training process, and how to avoid them.

Unfortunately, the batch gradient descent algorithm suffers from severe drawbacks.

To be able to compute the gradient, we use backpropagation.

Backpropagation involves computing the output of the network for

the example on which we would like to evaluate the gradient.

And batch gradient descent evaluates the gradient over the whole training set,

making it very slow to perform a single update step.

Luckily, the loss function, as well as its gradient, is a mean over the training

dataset.

For example, we know that the standard error in a mean estimated from

a set of N samples is sigma over the square root of N,

where sigma is the standard deviation of the distribution and

N is the number of samples used to estimate the mean.

That means that the rate of decrease in error in the gradient estimate is less

than linear in the number of samples.
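To make the sublinear rate concrete, here is the sigma over square root of N formula evaluated at two sample sizes (sigma is an arbitrary illustrative value): multiplying the number of samples by 100 only divides the error by 10.

```python
import numpy as np

sigma = 2.0  # illustrative standard deviation of the loss gradient distribution

# Standard error of the mean for two sample sizes differing by a factor of 100.
errors = {n: sigma / np.sqrt(n) for n in (100, 10_000)}
```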

This observation is very important, as we now can use a small sub-sample

of the training data or a mini batch to compute our gradient estimate.

So how does using mini batches modify our batch gradient descent algorithm?

The modification is actually quite simple.

The only alteration to the base algorithm is at the sampling step.

Here we choose a subsample of size N prime of the training data as our mini batch.

We can now evaluate the gradient and

perform the update steps in an identical manner to batch gradient descent.

This algorithm is called stochastic or minibatch gradient descent,

as we randomly select samples to include in the minibatches at each iteration.
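Compared with the batch version, only the sampling step changes; a minimal sketch (data, batch size, and other names are illustrative) looks like this.

```python
import numpy as np

# Same toy regression setup as before, but with minibatch sampling.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_theta = np.array([0.5, -1.0, 2.0])
y = X @ true_theta

theta = np.zeros(3)
learning_rate = 0.05
batch_size = 32                 # the new hyperparameter introduced by minibatching
for _ in range(500):
    # sampling step: randomly pick a minibatch of the training data
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # gradient estimated on the minibatch only, then the usual update step
    grad = 2.0 * Xb.T @ (Xb @ theta - yb) / batch_size
    theta -= learning_rate * grad
```

Each iteration is far cheaper than a full-batch step, at the cost of a noisier gradient estimate.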

However, this algorithm results in an additional parameter to be determined,

which is the size of the minibatch that we want to use.

To pick an appropriate minibatch size, note that some kinds of hardware

achieve better runtime with specific sizes of data arrays.

Specifically, when using GPUs, it is common to use power-of-two minibatch sizes,

which match the GPU compute and memory architecture well and

therefore use GPU resources efficiently.

Let's look at some of the factors that drive batch size selection.

Multi-core architectures such as GPUs are usually under-utilized by extremely

small batch sizes, which motivates using some absolute minimum batch size below

which there's no reduction in the time to process a minibatch.

Furthermore, large batch sizes usually provide a more accurate estimate of

the gradient.

Ensuring descent in a direction that improves the network

performance more reliably.

However as noted previously,

this improvement in the accuracy of the estimate is less than linear.

Small batch sizes, on the other hand, have been seen to offer a regularizing effect,

with the best generalization often seen at a batch size of one.

If you're not sure what we mean by generalization, don't worry,

as we'll be exploring it more closely in the next lesson.

Furthermore, optimization algorithms usually converge more quickly if they're

allowed to rapidly compute approximate estimates of the gradients and iterate

more often rather than computing exact gradients and performing fewer iterations.

As a result of these trade-offs, typical power of two mini batch sizes range

from 32 to 256, with smaller sizes sometimes being attempted for

large models or to improve generalization.

One final issue to keep in mind is the requirement to shuffle the dataset before

sampling the minibatch.

Failing to shuffle the dataset at all can reduce the effectiveness of your network.

There exist many variants of stochastic gradient descent in the literature,

each having their own advantages and disadvantages.

It might be difficult to choose which variant to use, and

sometimes one of the variants works better for a certain problem than another.

As a simple rule of thumb for

autonomous driving applications, a safe choice is the ADAM optimization method.

It is quite robust to the initial parameters theta, and widely used.
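The lecture does not give Adam's update equations, but a minimal single-step sketch in NumPy, using the commonly cited default hyperparameters from the original Adam paper (Kingma and Ba), looks like this; all names here are illustrative, not a reference implementation.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m, v are the running moment estimates; t starts at 1."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)             # bias correction for the warm-up phase
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The per-coordinate scaling by the square root of v_hat is what makes Adam relatively insensitive to the initial parameters and to the raw gradient magnitudes.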

If you are interested in learning more about these variants,

have a look at the resources listed in the supplemental notes.

In this lesson, you learned how to optimize the parameters

of a neural network using batch gradient descent.

You also learned that there are a lot of proposed variants of this optimization

algorithm, with a safe default choice being ADAM.

Congratulations, you've finished the essential steps required to build and

train a neural network.

In the next lesson, we will discuss how you can choose some of the optimization

parameters to improve network training, such as the learning rate.

Also we'll discuss how to evaluate the performance of our neural network using

validation sets.

See you next time.
