This is elementary college calculus, but just to show the steps on one slide: push the partial derivative through the sum and then apply the chain rule to get these two factors. Remember that h, our model, is just a linear model, so we multiply the x value by the slope parameter and add in the y-intercept. The only term in here that involves theta_1 is theta_1 times x_k; the other terms don't involve theta_1 at all, so we can drop them, since their derivative is 0. That term simplifies to x_k itself, and so (h(x_k) - y_k) x_k is the quantity that drives the update rule we're looking for.

So this is just putting the starting point and the final point together on one slide. The overall update rule is theta^(i+1) equals theta^(i) minus the learning rate times the sum over k of (h(x_k) - y_k) x_k, that is, the value we predict for x_k minus the true value, times x_k. And overall the program is: while not converged, apply these update rules. Update theta_0, then update theta_1, then check for convergence. That is the program you could write if you were going to implement this by hand.

One detail here: why isn't the theta_0 update multiplied by x_k? The reason is that theta_0 is the y-intercept, so there is no x value to multiply by; it's effectively multiplied by 1.0. You can think of this as adding an extra attribute. If you have a single attribute, so we're trying to predict y from x, you can think of there actually being two attributes: one is our x values, and the other is a constant attribute whose value is 1.0 for every single instance in our data. It has no effect on the learning, but it allows us to have an x variable for every parameter. So there's a trivial x_0 here, and you can just assume that x_0 is always 1, in which case it would also appear in the theta_0 update.

Some questions arise here, because there are still some parameters that have leaked into this process. There's the initialization step, where we decide where to start, and we have to choose that point by some mechanism. You might use small random values, or you might always pick the same point; random, as usual, gives you a little bit of robustness. You also need to decide the step size, and this procedure is notoriously sensitive to the choice of step size. We're not doing a continuous roll down the gradient; we actually have to jump, so we have to figure out how far to jump. If you jump too far, you might skip over the minimum and come out with a higher function value rather than a lower one. Some methods for setting the learning rate adapt it based on how much of a change in the error you've been getting: if you're rolling through a relatively flat region you might take a bigger jump, while if you're going through a really steep region you might take a smaller step, in order to keep the gain in error per step approximately constant. That's one trick to avoid the decision about the step size; otherwise, people tune it for the specific application. And if the step size is too small, you'll still follow the gradient down, but you'll take a very long time to get there, and you have to factor the convergence test in as well, whether the change is still above the threshold.
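To make the derivation on the slide concrete, here is a short worked version. The slide itself isn't reproduced here, so I'm assuming the objective is the usual sum-of-squared-errors, E(theta_0, theta_1) = (1/2) sum_k (h(x_k) - y_k)^2, with h(x) = theta_1 x + theta_0; the 1/2 is a common convenience factor that cancels the 2 produced by the chain rule.

    \begin{aligned}
    \frac{\partial E}{\partial \theta_1}
      &= \frac{\partial}{\partial \theta_1}\,\frac{1}{2}\sum_k \bigl(h(x_k) - y_k\bigr)^2
       && \text{(derivative pushed through the sum)}\\
      &= \sum_k \bigl(h(x_k) - y_k\bigr)\,\frac{\partial}{\partial \theta_1}\bigl(\theta_1 x_k + \theta_0\bigr)
       && \text{(chain rule)}\\
      &= \sum_k \bigl(h(x_k) - y_k\bigr)\,x_k
       && \text{(only $\theta_1 x_k$ depends on $\theta_1$)}
    \end{aligned}

The theta_0 case is the same except that the inner derivative is 1 rather than x_k, which is exactly the trivial constant attribute x_0 = 1 mentioned above: partial E / partial theta_0 = sum_k (h(x_k) - y_k) * 1.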
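And here is a minimal sketch in Python of the by-hand program described above: initialize, loop the two update rules, and check for convergence. The names (train, alpha, tol, max_iters) are my own, not from the lecture, and the small-random initialization, learning rate, and convergence threshold are illustrative choices, not prescribed values.

    import random

    def train(xs, ys, alpha=0.01, tol=1e-6, max_iters=100_000):
        # Initialize with small random values; a random start gives a
        # little robustness compared with always starting at the same point.
        theta0 = random.uniform(-0.01, 0.01)  # y-intercept
        theta1 = random.uniform(-0.01, 0.01)  # slope

        def h(x):
            # The linear model: slope times x, plus the y-intercept.
            return theta1 * x + theta0

        prev_error = float("inf")
        for _ in range(max_iters):
            # Partial derivatives of the squared error. The theta0 term
            # is multiplied by the trivial constant attribute x0 = 1.0.
            grad0 = sum((h(x) - y) * 1.0 for x, y in zip(xs, ys))
            grad1 = sum((h(x) - y) * x for x, y in zip(xs, ys))

            # Jump downhill: subtract the learning rate times each gradient.
            theta0 -= alpha * grad0
            theta1 -= alpha * grad1

            # Convergence test: stop once the error improves by less
            # than the threshold.
            error = sum((h(x) - y) ** 2 for x, y in zip(xs, ys))
            if abs(prev_error - error) < tol:
                break
            prev_error = error

        return theta0, theta1

For example, fitting points that lie on y = 2x + 1:

    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]
    theta0, theta1 = train(xs, ys)
    print(theta0, theta1)  # approximately 1.0 and 2.0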