In this video, we'll present another nonparametric regression method called smoothing splines. To motivate smoothing splines, let's consider the model form y_i equals f of x_i plus epsilon_i. This is the standard form that we've been looking at. If you remember back when we were studying normal linear regression, f of x_i was just beta naught plus beta one times x_i; we specified the form to be that. But in nonparametric regression, we said that the form of f was not specified and was to be estimated from the data. This gave us more flexibility. It said that we don't have to assume a line; we can have something nonlinear that we learn from the data.

If we think back to the analogy of normal linear regression, there we minimized the residual sum of squares, which, if you remember, was the sum of the squared differences between the y_i's and the linear form, that is, between y_i and f of x_i. Sometimes that's also called the MSE, for mean squared error. Now, the problem with this approach in the case where you don't specify the form of f is that we're estimating as many parameters as data points. In that case, the solution would just be f-hat of x_i equal to the data point y_i. That means you would actually be able to take this MSE and set it equal to zero, by setting every f of x_i equal to y_i. That wouldn't be great, because then we would just be interpolating and overfitting the data and not picking up on a mean trend.

There's always this balance between overfitting, which we don't want because we don't want to explain the error term (it's not systematic; it can't be explained), and underfitting, which would be something like putting a line through data that has a lot of curvature in it; that would be problematic too. Our smoothing methods give us a way to navigate this balance. We saw that with kernel estimation. Another method is smoothing splines.

Instead of just minimizing the MSE, which on this slide is the term on the left, we can minimize the sum of the MSE and another term that involves an integral of the squared second derivative. We're going to explore in just a minute what this might mean. It's often called a roughness penalty, because it penalizes functions for being too wiggly or too rough. Notice that the integral term has a multiplier lambda out in front of it. Lambda is a nonnegative real number, and it's a smoothing parameter: it controls how much the roughness penalty matters in this minimization problem.

The first term in the objective controls the fit, and that's the MSE; the smaller that term is, the closer the fit will be to the data. You can try to make that term very small by choosing an f whose values f of x_i are very close to your y's. The second term in the objective controls how smooth or rough the chosen function will be: it penalizes functions that have more curvature. Think about why this is the case. If f of x were linear, then the second derivative would be zero for all x, and the integral would also be zero. That's great for minimization, but we typically don't want to work with a line; otherwise we'd use parametric regression. The point of this is not to work with lines, but to work with something that has some curvature, just not too much. That means we want this smoothness integral term to actually be positive and not just zero.
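To make the objective described above concrete, here is a reconstruction of the penalized criterion in standard notation, under the usual smoothing-spline setup: the first term is the fit (residual sum of squares) and the second is the roughness penalty weighted by lambda.

```latex
% Penalized least squares criterion defining the smoothing spline:
% fit term (residual sum of squares) + lambda * roughness penalty.
\[
  \hat{f} \;=\; \arg\min_{f}\;
  \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2
  \;+\; \lambda \int \bigl(f''(x)\bigr)^2 \, dx,
  \qquad \lambda \ge 0 .
\]
```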
But what ensures that this term will be positive but small? Well, a squared second derivative that is small on average over the region of integration would make this integral small. If you think back to calculus, that translates to a function that has low curvature on average. The goal here is really for us to see that if we minimize this objective, then to keep the fit term small we need a function that fits the data well, and to keep the smoothness term small we need a function that has low curvature on average.

Now, the lambda term is the balance. We don't want low curvature to be the only thing that matters; we have to weigh the fit against the smoothness, and lambda controls that trade-off. Of course, if lambda were equal to zero, then this would reduce to just the MSE solution that we mentioned on the previous slide, and that solution would just be interpolation. If we have some positive lambda, the second term helps with the smoothing. Then if we think about lambda going to infinity, the penalty dominates and forces the second derivative toward zero everywhere, so the solution converges to a straight line, the least squares fit.

We won't go through the details of minimizing this objective, but it turns out that the solution to this minimization problem takes on a particular form called a cubic smoothing spline. In order to understand that, let's consider a few definitions. A spline is a piecewise function where each segment is a polynomial. Splines are often used for interpolation: they give us a mathematically rigorous way to interpolate through some data points. But they don't have to be interpolations, and in the case of a smoothing spline, they're not. A cubic spline is just a spline where the polynomial segments are each of degree three; they're each cubic.

Take a look at the plot on the right. Here you'll see some data points, and the splines in this plot are doing interpolation, which is ultimately not our goal, but our goal right now is just to understand what a spline is. To interpolate through each of these data points, we see two different splines. One is a linear spline, which just means that you connect the points, each pair with a line; each line has two parameters, an intercept and a slope, and for each one of those lines you have to figure out what that intercept and slope should be, a relatively easy problem. The green dashed line is a cubic spline, and the cubic spline fits a cubic polynomial between each pair of data points. One of the nice things about splines is that they are defined to be continuous and to have continuous derivatives.

A smoothing spline is a spline designed to balance goodness of fit with smoothness, and the resulting function, the smoothing spline, is defined in terms of the minimization of the objective from two slides ago. In this figure, the dotted line represents a cubic spline and the solid line represents a cubic smoothing spline. Notice that with the smoothing spline there is less movement up and down, which is due to the penalty placed on the average squared second derivative. The regular cubic spline is more wiggly than the cubic smoothing spline, and that's by design: smoothing splines are designed to balance fit with smoothness. Notice that the cubic spline fits exactly, since it interpolates the data; its fit is exact. The smoothing spline's fit is not exact, it doesn't run through all the data points, and that's because we've placed a constraint for some smoothness.
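To see the contrast between interpolating splines and a smoothing spline for yourself, here is a small R sketch. The data are simulated purely for illustration; the functions `approx`, `spline`, and `smooth.spline` are the standard ones from base R's stats package, used with default settings except where noted, so this is an illustrative sketch rather than the code behind the slide's figure.

```r
# Simulated data for illustration: a smooth trend plus noise
set.seed(1)
x <- sort(runif(25, 0, 10))
y <- sin(x) + rnorm(25, sd = 0.3)

lin_interp   <- approx(x, y, n = 200)        # linear spline: connect the dots
cubic_interp <- spline(x, y, n = 200)        # interpolating cubic spline
smooth_fit   <- smooth.spline(x, y)          # cubic smoothing spline (penalized fit)

plot(x, y, pch = 16, main = "Interpolating splines vs. a smoothing spline")
lines(lin_interp,   lty = 3)                 # dotted: linear interpolation
lines(cubic_interp, lty = 2, col = "green")  # dashed: cubic interpolation
lines(predict(smooth_fit, seq(0, 10, length.out = 200)),
      lwd = 2)                               # solid: smoothing spline, does not interpolate
```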
Now, in statistics, interpolation is typically not the goal. The goal is to come up with some smoother function that represents the average that could have generated the data you actually saw. So it hopefully seems reasonable that the smoothing spline is the thing we're interested in: it's something a bit smoother that represents the average that generated the data.

If we know the form of the solution to be a cubic smoothing spline, then the estimation problem is reduced to a parametric problem of estimating the coefficients of the piecewise polynomials that show up in the cubic smoothing spline. We won't go into detail about the mathematics of this estimation; instead, we'll implement smoothing splines in R. At the bottom of this slide I have one possible function that you can use in order to construct smoothing splines, and the two plots show different values of lambda; basically, different lambdas will give you different fits.

The plot on the left shows a value of lambda that is small. A small value of lambda, remember, if we go back, means you're placing low emphasis on smoothness; you're allowing for rougher fits. The first plot specifies what's called spar; spar is just a function of lambda in R, and for reasons that we'll skirt around, I specify a low spar, which is analogous to a low lambda. There I specify it to be 0.5. The second plot specifies spar equal to 1. That should tell you that a low spar, low lambda, means you get a more wiggly function, because you didn't penalize enough for roughness, and a higher spar, higher lambda, means you get a smoother function.

As with kernel estimates, there's some subjectivity here: you have to choose lambda, and often the choice of lambda is done through observation. Try some different lambdas and settle on one that you think is reasonable. But there are also automatic methods for choosing lambda, like cross-validation, and there are built-in methods in R that will do the cross-validation for you.
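The exact code from the slide isn't shown in the transcript, so here is a hedged R sketch of what the two plots might look like: `smooth.spline` with the two `spar` values mentioned above, plus a cross-validated fit. The simulated data are just for illustration.

```r
# Simulated data for illustration
set.seed(42)
x <- sort(runif(100, 0, 2 * pi))
y <- sin(x) + rnorm(100, sd = 0.4)

fit_rough  <- smooth.spline(x, y, spar = 0.5)  # low spar ~ low lambda: wigglier fit
fit_smooth <- smooth.spline(x, y, spar = 1)    # high spar ~ high lambda: smoother fit
fit_cv     <- smooth.spline(x, y, cv = TRUE)   # leave-one-out CV picks lambda automatically

par(mfrow = c(1, 2))
plot(x, y, pch = 16, main = "spar = 0.5")
lines(fit_rough, lwd = 2)
plot(x, y, pch = 16, main = "spar = 1")
lines(fit_smooth, lwd = 2)

# Inspect the smoothing parameter chosen by cross-validation
fit_cv$spar
fit_cv$lambda
```

By default `smooth.spline` uses generalized cross-validation; setting `cv = TRUE` switches to ordinary leave-one-out cross-validation, which is one of the built-in automatic choices of lambda mentioned above.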