In this video, we'll motivate the use of kernel estimators with data that exhibit non-linear trends. In particular, we'll show why normal linear regression, even with polynomial terms, is not really an optimal modeling choice, and why a non-parametric method like kernel estimation might be better.

Let's take a look at this bone density data; we'll use it to show some of the failings of normal linear regression. These bone data are in the ElemStatLearn (Elements of Statistical Learning) package in R. The data set has 485 observations on four variables, but we're pulling out just two for simplicity: our response is a relative spinal bone mineral density measurement, and our predictor is age. You can tell from the plot that there doesn't really appear to be a linear relationship; there's a lot of noise in the data, but the underlying relationship seems to be non-linear.

Perhaps a better way to diagnose non-linearity is to look at the residuals versus the fitted values of the linear regression model. I have that plot here, and I think two things are relatively clear. First, there's some evidence of non-constant variance: we have smaller variance here and much larger variance here. Second, there's structure in the residuals, which suggests there is non-linear structure in the data that the model isn't capturing. That's problematic, so we shouldn't use the simple linear regression model; we need to figure out something else.

What might that something else be? We could try adding polynomial terms to the model. We might say that our response Y_i is equal to an intercept plus a slope times our predictor, plus another slope times that same predictor squared, and so on: Y_i = β_0 + β_1 x_i + β_2 x_i^2 + ... + β_D x_i^D + ε_i. Notice that all of these terms involve the same predictor, just raised to higher and higher powers, and D is the degree of the polynomial we'd like to fit.

Of course, the natural question is: how do we choose D? There are some heuristics that have been offered, and I'll mention two. One is to start with D equal to one and add terms until they're no longer statistically significant. We start with D equal to one; if we have a statistically significant simple linear regression model, we bump D up to two and add the squared term; if that's significant, we add a cubic term, and so on, until we reach a term that's not statistically significant. The other option is to start with a large D and eliminate terms that are not statistically significant, starting with the highest-order terms. Just note that it's usually a bad idea to eliminate, say, an x squared term while keeping an x cubed term; if you start with D equal to five, for example, you should eliminate the degree-five term if it's not statistically significant, rather than eliminating a lower-order term first.

One problem with these heuristics is that they don't yield consistent results. In this example, the first method would suggest we leave D equal to one, so that's just the simple linear model, and of course that's not right. The second method suggests a degree-four polynomial; that's the fit I have here. That doesn't seem too bad, and it does capture some of the curvature in the data. However, the residuals from this model, the degree-four polynomial, still show some signs of misfit. Worse, suppose we had two predictors, which is suggested by the data set: if you take a look at the full data set, there are several predictors.
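To make this concrete, here is a minimal R sketch of the two fits discussed above. It assumes the bone data from the ElemStatLearn package, with the predictor stored as age and the response as spnbmd; those column names are an assumption, so check names(bone) in your installation.

```r
# Minimal sketch, assuming ElemStatLearn is installed and its bone data
# uses the columns `age` (predictor) and `spnbmd` (response).
library(ElemStatLearn)

fit1 <- lm(spnbmd ~ age, data = bone)            # simple linear regression
fit4 <- lm(spnbmd ~ poly(age, 4), data = bone)   # degree-4 polynomial

# Residuals vs. fitted values: look for non-constant variance and
# leftover structure that the linear fit fails to capture.
plot(fitted(fit1), resid(fit1),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Simple linear fit")
abline(h = 0, lty = 2)

plot(fitted(fit4), resid(fit4),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Degree-4 polynomial fit")
abline(h = 0, lty = 2)
```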
What if we had to include more predictors? We would need to decide on D_1, the degree of the polynomial term associated with predictor one, D_2 for predictor two, and so on. That process would become really messy: how would we think about eliminating different terms when we have many possible predictors? It would be great to have an automated way to decide on the form of f(x), rather than just picking polynomials that seem right; it would be nice to let the data show us what f is. Kernel smoothing is one non-parametric method for trying to do that. Kernel smoothers are non-parametric methods for choosing the non-linear structure that best fits the data. Here, I give a plot of the same data with a kernel estimator superimposed on top. This looks pretty similar to the degree-four polynomial, but we didn't have to choose the degree of a polynomial. We do still have to make some choices, though, and there are some downsides to kernel estimators. In the next video, we'll look at the mathematics behind kernel estimators and think about some of the trade-offs that are involved.
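As a rough illustration of what a kernel estimator like this might look like in code, here is a minimal sketch using base R's ksmooth(), which computes a Nadaraya-Watson kernel regression estimate. The bandwidth value is purely illustrative (not the one behind the plot in the video), and the column names are again assumed.

```r
# Minimal sketch of a kernel smoother on the bone data (column names assumed).
library(ElemStatLearn)

plot(bone$age, bone$spnbmd,
     xlab = "Age", ylab = "Relative spinal bone mineral density")

# Nadaraya-Watson kernel regression with a Gaussian kernel;
# bandwidth = 3 is an illustrative choice, not a tuned value.
fit_ks <- ksmooth(bone$age, bone$spnbmd, kernel = "normal", bandwidth = 3)
lines(fit_ks, lwd = 2)
```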