In this video, we'll describe some of the underlying mathematics of basic kernel estimation, including the definitions of a kernel and the bandwidth. These are two important features that show up in kernel estimators, and we'll make some recommendations about how to choose each one, since each poses a choice in our model.

A kernel estimator can be understood as a simple weighted local moving average of the response. To better understand that, let's consider a relatively simple model: the response is equal to some unknown function of a predictor x plus some noise, y_i = f(x_i) + epsilon_i, and we typically assume the epsilon_i are normally distributed with mean zero and some constant variance for all i. The function f is not specified and is to be estimated from the data.

In kernel estimation, our estimator of f evaluated at any point x is this f-hat down here. It looks somewhat complicated, so let's unpack it a bit. Notice that it is really just a weighted average of the response; we'll define the individual terms in the weight in just a bit. Let me highlight the lambda term and this K, which is a function of (x - x_i)/lambda; these highlighted terms are basically the weight. If we define w_i = (1/lambda) K((x - x_i)/lambda), we can think of w_i as the weight in a weighted average. The numerator is (1/n) times the sum of w_i y_i, where the w_i are the weights, and the denominator is simply the sum of those same weights. Dividing by the sum of the weights ensures that the weights on the y_i sum to 1, and that makes it a true weighted average.

Now let's define the terms that make up this w; at this point, lambda and K are still undefined, so let's be more clear about what they are. K is a kernel, which means it is a function satisfying three conditions. First, it is non-negative: K(x) >= 0 for all x in its domain. Second, it is symmetric: K(x) = K(-x) for all values of x. Third, it integrates to 1, so it is normalized; that integral is taken over the whole support of K. We'll see different examples of kernels, and in each case the kernel is defined over a certain domain, or support, and integrating over that domain gives 1. It's nice to notice that two of these properties, non-negativity and normalization, are defining features of PDFs, so we can, and will, use PDFs as kernels. For example, we'll look at the normal PDF as a kernel.

Here are some commonly used kernels. The uniform kernel is just the continuous uniform PDF on (-1, 1); that's one possible choice of kernel. A second choice is the standard normal PDF. One thing to notice about this one is that, in theory at least, it weights every data point when computing our estimate of f: the support of the normal distribution is the whole real line, so technically this is not a local weighted average, and it can be a bit less efficient because it gives some weight, however small, to all points. But the tails decay quickly to zero, so it assigns very low weight to points far from the x in question. A third option is the Epanechnikov kernel; like the uniform kernel, it is supported on [-1, 1], so it only weights points relatively close to x.
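For anyone following along in R, here is a minimal sketch of the estimator just described, written directly from the weighted-average form, with the normal and Epanechnikov kernels from the list above. The function names (k_normal, k_epanechnikov, nw_estimate) and the simulated data are illustrative choices of mine, not from the lecture or any package.

```r
# Normal kernel: the standard normal PDF (non-negative, symmetric, integrates to 1)
k_normal <- function(u) dnorm(u)

# Epanechnikov kernel: supported on [-1, 1]
k_epanechnikov <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)

# Kernel estimate of f at a single point x0:
# weights w_i = (1/lambda) * K((x0 - x_i)/lambda), then a weighted average of the y_i
nw_estimate <- function(x0, x, y, lambda, kernel = k_normal) {
  w <- kernel((x0 - x) / lambda) / lambda
  sum(w * y) / sum(w)
}

# Simulated example: y_i = f(x_i) + noise
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

grid  <- seq(0, 10, length.out = 200)
f_hat <- sapply(grid, nw_estimate, x = x, y = y, lambda = 0.5)

plot(x, y, col = "grey70", pch = 16)
lines(grid, f_hat, lwd = 2)
```

Note the difference between the two kernels in this sketch: with the normal kernel every observation gets a small positive weight, so the denominator is always positive, while with the Epanechnikov kernel only observations within lambda of x0 contribute, so a very small lambda can leave the denominator at zero.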
How close is close enough depends on the lambda that shows up in the kernel estimator, which we'll define in just a minute. The Epanechnikov kernel can be shown to be optimal, but with that said, the estimation procedure in kernel estimation is typically not very sensitive to the choice of kernel. In the examples we go through, I'll just use the normal kernel; it's convenient for computation, and the efficiency you lose relative to the Epanechnikov kernel is small enough that it won't matter much.

The fit is much more sensitive to the bandwidth, which is this lambda parameter. Remember, if we go back a few slides, lambda showed up in the kernel estimator, and the fit is much more sensitive to lambda than to the actual kernel. Lambda is called the bandwidth; it is also called the window width or the smoothing parameter, and I may use smoothing parameter as often as bandwidth. The bandwidth controls the smoothness of the fitted curve: in general, smaller lambda values give bumpier fits and larger lambda values give smoother fits.

I have a plot here that shows a few different values of lambda, each fit with a normal kernel: the black curve uses lambda = 0.1, the gray curve lambda = 2, and the gold curve lambda = 10. Notice that when lambda is 0.1, relatively small, we get a very bumpy curve, something that shoots up and down and is very sensitive in its fit. The light gray curve is much smoother and actually looks like a more reasonable model, and the gold curve is probably a little too smooth; it is almost a straight-line fit, which is what we're trying to avoid. So if lambda is too small, the estimator will be too rough, and if it is too large, important features will be smoothed out. The plot shows the same normal kernel with three different smoothing parameters precisely to make the point that the fit is very sensitive to the choice of lambda.

This raises the question of how we're supposed to choose the bandwidth, and there are different recommendations. Some authors, like Faraway, argue that rougher fits are less plausible: the black fit would be less plausible, since we would not expect the average response to vary so much as a function of the predictor, in this case age. On the other hand, oversmoothing fails to capture systematic variability. We have a trade-off and need to balance the two. Faraway recommends choosing the least smooth fit that does not show any implausible fluctuations. That seems a little ambiguous, possibly a bit subjective, and he admits as much in the textbook. But it gives us the sense that the gray curve here would be best: the black curve is too rough, the gold curve is too smooth, and the gray curve sits right in the middle.

A few further notes on the bandwidth. Faraway claims that knowledge about what the true relationship might look like can be readily employed: if we have some knowledge about the true relationship, we should use it. The problem I see with this is that part of the argument for using nonparametric regression is that we don't know what the true relationship is; if we did, parametric regression would be more efficient. So I don't see this advice as being especially helpful. Still, if you do have some sense of the curvature, perhaps just from exploring the data, you can use it to guide what your lambda should be.
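To make the bandwidth sensitivity concrete, here is a quick sketch of the kind of comparison shown in that plot, using base R's ksmooth with a normal kernel and the bandwidths 0.1, 2, and 10. The simulated data here are an assumption on my part; they stand in for the lecture's data, which aren't reproduced.

```r
# Simulate a response with a smooth underlying trend plus noise
set.seed(42)
x <- runif(300, 0, 20)
y <- sin(x / 3) + rnorm(300, sd = 0.3)

# Same normal kernel, three different bandwidths
plot(x, y, col = "grey80", pch = 16)
lines(ksmooth(x, y, kernel = "normal", bandwidth = 0.1), col = "black",  lwd = 2)  # too rough
lines(ksmooth(x, y, kernel = "normal", bandwidth = 2),   col = "grey50", lwd = 2)  # about right
lines(ksmooth(x, y, kernel = "normal", bandwidth = 10),  col = "gold",   lwd = 2)  # too smooth
```

Running this, you should see something like the pattern described above: the bandwidth-0.1 curve chases individual points, the bandwidth-2 curve tracks the underlying shape, and the bandwidth-10 curve flattens toward a nearly straight line.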
Another note: if f-hat will be used to make predictions of future values, the choice of lambda is consequential. Faraway goes on to defend the subjective approach to choosing lambda. To quote him: "If the method of selecting the amount of smoothing seems disturbingly subjective, we should also understand that selecting a family of parametric models, for example a standard normal regression model, for the same data would also involve a great deal of subjective choice, although this choice is often not explicitly recognized. Statistical modeling requires us to use our knowledge of what general forms of a relationship might be reasonable. It is not possible to determine these forms from data in an entirely objective manner. Whichever methodology you use, some subjective decisions will be necessary." Here, Faraway is claiming that parametric methods might seem objective, but they are subjective in the sense that we impose certain assumptions on the data, like normality, constant variance, and so on. Some of those choices are subjective; they depend on the modeler working with the data. The same is true, he is saying, of nonparametric regression and the choice of lambda. All that to say, we should admit that some subjective decisions are being made, but do our best to make those decisions so that they best capture the trends in the data.

Two final things to note. First, when you work with simulations and data in R, you can use the ksmooth function that I have down here to compute these smooths. It takes in x and y values and asks for the type of kernel, for which I typically use the normal kernel, though the uniform box kernel is also available, and for the bandwidth.

Second, after our discussion of the subjectivity involved in choosing lambda, it's important to note that there are some automatic methods for selecting it. One of those is cross-validation, which attempts to select lambda automatically. But sometimes the cross-validation procedure doesn't give you a plausible lambda; it might give you a fit that is much too bumpy to be plausible. So keep in mind that you have to make some decisions in choosing the bandwidth. There are automatic procedures out there, but they can't be relied on wholesale.
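As one concrete version of the cross-validation idea just mentioned, here is a sketch of a simple leave-one-out procedure over a grid of candidate bandwidths, again using ksmooth. The helper name loocv_score, the candidate grid, and the simulated data are illustrative assumptions of mine, not a built-in feature of ksmooth.

```r
# Simulated data standing in for a real predictor/response pair
set.seed(7)
x <- runif(200, 0, 20)
y <- sin(x / 3) + rnorm(200, sd = 0.3)

# Leave-one-out cross-validation score for a given bandwidth h:
# drop observation i, smooth the rest, predict at x_i, and average the squared errors
loocv_score <- function(h, x, y) {
  errs <- sapply(seq_along(x), function(i) {
    fit <- ksmooth(x[-i], y[-i], kernel = "normal",
                   bandwidth = h, x.points = x[i])
    y[i] - fit$y
  })
  mean(errs^2, na.rm = TRUE)  # NA can occur if no neighbors fall near x_i
}

h_grid <- seq(0.2, 5, by = 0.2)          # candidate bandwidths
scores <- sapply(h_grid, loocv_score, x = x, y = y)
h_cv   <- h_grid[which.min(scores)]
h_cv                                      # bandwidth suggested by cross-validation
```

Consistent with the caution above, the bandwidth that minimizes this score can still produce a bumpier fit than you would consider plausible, so it is worth plotting the resulting smooth before settling on it.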