In this video, we'll briefly discuss local polynomial regression, and in particular we'll look at a method called loess, or the locally estimated scatterplot smoothing method. This method combines parametric linear regression techniques with the local fitting ideas that we learned about in kernel estimation. Let's think again about our general structure for fitting a statistical model. We often assume that the relationship between the response y_i and the predictor (or set of predictors) x_i looks like the following: y_i is equal to some function of x_i plus some epsilon term, which is thought to be some sort of measurement error. We typically make some assumptions on that epsilon term: that it's normally distributed around zero, maybe with constant variance, and that the errors are independent. Now, f(x) is the true but unknown mean function that we think generated the data. To learn f, we might be interested in minimizing some objective function. We have one example of an objective function here. We could call it the MSE, for the mean squared error. We could also call it, as we did in the regression context, the residual sum of squares, or some function of the residual sum of squares; here we're dividing by n. But basically what we have here is the error between the i-th y measurement and the function evaluated at x_i. We square that and we sum it up, and if we want the MSE we divide by the number of data points that we have. This is supposed to give us some kind of metric of how well the model fits. Now of course, without further assumptions or constraints, this problem is trivial: it yields the solution that f(x_i) is equal to y_i, which would make this whole expression equal to zero if we set f(x_i) equal to y_i for all i from one up to n. That's not helpful in the statistical context because it's just an interpolation. It doesn't tell us any new information; it's modeling the noise.
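For reference, the model and objective described above can be written out as follows (this is just the standard formulation, with n data points):

```latex
y_i = f(x_i) + \epsilon_i, \qquad \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\qquad
\mathrm{MSE}(f) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2 .
```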
But also, even if we rule out that interpolation, the problem is intractable, because there are infinitely many other possible solutions that fit the data, either passing through every point or doing some sort of smoothing, and we don't have any guidance on what that smoothing should look like. Basically, we need to put some other constraints on f in order to estimate it. What sorts of constraints have we thought about putting on f in other contexts? Well, in the linear regression context, we simply set f(x) equal to a linear function. I'll just write down the simple linear regression function: beta_0 plus beta_1 times x_i. That assumption is pretty strong: we said that the function is actually linear, and what we'll do is estimate the intercept and the slope of that linear function. In the multivariate case, you're estimating several slopes and you're fitting a plane or a hyperplane. But the same idea holds: you're severely restricting the number of possible fits that could be an estimator of the true f. Those restrictions are basically equivalent to putting certain assumptions on f: an assumption of linearity, a constraint of linearity, and so on. Then once we've placed that constraint, we have some nice tools available to us, like linear least squares, which in some cases is equivalent to maximum likelihood estimation. We now have nice ways of estimating f. But of course, as we see in this plot here, some data are not linear, so if we fit a linear model we clearly have some bias; we clearly have a model that does not fit properly. In this course, we've started to look at how to avoid the linearity assumption so that we can have a more flexible fit. We've looked at smoothing splines, for example, and there we took the MSE but added a penalty term: lambda times the integral of the squared second derivative of f over x.
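Written out, the penalized smoothing-spline criterion mentioned above is:

```latex
\sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2 + \lambda \int \big( f''(x) \big)^2 \, dx .
```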
That posed a constraint: it required our function to be something that's at least twice differentiable. But it allowed us to be much more flexible in our fits; it allowed us to estimate f as something nonlinear. Another option is the following: we could replace the f(x) in our model with a Taylor expansion of f(x) around a particular point, say x_0. We have that equation here. The Taylor expansion says that f(x), the function that we care about, is approximately equal to a polynomial: we take f and evaluate it at x_0, plus the first derivative evaluated at x_0 over, in this case, one factorial, times x minus x_0, plus the second derivative over two factorial times x minus x_0 squared, and so on, up to some degree-p term. Now let's think about why this is helpful. Remember, f is some arbitrary function, and the Taylor expansion tells us that close to x_0, f(x) is approximately equal to a polynomial with particular coefficients. We can really think of these terms as betas: the first term is our beta_0, the constant term; the coefficient of the next term is a beta_1; and so on, up to a beta_p. So we can interpret this as a polynomial of degree p with coefficients that I'm now calling betas. That's nice, because it puts us back into the linear regression context. Of course, it's not linear in terms of x, but it is linear in terms of the coefficients, the parameters of the model. That's really nice, because we can use linear least squares or maximum likelihood estimation in a pretty straightforward way. That's a really interesting move: we went from an arbitrary f to approximating f, at least around a particular point x_0, using a polynomial. We can obtain estimates of each of the coefficient terms, where the jth coefficient is just equal to the jth derivative evaluated at x_0 divided by j factorial. We can estimate those terms using ordinary least squares based on this equation here.
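To make the correspondence concrete, the Taylor expansion and its reading as a degree-p polynomial are:

```latex
f(x) \approx f(x_0) + \frac{f'(x_0)}{1!}(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(p)}(x_0)}{p!}(x - x_0)^p
= \sum_{j=0}^{p} \beta_j (x - x_0)^j, \qquad \beta_j = \frac{f^{(j)}(x_0)}{j!} .
```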
We really just have a linear least squares problem, and it turns out to be polynomial regression. This is almost exactly what happens when we do a loess fit, but typically p is equal to one in such cases. We do a pretty rough approximation of f: we just do a linear approximation. Sometimes there's a quadratic approximation, depending on the algorithm, but usually it's a pretty low-order approximation. The idea is that we really restrict the window around x_0, so we only take relatively few points in a very small window, use those points to estimate the linear fit close to them, and then move across the whole x domain. It might be interesting to note that if p is equal to zero, so you just have a constant term, then the loess method is equivalent to the kernel estimator that we discussed in an earlier lesson. As I've hinted at already, we would really only be estimating the fit around x_0, and that's only really good for x points near x_0. We can only trust this approximation pretty close to the point x_0. But our minimization problem, the way we've written it here, uses information from all points. It turns out that we can fix this problem, and use only data points close to x_0, by modifying our MSE, using the Taylor expansion here, to weight points close to x_0 more heavily than points far away from x_0. That idea is shown here: I have the MSE using the Taylor expansion and using a weight function. The important thing is for us to describe this weight function. What it really does is direct the MSE to only care about points very close to x_0 and not care about points that are far away from x_0. There are different possible weight functions that you can use; a standard one is the tricube weight: w(x) is equal to one minus the absolute value of x cubed, all cubed, if the absolute value of x is less than one, and zero if the absolute value of x is greater than or equal to one.
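As a quick illustration of the tricube weight just described (the course uses R, but here is a minimal Python sketch of the formula; the function name `tricube` is mine):

```python
import numpy as np

def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    u = np.abs(np.asarray(u, dtype=float))
    return np.where(u < 1, (1 - u**3) ** 3, 0.0)

# Points at the center of the window get weight 1;
# points at or beyond the window edge get weight 0.
print(tricube(0.0))   # 1.0
print(tricube(1.0))   # 0.0
```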
Now, of course, notice that what we plug into this function in the objective, in that MSE, is not just x but x minus x_0 (in practice scaled by a bandwidth, so the weights fall to zero at the edge of the window). I just wrote it in terms of x to illustrate what the function is. This weight function allows us to take into account only information close to x_0, which makes this Taylor expansion approximation pretty good. Luckily, R has easy ways of fitting the loess model to data, and we'll take a look at these fitting methods in R in the next lesson. I wanted to mention a few advantages and a few disadvantages of the loess method. We can see some of them here. First, some advantages. The really nice thing is that the loess fit does not require a pre-specified functional form, for example a line as we use in standard linear regression. That means loess provides a flexible fit that can account for nonlinear trends in the data without having to specify the nonlinear trend before observing the data. Of course, if you knew that your data had some particular nonlinear trend, you might want to specify it. But in contexts where you don't know what the nonlinear trend might look like, loess is pretty good at learning what the trend might be. Another advantage is that the method is pretty simple to implement: one simply fits low-degree weighted polynomial regressions in small windows around data points. That makes it easy to understand, although some of the numerical techniques in the R functions that we'll look at can be a bit more complicated; I'll point you towards some of that literature. But basically, what I want to say here is that it's pretty simple to understand, in theory, what the implementation is. Another great advantage of the loess model is that it relies on the theory for weighted regression, and thus it's possible to quantify uncertainties in the model in much the same way that you would for linear regression.
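To make the "small windows around data points" idea concrete, here is a minimal, non-robust Python sketch of the local linear (p = 1) weighted fit at a single point. The helper names `tricube` and `loess_fit` are mine, and R's loess adds robustness iterations and other refinements that this sketch omits:

```python
import numpy as np

def tricube(u):
    u = np.abs(u)
    return np.where(u < 1, (1 - u**3) ** 3, 0.0)

def loess_fit(x, y, x0, span=0.5):
    """Local linear (p = 1) weighted fit at a single point x0.

    Window: the fraction `span` of points nearest to x0 (a simplified
    version of how R's loess chooses its neighborhood).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    k = max(2, int(np.ceil(span * len(x))))
    d = np.abs(x - x0)
    h = np.sort(d)[k - 1]              # bandwidth = distance to k-th nearest point
    w = tricube(d / h) if h > 0 else np.ones_like(d)
    X = np.column_stack([np.ones_like(x), x - x0])   # design matrix for p = 1
    sw = np.sqrt(w)                    # weighted least squares via sqrt-weights
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]                     # beta_0 estimates f(x0)

# On exactly linear data the local linear fit recovers the line.
x = np.linspace(0, 1, 50)
y = 2 * x + 1
print(round(loess_fit(x, y, 0.5), 6))   # 2.0
```

Sweeping x0 across a grid of points and collecting the fitted values traces out the full loess curve.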
If you want to create confidence intervals for the mean value of the response, or a prediction interval for a new value of the response given some predictor value, and you want to use the loess model, you can use the predict function in R; we'll illustrate how to do that in the next lesson. It's nice because there's some theory here, that theory is worked into the predict function, and you can pretty easily make predictions using the loess fit. The predictions can come with confidence intervals, and there are ways to get prediction intervals too. I'll give you some resources so that you can, on your own, understand the mathematical and statistical details of how the standard errors, confidence bands, and so on are calculated; you'll see those resources on the platform. There are a few disadvantages. One is that it can be computationally expensive to fit the loess model compared to, say, a linear regression model. Another disadvantage is that it's not as easy to interpret as standard linear regression. Remember, standard linear regression has a nice interpretation in terms of a one-unit increase in a predictor value; we don't get that directly from a loess fit, although we could potentially come up with other interpretations. The last disadvantage is that it typically requires a pretty large and densely sampled data set in order to produce a good model. If you have a sparse data set, that can make it hard to get a nice fit. In the next lesson, we'll take a look at how to fit and interpret a loess regression in R.