In this section, we'll give a brief treatment of the linearity assumption underlying multiple linear regression, and contrast that with what the linearity assumption was for simple linear regression. We'll also talk about estimating the amount of variability explained by multiple predictors. And finally, we'll talk about some strategies for choosing a quote-unquote final regression model depending on the goals of the research. So upon completion of this lecture section you will be able to speak about the linearity assumption related to continuous predictors in multiple regression and have an idea of how to assess it visually. You should expect to see comments about investigating this linearity assumption in articles where continuous predictors are used in simple and multiple regression models. You should be thinking about this if you read an article where linear regression is the main analysis tool. You will also be able to discuss potential strategies for choosing the final multiple linear regression model among candidates with different predictor sets. And we'll explain some shortcomings of R squared for assessing the prediction of a model. So let's just talk about the linearity assumption in simple linear regression, and then segue into it for multiple linear regression, by using an example we're well familiar with: predictors of arm circumference. And let's just refresh our memory about assessing this assumption in simple linear regression. So we were looking at the relationship between arm circumference and height for data on anthropometric measures from a random sample of 150 Nepalese children 0-12 months old. When we first established the idea of linear regression with a continuous predictor, we said we can treat height as continuous if it makes sense, that is, if there's evidence that the relationship between arm circumference and height is relatively linear.
And we said a useful visual display for assessing the nature of the association between two continuous variables is a scatterplot. So here is a simple scatterplot of the unadjusted association between arm circumference and height; we looked at this in the first lecture we did in this term of the course. And this is the same scatterplot with the resulting regression line superimposed on it. So we can see that, of course, no fit is going to be perfect; the estimated mean tends to overestimate the individual values on the lower end. But generally speaking, the line splits the points down the middle, and looks like it tracks with the center of the arm circumference measures as a function of height. So linearity seems to be a pretty reasonable assumption to exploit here to get this estimated association. So now suppose we fit the multiple linear regression model we've looked at several times that includes not just height, but also weight as a continuous predictor and age as categorical. We looked at some other models as well, and we could give them the same treatment, but we'll just use this one as an example. In the resulting regression model with height and weight both continuous, we got a slope for height of -0.09 and a slope for continuous weight of 1.32. But the linearity assumption between arm circumference and height is now a little more complicated than when height was the only predictor in a simple model. Now, the assumption regarding height in this model is that the relationship between arm circumference and height is linear after adjusting for weight and age. So how are we going to get a sense of that? How are we going to assess whether the relationship between arm circumference and height is linear after adjustment? Well, we can look at that simple scatterplot between arm circumference and height that we started with.
But that doesn't take into account weight or age, so it's not going to give us a picture of what we're looking for. So another option, and this is actually something unique to linear regression, a luxury we'll not have when we get into other types of regression, is to create something called an adjusted scatterplot. In this case, if we wanted to assess whether the relationship between arm circumference and height is linear in nature after adjusting for weight and age, we could create a graphic which shows the relationship between arm circumference and height where both have been adjusted for weight and age. In other words, it plots the variability in arm circumference not explained by weight and age versus the variability in height not already explained by weight and age. So it looks to see what the nature of the relationship between arm circumference and height is, and whether there is any relationship left over after we've explained all we can about both of them with the other variables. So, here's what it looks like when we get the computer to give this to us. If anybody wants to talk about the mechanics of this, I am certainly happy to do so; where this comes from is beyond what we'll cover in the class, but I'm happy to discuss it for those interested. Here is the adjusted variable plot for height: I asked the computer to give me the adjusted variable plot for height after adjusting for weight and age. Generally, any computer package you use will present what is called an adjusted scatterplot, and what it plots are actually residuals: residuals of arm circumference regressed on weight and age, against residuals of height regressed on weight and age. So the part of the variability in arm circumference that wasn't explained by weight and age, against the part of the variability in height that wasn't explained by weight and age. So don't worry so much about the units, but look at this as a scatterplot.
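For those curious about the mechanics, the points of an adjusted variable plot come from two auxiliary regressions. Here is a minimal sketch of that computation; the function name and simulated data are my own illustration, not from the lecture or any particular statistical package.

```python
import numpy as np

def added_variable_points(y, x, others):
    """Points of an adjusted (added-) variable plot for predictor x:
    residuals of y regressed on the adjustment variables, paired with
    residuals of x regressed on those same adjustment variables.
    `others` holds the adjusters (e.g. weight and age indicators)."""
    n = len(y)
    Z = np.column_stack([np.ones(n), others])      # intercept + adjusters
    beta_y, *_ = np.linalg.lstsq(Z, y, rcond=None)
    beta_x, *_ = np.linalg.lstsq(Z, x, rcond=None)
    resid_y = y - Z @ beta_y   # variability in y not explained by adjusters
    resid_x = x - Z @ beta_x   # variability in x not explained by adjusters
    return resid_x, resid_y
```

The slope of the line fit through these residual pairs is exactly the multiple-regression slope for that predictor (the Frisch-Waugh-Lovell result), which is why the software can superimpose the adjusted regression line on the plot.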
Unfortunately, the computer tends to put the adjusted regression line on the picture. Sometimes that's good for guiding our eye; sometimes it can trick us into seeing linearity when it maybe isn't there. But nevertheless it will be part of the graphic. What we can see in this picture, I think, is pretty reasonable satisfaction of the linearity assumption between arm circumference and height: after adjusting for weight and age, it looks like this is a reasonable fit. And the slope of this line is in fact the slope of height from the multiple linear regression model that includes weight and age as predictors. Suppose instead we saw something that looked more like this when we did the adjusted variable plot, and this is kind of a ridiculous oversimplification, but the points had some kind of non-linear pattern and the best-fitting line was like this. We might say, well, we could fit a line to this association, but it wouldn't be optimal, and we might consider going back and refitting the model that included weight and age, but categorizing height to better capture that non-linearity. Another thing we could do to assess this empirically rather than visually is categorize height into arbitrary groupings like quartiles, and then look at the respective slopes for quartiles of increasing height, and see whether the changes between quartiles, first to second, second to third, and so on, are relatively similar in size and consistent in direction. That would be a way to empirically look at this linearity relationship in a model that's adjusting for other things. We'll look at that in the additional examples section. Just FYI, since weight was continuous as well, we want to have some sense of whether we should keep weight in as continuous, after adjusting for height and age. Here, from that same model that included height, weight, and age, is the adjusted variable plot for weight. This plots the part of arm circumference that wasn't explained by height and age against the part of weight that wasn't explained by height and age, similar to what we saw with the adjusted variable plot for height. And I think this provides relatively compelling evidence that the association, after adjustment, is roughly linear as well. So, let's talk about adjusted variable plots in general. Suppose one wishes to assess whether the adjusted relationship between an outcome y and a continuous predictor, which I'll generically call xi, is linear in nature for a multiple linear regression model of the form ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + … + β̂ₚxₚ. The adjusted variable plot for our continuous predictor xi, where i is somewhere between 1 and p, plots the variability in y not explained by all other xs in the regression model against the variability in xi not explained by all other xs in the model. We looked at just one example with results, but this plot can generally be constructed for any multiple linear regression model. The only reason I point this out here is that we won't focus on doing this so much in this class, and I won't ask you to generate these; you can't do them by hand anyway, but we may look at some of them in the context of an example. But I do want you to think about this, because sometimes persons who are not so familiar with the underlying structure of linear regression will go ahead and run models where they throw in xs that are continuous, without ever checking whether the linearity assumption they're invoking, especially in complex models with a lot of adjusters and other predictors, is well met. And so, if you read a paper where they're using a lot of continuous predictors and never comment on whether they assessed the linearity assumption, it always makes me a little nervous.
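The quartile-based empirical check described earlier, categorizing a continuous predictor and comparing adjusted means across the ordered groups, could be sketched like this. This is a hypothetical illustration under my own naming and simulated data; the course's additional examples section does this with real data and standard software.

```python
import numpy as np

def quartile_differences(y, x, adjusters):
    """Replace continuous x with quartile indicators (Q1 = reference),
    refit the linear model with the other adjusters included, and return
    the successive differences in adjusted means: Q2-Q1, Q3-Q2, Q4-Q3.
    Differences that are similar in size and consistent in direction
    support treating x as linear."""
    cuts = np.quantile(x, [0.25, 0.5, 0.75])
    grp = np.searchsorted(cuts, x)                       # quartile index 0..3
    dummies = np.column_stack([(grp == k).astype(float) for k in (1, 2, 3)])
    Z = np.column_stack([np.ones(len(y)), dummies, adjusters])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    coef = np.concatenate([[0.0], beta[1:4]])            # Q1 reference = 0
    return np.diff(coef)
```

Note the quartile cutpoints are arbitrary groupings, as the lecture says; any reasonable set of ordered categories would serve the same diagnostic purpose.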
So I just want you to think about that, whether you're part of a research team suggesting analytic techniques, or whether you're reacting to research that others have done. A lot of times when doing regression and analyzing data, a question comes up: with all these possibilities for predictor sets, how do we choose a quote, unquote, final linear regression model? In many cases, there is a desire to choose a final, best model. It is hard to define what a best model is, and it certainly depends on the goals of the research. So I'll just talk, very briefly, about some different approaches for different goals. If the goal is to maximize the precision of our adjusted estimates, that is, to estimate the associations between outcome and predictors taking into account several predictors at once in a multiple linear regression model, this approach might say keep only those predictors that are statistically significant in the final model. In other words, don't carry things that aren't adding information about the outcome above and beyond the other things in the model, because having to estimate that dead weight, so to speak, those things that aren't contributing, will compromise our precision and lead to inflated standard errors for the other predictors in the model. If the goal is to present results comparable to the results of similar analyses presented by other researchers, say a colleague had studied arm circumference and height in Nigeria and presented a final model that included height, weight, and age. Even if, for example, age was not a driving predictor of your results after accounting for height and weight, you may want to present results that have also been adjusted for age, so that your results are comparable in terms of what they mean scientifically. If the goal is to show what happens to magnitudes of association with different levels of adjustment, and when we first looked at arm circumference and its predictors, we did this.
We could present the results from several models that include different subsets or combinations of predictor variables. Maybe we keep our eye on the relationship between arm circumference and height as we adjust for different characteristics: let's look at sex here, let's look at age now, let's look at both sex and age, let's bring in weight, etc. Or maybe we look at what happens to systolic blood pressure and its relationship to ethnicity after we account for other things: first just demographic variables, then some other personal characteristics from the data. So sometimes there's knowledge to be found by looking at more than one multiple linear regression model, and comparing what happens to one or more estimates across the models. If the goal is prediction, we want the model that best predicts arm circumferences for children who were not used in creating the model, or best predicts systolic blood pressure for persons who were not part of the dataset, using some other characteristics. That's a slightly more complicated story, in fact a much more complicated story, but we will discuss briefly how to approach this for linear regression and what the issues are to consider, both specific to linear regression and general to prediction. So let's suppose we wanted to use one of these multiple linear regression models to relate arm circumference to predictors, so that we could predict average arm circumference for groups of children who were not used to fit the model. How well would this model predict for observations not used to fit the model? How can we measure that? Well, one thing we can do is use the R squared. As we said, that quantifies the percent of variability in the outcome explained, and we can extend this in multiple regression to the percent explained by all predictors in the model. But there's a catch to that. There's a flaw in R squared.
R squared increases with additional predictors regardless of whether they add information about the variability in your outcome. So if I wanted to make my model for predicting arm circumference look better, I could add junk to the model, things that have nothing to do with arm circumference. I could generate a column of random numbers, one random number per child, which would have essentially nothing to do with the outcome, and put that in. But the R squared would still go up in value. So we can make models look more predictive than they are by adding even extraneous or ancillary things to the model. There's an attempt to fix this: a quantity for multiple regression models that's reported side-by-side with the regular R squared, called the adjusted R squared. This is generally what should be reported instead of the original R squared. It recalibrates the R squared, and only increases it for additional predictors that are actually informative. So it's a more realistic picture of the overall predictive power of the model on this dataset, and it penalizes the original R squared for that flaw. If the added predictors tell us a lot more about the outcome above and beyond what's already in our model, then the adjusted R squared will look similar to the R squared. But if we add things that are not really telling us more about the variability above and beyond what else was in the model, then the adjusted R squared can come in at a lower value than the original R squared. The other problem is that R squared is usually reported without confidence limits, even though it's a sample-based estimate. So even though there's uncertainty in the R squared estimate, it's unfortunately usually reported as if it were the truth, even though there can be a fair amount of uncertainty in this measure. But more general than the flaws with R squared for assessing prediction in linear regression is just the concept of prediction itself. Prediction is complicated.
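To make that contrast concrete: the adjusted R squared is computed from the ordinary R squared by the standard formula R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the sample size and p the number of predictors, so it is never larger than the ordinary R squared. A small sketch, with my own helper names and simulated data rather than the course's software output:

```python
import numpy as np

def r_squared(y, X):
    """Ordinary R^2 from an OLS fit of y on X (intercept added)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

def adjusted_r_squared(r2, n, p):
    """Penalize R^2 for the number of predictors p: the junk-predictor fix."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Adding a column of random numbers to X can never decrease `r_squared`, which is exactly the flaw the lecture describes; `adjusted_r_squared` pulls the value back down according to how many predictors were spent earning it.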
For any type of regression, whether it be linear or other types, the measures of model prediction, so for linear regression the R squared or adjusted R squared, are overly optimistic when they are evaluated or computed using the same data that was used to fit the model. The R squared or adjusted R squared in the multiple linear regression case, evaluated from models fit with the same data, will tend to overestimate the predictive power of the model, because the model was built for those observations. These values can still be used to rank the relative prediction of competing models all fit to the same dataset. But if we want to estimate how much variation, say in arm circumference, is explained by these other characteristics in the general population of Nepalese children, we would be getting an overestimate if we took those R squared or adjusted R squared values that I reported from the models we saw in that table. So what can we do about this? Well, if prediction is really our goal, if the goal is both to choose a model because it has the best prediction among the estimated models and to evaluate how well it predicts for observations not used to fit the model, then cross-validation should be performed after the best model is chosen. This is a nice thing to do if there's enough data to do it. You've probably heard a lot about the science of machine learning; that's become a buzzword recently. And really, machine learning is about model-based, and sometimes non-model-based, prediction of an outcome given inputs. So a lot of machine learning involves regression models. But there's a step machine learning takes in evaluating how predictive the algorithm is, something called cross-validation. One way to do this is to split the original data you have randomly into two subsets; they can be of equal or nonequal sizes. One of them will be a training set and the other will be a validation or testing set.
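A minimal sketch of that split-then-evaluate idea for linear regression follows; the function name, split fraction, and simulated data are my own assumptions for illustration, not prescriptions from the lecture.

```python
import numpy as np

def holdout_r_squared(y, X, train_frac=0.7, seed=0):
    """Randomly split the rows, fit OLS on the training rows only,
    then compute R^2 from predictions on the held-out validation rows.
    Because the validation rows played no role in the fit, this R^2 is
    an honest estimate of predictive power on new observations."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    k = int(train_frac * len(y))
    train, valid = idx[:k], idx[k:]
    Ztr = np.column_stack([np.ones(len(train)), X[train]])
    beta, *_ = np.linalg.lstsq(Ztr, y[train], rcond=None)
    pred = np.column_stack([np.ones(len(valid)), X[valid]]) @ beta
    ss_res = np.sum((y[valid] - pred) ** 2)
    ss_tot = np.sum((y[valid] - y[valid].mean()) ** 2)
    return 1 - ss_res / ss_tot
```

With a strong linear signal, the holdout R squared will sit close to, but typically a bit below, the in-sample R squared, which is exactly the optimism the lecture warns about.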
And so we fit the different competing models using the training data, and, based on metrics for choosing among competing models fit to the same dataset, we would choose a final model and say this is probably the best model for predicting the outcome on observations not used to fit it. Then we can evaluate the prediction of this model based on how well it predicts for the validation set. So we could take that other sample of data that was not used to fit the model, use the model we fit with the training dataset to predict the outcome for the observations in our validation set, look at the variability of the individual outcomes around their predicted means, and compute an R squared based on that. And that would be a better measure of the generalized predictive power of this model on sets of data not used to fit the model. So, just something to think about. Prediction could fill a whole term on its own, so I'm just putting some concepts out here that are complicated. It would be fun if we had a third quarter of the course to focus on this, but I'm certainly throwing information at you about a complex process and just giving you the general idea. So in conclusion, the linearity assumption in multiple linear regression is with respect to continuous predictors. It's not an issue to be concerned with for binary and categorical predictors, and that is because whenever our predictor is binary or categorical, each slope compares exactly two groups: the difference in means between two groups, adjusted for the other predictors in the model. And since that is only a difference in two means and not a relationship across a continuum, it's by definition linear. Adjusted variable plots are a nice visual tool in multiple linear regression for visualizing an outcome/predictor association, adjusting for other predictors. We could also do this empirically.
And that empirical approach is what we'll have to use in all other types of regression, because we won't have the luxury of a visual tool. We could categorize the continuous predictor, refit the linear regression that includes the other predictors with the categorical version, and see if the differences between ordinal categories are consistent with what we'd expect for a roughly linear relationship. And again, we'll look at that example in the additional examples section of this lecture set. The strategy for choosing a final linear regression model depends on the goal of the research: whether you want to maximize the precision of the estimates of the adjusted associations, whether you want results that are comparable to the results of other researchers' work, or whether you want to predict with the model. R squared, the predictive measure for linear regression, is an imperfect measure of model prediction. The adjusted R squared corrects for one of its flaws, namely that R squared alone will always go up with additional predictors regardless of whether they're informative about the outcome above and beyond the other predictors in the model; the adjusted R squared does not do that. But in order to properly evaluate the predictive power of any regression model, not just linear regression, it is necessary to compute the prediction measure, for linear regression the R squared, from a model fit with one set of data and evaluated on data not used to fit that model. That will give a more honest assessment of the prediction. So these are some conceptual ideas I felt were needed to complete the ideas we've talked about. We can't explore these in as much depth as I'd like, given the term limitations of the course.