So, in this section, we'll talk about the basics of model estimation: we'll extend least squares from what we did in simple linear regression to multiple regression, and also talk about how to handle uncertainty in the resulting multiple regression estimates. So, again, we'll extend the concept of least squares to the estimation of multiple linear regression models, and compute 95 percent confidence intervals for the intercept and individual slopes. I hope that at the end of this, you'll understand how to perform a hypothesis test for individual slopes. You could mechanize it if you needed to, but again, a computer takes care of that detail. I also hope you'll understand the concept of the "partial F-test," which allows for testing multiple slopes at once in the context of a regression model, and is useful for testing multi-categorical predictors in a linear regression context. So, the algorithm to estimate the multiple regression equation is called "least squares" estimation, just like we saw with simple linear regression. The idea is the same, just extended into multiple dimensions: find a line, or actually a multidimensional object like a plane or beyond when we have multiple x's on the right-hand side, that gets "closest" to all points in the sample. So again, how are we going to define closeness to multiple points? Well, we just extend the idea that we had for simple regression: in multiple linear regression, closeness is defined as the cumulative squared distance between each observation's y-value and the corresponding value on the regression object at that observation's values of x_1 through x_p. In other words, the squared distance between the observed y-value and the estimated mean y-value for observations with the same values of x_1 through x_p. So, this least squares approach is used to estimate the slopes and the intercept for a specified regression equation on a given dataset. 
The algorithm chooses the values of the intercept and the slopes that minimize the total sum of squared residuals: the distance between each individual y outcome value and the corresponding mean, given all the x's for that particular observation in the sample. The equation looks complicated, but it's just an extension: we minimize the distance between each individual observation's observed outcome value and the mean predicted by the multiple regression equation. So, we minimize this cumulative sum of squared distances, and this can be done algorithmically via calculus. I'm just giving you a heads-up in case you're interested, but this method also gives us standard errors for the intercept and slope estimates. So, just to think about this with more than one predictor: the linear regression model is no longer estimating a line in two-dimensional space. For example, if we have a regression with two x's, the shape being described by the regression equation is a "plane" in three-dimensional space, and for more than two x's, we can't even visualize the resulting shape in a single graphic; it goes beyond three dimensions. So, again, this least squares algorithm also gives standard error estimates for the intercept and slopes, and these standard errors allow for the computation of 95 percent confidence intervals and p-values for these slopes and intercept. Just like we've seen with all other regression quantities, the random sampling behavior of regression slopes and intercepts is normal in "large samples," and you would need to appeal to a t-distribution in smaller samples, but we're not going to worry about that detail when doing things by hand; the computer will take care of it. So, again, for any single intercept or slope estimate, it's "business as usual": we're getting 95 percent confidence intervals and doing hypothesis tests. 
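To make this concrete, here is a minimal sketch of fitting a multiple regression by least squares and recovering the standard errors just mentioned. The data are simulated for illustration (this is not the course's arm circumference dataset), and all the variable names and numbers here are made up.

```python
import numpy as np

# Simulate n observations with two predictors (illustrative values only).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=1.0, size=n)

# Design matrix: a column of ones for the intercept, then the x's.
X = np.column_stack([np.ones(n), x1, x2])

# Least squares chooses the coefficients minimizing sum((y - X @ beta)**2).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance and standard errors (p + 1 = 3 estimated coefficients).
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

print(beta_hat)  # estimates near the true values 1.0, 2.0, -0.5
print(se)
```

With two x's, the fitted object is the "plane" described above; the code is identical for any number of predictors, only the number of columns in X changes.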
So, for example, if we wanted to get a 95 percent confidence interval for the true population-level intercept, we could take our estimated intercept plus or minus two estimated standard errors. That said, we may or may not want to do this depending on whether the resulting intercept has scientific relevance; even though we need it as a placeholder, we may not be that interested in this confidence interval. And for any slope beta_i, where i runs from one to p, and p is the number of x's, we just take our estimated slope plus or minus two estimated standard errors of that slope. The p-value for any of the slopes is business as usual: what this tests is whether the given x associated with that slope is a statistically significant predictor of our outcome, after adjusting or accounting for the other x's in the model. Just as before, the null is that the individual adjusted association at the population level is zero, versus the alternative that it's non-zero. So, what we need to do, as we've done with all hypothesis tests from the start, is assume the null is true, that there is no association and the adjusted slope is zero, and calculate the distance of our slope estimate from zero in units of standard error. So, we would take our specific slope, beta hat i, divide it by its standard error to get that standardized distance, and then translate that into a p-value. So, let's go back to our predictors of arm circumference example. Remember we looked at several models side by side. I'm going to focus on some results from model two here, and look at where the confidence intervals came from. In this model, the standard error of the slope for height, for example, is 0.03, and the standard error of the slope for weight is 0.10. Those were beta one and beta two respectively, so let's go ahead and create confidence intervals with these and show where the confidence intervals in that table came from. 
So, the 95 percent confidence interval for the slope of height, in a model that also includes weight and age: we take the estimated slope of negative 0.09, plus or minus two estimated standard errors of 0.03, and that gives us a 95 percent confidence interval for this adjusted association between arm circumference and height of negative 0.15 to negative 0.03. Similarly, for the relationship between arm circumference and weight adjusted for height and age, the estimated adjusted association (slope) is 1.32; when we add and subtract two standard errors, we get a confidence interval that goes from 1.12 to 1.52, as presented in the table. For the p-value, let's just look at one of these slopes to remind you of the process: testing whether the adjusted association between arm circumference and height, represented by beta one, is zero at the population level or not. We start with the null that it is zero and measure the distance of our result, which was negative 0.09, from zero. So, that's negative 0.09 divided by the standard error of 0.03, and we get an estimate that is three standard errors below what we'd expect under the null, which is relatively far from the center of the corresponding sampling distribution. So, we know the p-value would be less than 0.01. We can use the computer to get the exact p-value if interested; it would give that to us if we fit the regression model with the computer. Now, we also had age as a predictor in this model two, and age was multi-categorical. So, conceptually, where does the p-value for testing age collectively, by testing all three of its slopes at once in a single test, come from? 
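The arithmetic just walked through can be written out in a few lines. This is simply the plug-in calculation using the numbers quoted above, with the large-sample normal approximation for the p-value:

```python
from math import erf, sqrt

# Slope and standard error for height from model 2, as quoted above.
b_height, se_height = -0.09, 0.03

# 95% CI: estimate plus or minus two estimated standard errors.
ci = (round(b_height - 2 * se_height, 2), round(b_height + 2 * se_height, 2))
print(ci)  # (-0.15, -0.03)

# Standardized distance from zero, in units of standard error.
z = b_height / se_height

# Two-sided p-value from the standard normal CDF (large-sample approximation).
def normal_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

p = 2 * (1 - normal_cdf(abs(z)))
print(round(z, 1), round(p, 4))  # -3.0 0.0027
```

The exact p-value here is about 0.003, consistent with the "less than 0.01" eyeball conclusion above; regression software reports this automatically.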
So, as noted in the first lecture section, when a predictor is multi-categorical, and hence is modeled with multiple x's, in order to test whether the predictor is statistically significantly associated with the outcome, it's not enough to test each slope individually, for reasons we talked about in that first lecture section. In our example, we had four age categories, and had to estimate three mean differences, or slopes, between each of the non-reference categories and the reference, which was the youngest group. So, in order to formally test whether age is a statistically significant predictor of arm circumference, we need to test the null that the three slopes for age are all zero. So again, why test all three slopes together? Well, hypothetically, it may be that the difference between age group two and age group one, adjusted for weight and height, is not statistically significant, and the difference between age group three and the same reference, age group one, is not statistically significant, and the difference between age group four and the same reference is not statistically significant. That's possible, and if we just looked at the confidence intervals and p-values for these three slopes, we might conclude that age wasn't an important predictor of arm circumference after accounting for height and weight, but we'd be missing other comparisons. We're missing the comparison between age group three and age group two, age group four and age group two, and age group four and age group three, simply because of our arbitrary coding. So, it's possible that one or more of those differences is statistically significant. The null we want to test is that all three slopes are zero; if all three differences from the reference are zero, then all differences between the remaining groups are also zero. 
So, the catch is that even though at face value it may look like we would not reject the null of these three differences being zero, there are also hidden comparisons here. The actual alternative is that at least one of these slopes is non-zero, and while it looks like we're restricting it to the three specific comparisons we've modeled, it's really that at least one of the group differences, not just the ones we've modeled by our choice of coding, is non-zero. So, this tests more broadly for any differences in mean arm circumference between age groups after adjustment. How does this work? Well, the test that does this is called a partial F-test, and we're not going to do it by hand; I just want to give you some insight into where it comes from. What it does is compare the amount of variability in y, in our case arm circumference, explained by weight, height, and age to the amount of variability in arm circumference explained by weight and height only. So, in this framework, we have two models, where one is nested in the other. The model that includes more predictors is called the extended model: the one with height, weight, and age. Then we compare it to what's sometimes called a null model; if we're testing specifically whether age is a statistically significant predictor, the null model would include only height and weight, and would not include age. We're testing this extended model against the null. If we choose the null model over the extended, we're concluding that we don't need age, that we don't need those three extra slopes. If we choose the extended model over the null, we're concluding that age adds information above and beyond height and weight. This needs to be done with a computer. The approach is generalizable to any null and extended model setup, so we don't need to test just one predictor at a time, but the setup must be such that the null is nested within the extended. 
In other words, the extended model includes everything that's in the null plus additional predictors. There can't be predictors in the null that are not in the extended model. The partial F-test, in general, tests whether the slopes for all of the additional predictors in the extended model equal zero. So, what this test does is ask the question: does the extended model explain enough extra variability, in this case in arm circumference, compared to the null model to justify having to estimate three more slopes to model age with the same amount of data? There is a price to pay for estimating more slopes with the same amount of data: if those extra x's don't add extra information about the outcome, we're estimating things we don't need to, and that takes away from the precision of the other estimates. So, this is basically asking: do we explain, or increase our understanding of, the variability in the outcome enough to justify estimating more slopes with the same amount of data? It's like a return-on-investment indicator. The end result is a p-value, which we can interpret in the context of the null, and again, the computer will do this. So, in summary, when we're looking at individual quantities, the intercept or individual slopes, the construction of confidence intervals for multiple linear regression slopes and intercepts is business as usual, as is getting p-values: we take our estimate and add and subtract two estimated standard errors. We didn't do this by hand in smaller samples and there's no reason to, but I'll just remind you that in smaller samples we have to be a little more conservative, and we get the number of standard errors to add and subtract to get 95 percent coverage from a t-distribution with n minus (p plus 1) degrees of freedom, where p is the number of x's in the regression model. 
So, the degrees of freedom is the total sample size minus the number of things we have to estimate: one intercept and p slopes. Again, this detail is handled by the computer. I just want to make you aware of it, so that if you see confidence intervals for multiple regression that are wider than plus or minus two standard errors, you know it's because of a slight correction for the smaller sample size. Confidence intervals for slopes are confidence intervals for adjusted mean differences in multiple linear regression. Confidence intervals for intercepts are confidence intervals for the mean of y for a specific group, the group with all x's equal to zero, which is not always relevant when at least some of our x's are continuous. Formally testing multi-categorical predictors requires testing two or more slopes together, as opposed to individually. This can't easily be done by hand or eyeballed like we could for individual slopes, but it can be done using a partial F-test. I'm just making you aware of the name of the test, as you may see it referenced in articles. The resulting p-value tells us whether the multi-categorical predictor is a statistically significant predictor of y above and beyond, or after accounting for, the other predictors in the model.
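For the curious, the small-sample multiplier alluded to above can be pulled from a t-distribution with n minus (p plus 1) degrees of freedom. This sketch assumes scipy is available, uses p = 3 x's purely for illustration, and shows the multiplier shrinking toward the familiar large-sample value of roughly 2 (1.96) as n grows:

```python
from scipy.stats import t

p = 3  # number of x's in the model (illustrative)
for n in (15, 30, 100, 1000):
    # Multiplier for a 95% CI: 97.5th percentile of t with n - (p + 1) df.
    mult = t.ppf(0.975, df=n - (p + 1))
    print(n, round(mult, 3))  # approaches 1.96 as n grows
```

With n = 15 the multiplier is about 2.2, noticeably more conservative than 2; by n = 1000 it is essentially 1.96, which is why the plus-or-minus-two rule works fine in large samples.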