A practical, example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

42 ratings


From the lesson

Introduction and Module 1A: Simple Regression Methods

In this module, a unified structure for simple regression models will be presented, followed by detailed treatises and examples of both simple linear and logistic models.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Welcome back. In this section, we'll talk a little bit about how the computer estimates the linear regression equation, given a set of data. And we'll also deal with accounting for uncertainty in our slope and intercept estimates via confidence interval creation and hypothesis testing.

So hopefully, you'll appreciate after this section that creating confidence intervals for linear regression slopes means essentially creating confidence intervals for mean differences. And the approach is business as usual. We take our estimated slope and add and subtract two or sometimes a little bit more standard errors. And if we want to get a p value, the approach is the same as well. We start by assuming the slope or the mean difference is zero, and then looking at how far our result is from what we'd expect under that null hypothesis. Similarly, creating a confidence interval for an intercept is akin to creating a confidence interval for a single population mean and follows the logic we used in Statistical Reasoning One.

So let's take a look at our arm circumference and height example again to start. In the last section we showed the results from several simple linear regression models, including this one with arm circumference and height. All we gave was the resulting estimated regression equation based on these 150 data points, which suggested that the mean arm circumference was related to height via the following equation: take 2.7 and add 0.16 times the group of children's height to estimate the mean arm circumference for that group. I got this from a computer package, but how does the algorithm work to estimate this equation?

Well there must be some sort of algorithm that will always yield the same results for the same data set, regardless of what computer package we use to estimate it.

So the algorithm to estimate the equation of the line is called least squares estimation. And the idea is to find the line that gets closest to all of the points in the sample: the line that estimates the means that have the least variability around those estimates. So how can we define closeness to multiple points?

Well, in regression, closeness is defined as the cumulative squared difference between each point's observed y-value and the corresponding estimated mean, y-hat, for that point's x-value. In other words, the squared distance between an observed y-value and the estimated mean value for all points with the same value of x. So each distance for each observed point in our data set can be estimated by taking that point's value, for example, that child's value of arm circumference, and subtracting the predicted mean of arm circumference for children with the same height. On a scatter plot, this is the vertical distance between each point and the mean for children with that height value, shown on the red regression line.

So the algorithm to actually estimate that regression line, again, is called least squares, because it minimizes the overall squared distance between all points and the line. Given the data, the computer chooses the values for the intercept and the slope that minimize the cumulative squared distances. So to find the values of beta-naught hat and beta-one hat that minimize that cumulative distance, we take each point in our data set, subtract the predicted mean via the regression equation from each child's arm circumference, square that distance, and add it up across all data points in the sample. The algorithm chooses the values of the intercept and slope that minimize that cumulative squared distance.

And the algorithm doesn't have to keep trying different combinations of beta-naught and beta-one until it finds the one that gives the minimum squared distance. This can actually be done pretty easily using a calculus-based approach: minimize this function by choosing the values of beta-naught hat and beta-one hat that make it smallest. The end result of this minimization gives us what are sometimes called closed-form equations: equations we can use to solve for the optimal values of beta-naught hat and beta-one hat in terms of the x and y values in our data set. But I would never expect anyone to do a regression by hand. In fact, I've never done a regression by hand, because the computations are arduous and time consuming. However, the equation is very cool and makes for a nice piece of apparel, as evidenced by the fact that I actually have it on my tie.

The end result, however, are estimates based on the data we have at hand, and these are just estimates based on our single sample from our population. So if we were to actually take different random samples from the same population, for example, different random samples of 150 Nepalese children from the same population of Nepalese children less than 12 months old, we might get different estimates of beta-naught and beta-one depending on the sample we used. In other words, the values that minimize the cumulative squared distance for different samples of the same size would likely differ across the samples. So there's some sampling variability in these estimates, and all regression coefficients, the intercept and slope, have an associated standard error.
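To make the closed-form least squares equations concrete, here is a minimal sketch in Python. The height and arm circumference values below are made up for illustration; they are not the actual Nepalese sample.

```python
import numpy as np

# Illustrative data only (NOT the actual Nepalese sample):
# x = height in cm, y = arm circumference in cm.
x = np.array([55.0, 58.0, 60.0, 62.0, 65.0, 68.0])
y = np.array([11.5, 12.1, 12.3, 12.6, 13.2, 13.5])

# Closed-form least squares estimates:
#   beta1_hat = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   beta0_hat = y_bar - beta1_hat * x_bar
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# These values minimize the cumulative squared vertical distance between
# each observed y and the estimated mean (y-hat) at that point's x.
sse = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)
print(beta0_hat, beta1_hat, sse)
```

Because the closed-form equations have a unique solution, any package's least squares routine (for example, `np.polyfit(x, y, 1)`) returns the same intercept and slope for the same data.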
That standard error can help us make statements about the true relationship between the mean of y and our x predictor based on a single sample. There is a true regression equation in the population that has a true slope and a true intercept; we can only estimate these quantities. So just like we've done with everything else that we estimate, we're ultimately going to have to deal with the uncertainty in these estimates.

So let's again go look at the estimated regression equation relating arm circumference to height based on this one sample of 150 Nepalese children. And again, here's our equation. The computer will also give us the resulting estimated standard errors for our intercept and slope. So for example, the slope was 0.16 and its estimated standard error is 0.014. So it turns out, remember, these slopes are ultimately mean differences, and the intercepts are means. And so the random sampling behavior of these estimated regression coefficients is essentially the random sampling behavior of differences in means, which we've already shown is generally normal from sample to sample and centered at the true value we're estimating. So we can use the same ideas we used back in Statistical Reasoning One for creating 95% confidence intervals for the true underlying population-level slopes and intercepts, and for getting p-values. So let's look at the estimated regression equation for arm circumference and height in Nepali children. This slope of 0.16 estimated the mean difference in arm circumference per one-centimeter difference in height.

That was just the estimate. If we actually create a confidence interval, the approach is the same old same old: we take our estimate, it's a mean difference, and we add and subtract two standard errors.

And we get a confidence interval of a 0.13-centimeter difference in arm circumference to a 0.19-centimeter difference in arm circumference, per 1-centimeter difference in height. We could also test whether the true population-level association, the mean difference in arm circumference per unit difference in height, was zero or not. So our null hypothesis is that this true population-level mean difference, the slope, is zero, and the alternative is that it's not zero. We'll do this the same way we've always done hypothesis testing: we'll assume our sample comes from a population where the true slope is zero, and then we'll measure how far our estimate is from zero in standard error units. And if we do this, we get a slope that's 11.4 standard errors above what we'd expect to see under the null hypothesis. So translating this to a p-value means getting the probability of being 11.4 or more standard errors away, either above or below, from a mean of 0 on a standard normal curve. And the p-value is very low. We already knew it would come in at less than 0.05, if you think about it, because the confidence interval for the slope did not include 0. But it's quite low, it's less than 0.001. So how could we write this up? We could say something like: this research used simple linear regression to estimate the magnitude of the association between arm circumference and height in Nepali children less than 12 months old, using data on a random sample of 150. A statistically significant positive association was found, and we could put the p-value in parentheses. The results estimate that two groups of such children who differ by one centimeter in height will differ on average by 0.16 cm in arm circumference.
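The confidence interval and test statistic above can be reproduced from just the two numbers reported in the lecture (slope 0.16, standard error 0.014); this sketch assumes scipy is available for the normal tail probability.

```python
from scipy import stats

beta1_hat = 0.16  # estimated slope: mean difference in arm circumference (cm)
se = 0.014        # estimated standard error of the slope (cm)

# 95% CI: estimate plus/minus roughly two standard errors (large sample)
ci = (beta1_hat - 2 * se, beta1_hat + 2 * se)   # (0.132, 0.188)

# Test H0: true slope = 0. Distance from the null in standard error units:
z = (beta1_hat - 0) / se                        # about 11.4

# Two-sided p-value: probability of being this many or more standard
# errors away from 0 on a standard normal curve.
p = 2 * stats.norm.sf(abs(z))
print(ci, round(z, 1), p)
```

Note the consistency between the two approaches: because the 95% interval excludes 0, the p-value is guaranteed to come in below 0.05.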

In other words, it's an increase in arm circumference with an increase in height. And a 95% confidence interval, which gives a range of possibilities for the true mean difference in arm circumference per 1-unit difference in height in the entire population of such children, goes from 0.13 centimeters to 0.19 centimeters.

What if I wanted to give an estimate and a 95% confidence interval for the mean difference in arm circumference for children 60 centimeters tall compared to children 50 centimeters tall? Well, from the previous lecture section, we know that this estimated mean difference can be expressed in terms of the slope by taking the difference in our x value, which is 10 centimeters, or 10 units, and multiplying it by the estimated mean difference in y per one-unit difference in x. So, the estimated mean difference in arm circumference per 1-unit difference in height was 0.16 centimeters. So if the difference in height is 10 centimeters, this would accrue to a cumulative difference of 1.6 centimeters on average. But how do we actually get the standard error for this mean difference for more than a one-unit difference in our x value? Well, it turns out anything we do to our slope, we do to the standard error. So if our resulting comparison yields an estimate of 10 times the slope estimate, we take the standard error for the slope and multiply it by 10. In other words, the estimated standard error of 10 times the slope is equal to 10 times the standard error of the slope. So the standard error of 10 times beta-one hat is 10 times the standard error for beta-one hat of 0.014 centimeters, and that turns out to be 0.14 centimeters. So the 95% confidence interval for the mean difference in arm circumference for these two groups of children who differ by 10 centimeters in height is the estimated 1.6-centimeter difference in average arm circumference, plus or minus 2 times that standard error of 0.14 centimeters.
And if you do this out, we get a confidence interval of 1.32 centimeters to 1.88 centimeters. So that interval describes our uncertainty in the estimated mean difference in arm circumference between two groups of children who differ by 10 centimeters in height.
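The "anything we do to the slope, we do to its standard error" rule is just arithmetic, sketched here with the lecture's numbers:

```python
beta1_hat = 0.16   # cm of arm circumference per 1 cm of height
se_slope = 0.014   # standard error of the slope

c = 10  # comparing two groups of children who differ by 10 cm in height
diff = c * beta1_hat       # estimated mean difference: 1.6 cm
se_diff = c * se_slope     # scaled standard error: 0.14 cm

# 95% CI: estimate plus/minus two (scaled) standard errors
ci = (diff - 2 * se_diff, diff + 2 * se_diff)   # (1.32, 1.88)
print(diff, se_diff, ci)
```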

Recall our hemoglobin and packed cell volume example, where the estimated regression line relating mean hemoglobin level to packed cell volume was given by this equation: the average hemoglobin level is equal to the intercept of 5.77 plus 0.2 times packed cell volume, measured in percent.

So how are we going to compute a 95% confidence interval for this slope? Well, this is exactly the same idea as we just saw. But this sample was only 21 subjects. So in order to get a confidence interval and p-value, we're going to have to go slightly more than plus or minus two standard errors to get our confidence interval. And we'll have to compare our resulting difference between our estimate and the null value not to the standard normal curve, but to a t-distribution with n − 2, or 19, degrees of freedom.

And again, I'm not going to ask you to do this in a testing situation, or if I did, I would give you this value. The computer will handle this, but it's just nice to remember that in smaller samples, we have to be a little more conservative. So if we went to a t-distribution, or let our computer do the work for us, the number of standard errors required to capture the middle 95% of values in a t-distribution with 19 degrees of freedom is 2.09. So, in order to get this confidence interval, we take the estimated mean difference in hemoglobin per 1% difference in packed cell volume and add and subtract 2.09 times the estimated standard error of our slope, which is 0.046. And we get a confidence interval that goes from 0.1 to 0.3 grams per deciliter per 1% difference in packed cell volume. So notice that that confidence interval does not include 0, so we already know this result will be statistically significant at the 0.05 level.
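Here, assuming scipy is available, is how the t-multiplier of 2.09 and the resulting interval can be computed for the n = 21 sample:

```python
from scipy import stats

n = 21
df = n - 2                          # 19 degrees of freedom
t_mult = stats.t.ppf(0.975, df)     # about 2.09 (vs ~1.96 for large samples)

beta1_hat = 0.20   # g/dL of hemoglobin per 1% difference in packed cell volume
se = 0.046         # estimated standard error of the slope

ci = (beta1_hat - t_mult * se, beta1_hat + t_mult * se)   # roughly (0.10, 0.30)
print(round(t_mult, 2), ci)
```

As the lecture notes, in smaller samples the t-multiplier is larger than 2, making the interval slightly wider and hence more conservative.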

However, if we wanted to get the p-value for testing the null that the true slope of packed cell volume in the population from which the sample was taken is 0, versus the alternative that it's not 0, we'll again assume the null is true: assume the true slope is zero, that our sample comes from a population where there's no association between hemoglobin and packed cell volume. We look at how far our estimated slope of 0.2 is from 0 in terms of standard errors, and we get something that's 4.35 standard errors above what we'd expect under the null. So the resulting p-value is the probability of being 4.35 or more standard errors above or below what we'd expect under the null, but we're referring this to a t-curve with 19 degrees of freedom. Nevertheless, in this example, the p-value comes in very low, at less than 0.001. So, the estimated slope is 0.2 with a 95% CI of 0.10 to 0.30. So how can we interpret these results? We can say, based on a sample of 21 subjects, we estimated that packed cell volume is positively associated with hemoglobin levels, and we could put the p-value, less than 0.001, in parentheses if we wanted to. We estimated that a one-percent increase in packed cell volume is associated with a 0.2 grams per deciliter increase in hemoglobin on average.
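The t-based p-value works out as follows (again assuming scipy; 0.2 and 0.046 are the slope and standard error from the lecture):

```python
from scipy import stats

t_stat = 0.20 / 0.046          # about 4.35 standard errors above the null
# Two-sided p-value from a t-distribution with n - 2 = 19 degrees of freedom
p = 2 * stats.t.sf(abs(t_stat), 19)
print(round(t_stat, 2), p)
```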

Accounting for sampling variability, this mean increase could be as small as 0.1 grams per deciliter or as large as 0.3 grams per deciliter in the population of all such subjects. So that brings in the confidence interval to express our uncertainty in how much that mean difference in hemoglobin is per one-percent difference in packed cell volume.

In other words, we estimated that the average difference in hemoglobin levels for two groups of subjects who differ by one percent in packed cell volume is 0.2 grams per deciliter. And accounting for sampling variability, this mean difference could be as small as 0.1 grams per deciliter or as large as 0.3 grams per deciliter in the population of all such persons. So what about the intercepts? So far, I've shown you how to construct confidence intervals and do hypothesis testing for the slope from linear regression, and for multiples of the slope. We can also create confidence intervals and get p-values for the intercept in the same manner, although they won't always be that useful, and Stata and other computer packages will present this in the output they give from regression. However, as we've talked about, when x1 is a continuous predictor, many times the intercept is just a placeholder and does not describe a useful quantity or a quantity of relevance to our data. As such, 95% confidence intervals are not always relevant. However, when our predictor is binary or categorical, the intercept may have a substantive interpretation, and a 95% confidence interval, at least, may be of interest. So let's take a look at an example of that.

So you recall the analysis that we did in Statistical Reasoning One, and that we just redid as a linear regression in a previous section here, of length of stay by age at first claim among the subjects from the Heritage Health study. When we regressed average length of stay on an indicator of whether the person was less than 40 at first claim, or greater than or equal to 40, we got a slope of -2.1 and an intercept of 4.9. We interpret the slope as the estimated mean difference in length of stay for persons less than 40 at first claim compared to persons over 40, and that was -2.1 days: the younger group had average lengths of stay 2.1 days less than the older group. And the intercept actually had meaning in this analysis. It was the estimated mean length of stay for persons over 40 for their first stay in 2011, their first claim.
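With a binary predictor, the fitted equation reduces to two group means, which a quick sketch with the lecture's estimates makes clear:

```python
# Estimates from the length-of-stay regression in the lecture:
beta0_hat = 4.9    # intercept: mean length of stay (days), 40-and-over group (x1 = 0)
beta1_hat = -2.1   # slope: mean difference in days, under-40 minus 40-and-over

# Plugging in the two possible values of the binary indicator x1:
mean_over_40 = beta0_hat + beta1_hat * 0    # 4.9 days
mean_under_40 = beta0_hat + beta1_hat * 1   # 2.8 days
print(mean_over_40, mean_under_40)
```

This is why the intercept is substantively interpretable here: it is simply the estimated mean for the group whose indicator equals zero.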

So we can get confidence intervals and p-values for both these quantities. For the slope, we estimate the mean difference between the younger group and the older group to be 2.1 days less for the younger group. But we have to account for uncertainty, and I should've put these in the proper order.

After accounting for the uncertainty in our estimate, this is the 95% confidence interval for the true mean difference in length of stay for all patients in 2011. You can see it's rather tight, because this was a large data set, and it indicates that the difference is on the order of two or more days. If we did a hypothesis test of whether the true association was zero, in other words that there was no association between length of stay and age at first claim, the p-value is quite low. We know that it would come in at less than 0.05, because our 95% confidence interval did not include zero, but this adds some specificity to the discussion. If we did a confidence interval for the intercept, the estimated mean length of stay for those who were over 40 for their first visit in 2011 was 4.9 days. And this confidence interval has meaning: it goes from 4.8 days to 5.0 days, and it expresses our uncertainty in that estimated mean. So we have a pretty tight interval here that suggests the true mean length of stay was close to 5 days, between 4.8 and 5 days, for the population of patients who were over 40 when they entered the hospital in 2011. We could get a p-value for this, but it really doesn't make sense to test whether the mean length of stay for this single group is zero or not.

Because we know it can't be zero, given that our data set only includes persons whose length of stay was 1 or greater, so a p-value doesn't really add anything to the story here.

So in summary, the construction of confidence intervals for linear regression slopes is business as usual: take the estimate and add and subtract two estimated standard errors, or slightly more in smaller samples. And we can also get a p-value by taking our slope estimate, converting it to the number of standard errors it is above or below the null value of zero, and then figuring out what percentage of results we could get that were that far or farther just by chance if the null is true. So the confidence intervals we get for slopes, and the resulting p-values, are confidence intervals and p-values for mean differences. And the confidence intervals for intercepts are confidence intervals for the mean of y for a specific group or a specific population: the population whose x1 values are equal to zero. As we've discussed, this is not always relevant or helpful when x1 is continuous, but it can add information to the analysis when our predictor is binary or categorical.
