So in this section, we'll look at some examples of linear regression from the public health and medical literature. What you'll get here are some opportunities to interpret the results from simple and multiple linear regression models presented in published journal articles. So let's go back to one we've used several times in this course, the article from JAMA in 2012 on academic physician salaries and physician sex. The goal of this study was to compare salaries between females and males, accounting for differences between females and males in aspects of their academic careers that may also be related to salary. So, again, they reported to start that the unadjusted mean salaries in the cohort were $167,669 per annum for women and $200,433 per annum for men, a difference of close to $33,000 less for women than for men. But this was the unadjusted comparison. What they found after adjustment for other factors, including specialty, academic rank, leadership positions, types and number of publications, and research time, is that there was still a difference, it was statistically significant, and it was still large: males made over $13,000 more on average than women who were comparable in terms of those other adjustment factors. Still large, but not as large as the initial unadjusted difference. So some of the association was confounded, or explained, by some of these other predictors, which must be related to both the sex of the person and the outcome of salary.
So let's just look at some text from their methods section. I took this directly from the article. They say, "We limited the analytical sample to individuals who held MD degrees, were still affiliated with US academic institutions, and reported salary."
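As a quick check on the arithmetic behind the unadjusted and adjusted comparison, here is a minimal Python sketch. It uses the two unadjusted means quoted above and the adjusted difference of $13,399 that the article's final model reports. The variable names are illustrative, and the "share that goes away after adjustment" is only a descriptive shorthand for the attenuation, not a quantity the authors report.

```python
# Figures quoted in the lecture from the 2012 JAMA article.
mean_salary_women = 167_669
mean_salary_men = 200_433
adjusted_difference = 13_399  # men minus women, final adjusted model

# Crude (unadjusted) difference in mean salary, men minus women.
unadjusted_difference = mean_salary_men - mean_salary_women

# Fraction of the crude difference that disappears after adjustment.
share_attenuated = 1 - adjusted_difference / unadjusted_difference

print(unadjusted_difference)              # 32764
print(round(100 * share_attenuated, 1))   # 59.1 (percent)
```

So roughly three fifths of the crude gap is accounted for by the adjustment variables, and the remaining $13,399 is the difference between otherwise comparable men and women.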
So some people who took the survey didn't report their salaries, and what they did initially was compare the non-responders to those who responded in terms of other characteristics, to see if there was bias in their sample. Then they said that, from the resulting sample of persons who reported their salary, they described characteristics of the sample separately by sex, and we've looked at some examples of that throughout the course. And they constructed multivariable linear regression models for salary with the following respondent characteristics: gender, age, race, marital status, parental status, additional graduate degrees, et cetera, specialty pay level, current institution type, onward and upward. Most characteristics, and this will sound familiar, were categorical and were modeled as indicator variables with a reference category. Continuous characteristics were included as well.
They did something where they centered the continuous variables at their means: instead of using, for example, the age of the person, they used the difference between that person's age and the average age for all people. It doesn't actually change much about the regression. The only thing it can do is make the intercept more interpretable, because a value of zero on a centered continuous variable corresponds to the mean, so when we evaluate the model with that predictor at zero we're evaluating at the mean age, for example. But it doesn't change anything else about the slopes or the relationships that are estimated.
They say they constructed both a full model using all covariates and a parsimonious model, whereby they iteratively deleted variables from the model based on improvement in something called the Akaike information criterion. That's something we didn't discuss in this class, but essentially what they did was systematically remove non-statistically significant predictors from the model they started with, which included everything, until they got a subset of only those that were statistically significant.
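The point about centering can be checked numerically. The sketch below is a minimal illustration on made-up data (the ages, salaries, and the helper `fit_line` are all hypothetical and have nothing to do with the article's data). It fits a simple linear regression twice, once on raw age and once on age minus its mean, and compares the two fits.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: salary rises with age, plus noise.
random.seed(0)
ages = [random.uniform(30, 65) for _ in range(200)]
salaries = [100_000 + 1_500 * a + random.gauss(0, 10_000) for a in ages]

mean_age = sum(ages) / len(ages)
centered_ages = [a - mean_age for a in ages]

raw_intercept, raw_slope = fit_line(ages, salaries)
centered_intercept, centered_slope = fit_line(centered_ages, salaries)

print(raw_slope, centered_slope)          # same slope either way
print(raw_intercept, centered_intercept)  # only the intercept moves
```

The slope is identical in both fits. The centered intercept equals the raw intercept plus the slope times the mean age, which is the estimated mean salary at the average age rather than at an age of zero.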
So their preferred approach to estimating the so-called final model was to include only predictors that were statistically significant. They presented the results from both models, one that included everything and one that included only those things that were statistically significant, and it spans several pages in the article because there were so many potential predictors of interest. So, just taking it from the first page of Table 3, which again goes on for two more pages, here is what they showed for their initial model. This is the model where they threw in everything and kept it regardless of whether it was statistically significant. I'm just pulling the intercept and slopes from this table to translate it into a model in equation form. Their outcome here is average salary, and the intercept they provide for the model, the starting point, is $136,064. Then the first predictor they report is what they call gender, it's biological sex, and they coded it as one for male and zero for female. The slope of this sex variable is $12,001. So when they adjusted for everything, and kept in those things that were not necessarily statistically significant, the average difference they found in salaries between men and women comparable on all other factors was about $12,000, and it was statistically significant. The next thing included in this model that had everything was race. There were four categories for race and white was the reference, so they had indicators for Asian or Pacific Islander, with a slope of negative 472, and Black or African-American, with a slope of negative 19,422. So that's the average adjusted difference in salaries for those who identified as African-American compared to their white counterparts, all other characteristics being the same. And then there was a slope for other, those who didn't identify as one of the other three groups.
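To make the "translate the table into an equation" step concrete, here is a minimal Python sketch that uses only the intercept and slopes quoted above from the full model. The dictionary and function names are illustrative. Every other term in the article's model is left out, including the slope for the "other" race category, which is not quoted here, so the sums are only meaningful for a person at the reference level of every omitted categorical predictor and at the mean of every centered continuous predictor.

```python
# Intercept and slopes quoted in the lecture from the first page of Table 3
# (full model). All other terms in the article's model are omitted.
FULL_MODEL_TERMS = {
    "intercept": 136_064,
    "male": 12_001,                     # coded 1 = male, 0 = female
    "asian_pacific_islander": -472,     # reference category: white
    "black_african_american": -19_422,  # reference category: white
}

QUOTED_RACES = ("white", "asian_pacific_islander", "black_african_american")

def partial_mean_salary(male, race="white"):
    """Sum the quoted terms for one covariate pattern.

    Omitted covariates are implicitly at their reference category
    (or at their mean, for centered continuous variables).
    """
    if race not in QUOTED_RACES:
        raise ValueError(f"no quoted slope for race category {race!r}")
    total = FULL_MODEL_TERMS["intercept"] + FULL_MODEL_TERMS["male"] * male
    if race != "white":
        total += FULL_MODEL_TERMS[race]
    return total

# The adjusted male-female difference is just the slope for sex.
print(partial_mean_salary(male=1) - partial_mean_salary(male=0))  # 12001
```

Whatever the values of the omitted covariates, they cancel in a difference between two people who share them, which is why the slope can be read as an adjusted difference even when the full equation is not available.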
Notice they report an overall p-value here, not one for each individual difference but one for the overall construct, which would come from the partial F test. Then they had other things such as age, and they kept that as continuous, et cetera. Notice that in this first set of variables they listed, and again there are many more included, some of which are statistically significant and were kept in the final model, the only predictor on this page that was statistically significant, and hence shows up in the final model column, is that of sex or gender. So the final model includes the starting intercept of 166,094 plus the slope for gender of 13,399. That is the final adjusted mean difference between males and females, adjusted for whatever else was included in the final model. There are other things and other adjusted estimates, but we can't see them in this part of the table because, again, it spans multiple pages.
Here's another interesting study, on cigarette pricing and infant mortality. This was published in JAMA Pediatrics in 2017. Here's the abstract. It says the importance of this is that raising the price of cigarettes by increasing taxation has been associated with improved perinatal and child health outcomes, and that transnational tobacco companies have sought to undermine tobacco tax policy by adopting pricing strategies that maintain the availability of budget or low-cost cigarettes. The objective of the study was to assess associations between median cigarette prices, cigarette price differentials, and infant mortality across the European Union. So this is a longitudinal ecological study conducted from 2004 to 2014 on the infant populations in 23 countries. Longitudinal because it was conducted over time, and ecological because they had summary measures on each of the countries and not information at the individual child level.
So what they did is they looked at median cigarette prices and the differential between these and minimum cigarette prices in each of the countries in each year. These differentials were calculated as proportions, by dividing the difference between the median and minimum cigarette price by the median price, and prices were adjusted over time for inflation. Their outcome here is a mortality rate, and even so they're doing a linear regression, which is not the common way to approach this. This would generally be done as a Poisson regression, but quite frankly the estimates we get will now not be on the log scale, and they can estimate differences in incidence rates as opposed to just ratios. As we saw with Poisson regression, we could exponentiate the results and get incidence rates for different groups. Here they start with the rates and don't have to transform them. But, interestingly enough, by doing this they are not only estimating on a different scale, they are also making slightly different assumptions about the uncertainty in these outcomes. So I actually think it would be preferable to do this with Poisson regression given the nature of the outcome data, but certainly we can get and interpret estimates of rate differences from linear regression.
So, what they said in their statistical analysis section was that the main outcome of the analysis was the rate of infant mortality, infant deaths per thousand live births. Consistent with previous research on this topic, they fit a linear fixed effects panel regression model with robust standard errors. This just has to do with the fact that they're looking at repeated measurements over time on the unit of observation, which is country, and there may be some shared information there. That's beyond what we've talked about in the course, but the standard errors are still interpretable as we have in all other situations.
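Going back to how the price differential is defined, the calculation can be sketched in a few lines of Python. The function name and the example prices are hypothetical. Only the formula, the difference between median and minimum price divided by the median price, comes from the description above.

```python
def price_differential(median_price, minimum_price):
    """Proportional gap between the median and minimum pack price."""
    if median_price <= 0:
        raise ValueError("median price must be positive")
    if minimum_price > median_price:
        raise ValueError("minimum price cannot exceed the median price")
    return (median_price - minimum_price) / median_price

# Hypothetical country-year: median pack at 5.00 euros, cheapest at 4.00.
example = price_differential(median_price=5.00, minimum_price=4.00)
print(f"{example:.0%}")  # 20%
```

A differential of zero means the cheapest pack costs the same as the median pack. Larger values mean cheaper budget cigarettes are available relative to the typical price.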
The panel part just means we have summary statistics, incidence rates, and other summary measures collected at the country level over time. They say that in their study the unit of analysis was the region, that is the country, over the period 2004 to 2014. So we have a situation where we don't have individual-level data, just summary measures on each country, 23 different regions across ten years. What they're showing us graphically is what the trend in infant mortality rates was over time in these 23 different units, actually different countries in Europe. They show sort of scatter plots over time where they just connect the dots for each of these countries, and then they contrast that with their main predictors of interest, which are the median and minimum price of cigarettes in those countries over time and, in fact, the proportional difference between those two things over time as well. So this was a nice visual display showing some of the variables in action. There's the dimension of time here, and this is done across 23 different units of observation, or countries.
So what they did then is present the results from unadjusted, or simple, linear regression models and multiple linear regression models side by side, so we could potentially look at confounding in these data. One of their predictors of interest was the median cigarette price in euros per pack, and they wanted to look at the impact of that on infant mortality within the year that the price was set. They also wanted to look at infant mortality as a function of the previous year's price, because there may be a lag. It may take a little while for things to happen. So, in the unadjusted situation, they show that for both these measures infant mortality decreases with increasing median price, both in the year of the increase and a year after as well.
So those slopes are negative and they are statistically significant. When they adjust for other characteristics, the results are still negative and statistically significant, but they attenuate a bit. And they go on to look at this for other factors as well. So I'm just presenting the results of the adjusted model here. What I did not put in, and should note, is that there's definitely an intercept, which I'm not showing and which we cannot get from this table. But this is the slope from the adjusted model for median cigarette price in the same year that the mortality measurement was taken, and this is the slope for median cigarette price in the year prior to when the measurement was taken. One thing they should do, since these events were unfolding over time, is account for the fact that there could be temporal trends in mortality that have nothing to do with cigarette pricing. And they actually did include time, as calendar year, to tease out and adjust for changes in infant mortality across time that may be independent of the other things changing over time: cigarette prices, GDP per capita in euros, unemployment, et cetera. They took time as continuous, but they included both a linear term and a quadratic term, so ultimately what they allowed for was a relationship that was non-linear as a function of time. Unfortunately, again, that's a little beyond the scope of what we can really parse in this course, but it can be done. There are some trade-offs to it, though. It potentially fits the time trend more nicely, but it is now very hard to interpret the slopes for time when there's also a squared version of it in the model. So here again I'm just showing the adjusted regression model translated from this table into regression equation format.
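On the point about the squared time term, a small Python sketch shows why the slope for time no longer has a single interpretation. The two coefficients below are hypothetical values chosen only for illustration, since the article's time coefficients are not quoted here. With a linear and a quadratic term, the change in the outcome from year t to year t + 1 is b_linear + b_quadratic * (2t + 1), so it depends on which year you start from.

```python
# Hypothetical coefficients for calendar time (not taken from the article).
B_LINEAR = -0.40
B_QUADRATIC = 0.02

def time_contribution(t, b_linear=B_LINEAR, b_quadratic=B_QUADRATIC):
    """Part of the fitted value that comes from time, with t in years."""
    return b_linear * t + b_quadratic * t ** 2

def one_year_change(t, b_linear=B_LINEAR, b_quadratic=B_QUADRATIC):
    """Change in the fitted value when moving from year t to year t + 1."""
    return (time_contribution(t + 1, b_linear, b_quadratic)
            - time_contribution(t, b_linear, b_quadratic))

for t in (0, 5, 9):
    print(t, round(one_year_change(t), 2))
# The one-year change shrinks toward zero as t grows, so no single
# number summarizes "the effect of one more year".
```

With only a linear term, `one_year_change` would return the same value for every t, which is what makes an ordinary slope easy to read.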
Unfortunately, they didn't provide the intercept, so we can't use the results in this table to estimate a given country's infant mortality from its cigarette pricing and other country-level characteristics. So let's see how they explain this, because it's a little bit complicated. We've got the median, and we've got the differential between median and minimum. They take the adjusted results here for the results part of their abstract, and they say that among the 53 million plus live births during the study period, an increase of one euro, or at the time $1.18 in US dollars, per pack in the median cigarette price was associated with a decline of 0.323 deaths per 1,000 live births in the same year and a decline of 0.16 deaths per 1,000 live births in the following year. So they're pulling off these two pieces here to talk about that. Then there's the price differential. The unit they used was not a one percent increase; these slopes are presented per 10 percent increase. An increase of 10 percent in the price differential between median-priced and minimum-priced cigarettes was associated with an increase of 0.07 deaths per 1,000 live births in the following year. They don't report the piece for the current year, ostensibly because it wasn't statistically significant, or they would presumably have mentioned it. So what they are claiming is that the cheaper the cheapest available cigarettes are, that is, the more difference or variation there is in cigarette prices around their median, meaning lower prices on one end, the higher the infant mortality rates, even after accounting for the median price and other country-level characteristics.
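Because only slopes are available, what we can compute are differences in the expected infant mortality rate, not the rate itself. Here is a minimal Python sketch using the three slopes as quoted above. The constant names, the helper function, and the example price changes are illustrative. The results describe adjusted associations between country-years and should not be read as a forecast for any particular country.

```python
# Adjusted slopes as quoted in the lecture (deaths per 1,000 live births).
SLOPE_MEDIAN_SAME_YEAR = -0.323       # per 1 euro increase in median price
SLOPE_MEDIAN_NEXT_YEAR = -0.16        # per 1 euro increase, following year
SLOPE_DIFFERENTIAL_NEXT_YEAR = 0.07   # per 10 percent increase in differential

def expected_change(slope, predictor_change):
    """Difference in expected outcome for a given change in one predictor,
    holding the other predictors in the model fixed."""
    return slope * predictor_change

# Hypothetical scenario: the median price rises by 0.50 euro per pack.
same_year = expected_change(SLOPE_MEDIAN_SAME_YEAR, 0.50)
next_year = expected_change(SLOPE_MEDIAN_NEXT_YEAR, 0.50)

# Hypothetical scenario: the differential widens by 20 percent, which is
# two units on the per-10-percent scale the slope is reported in.
differential = expected_change(SLOPE_DIFFERENTIAL_NEXT_YEAR, 20 / 10)

print(round(same_year, 4), round(next_year, 4), round(differential, 4))
```

The main thing to watch is the unit of each slope: the median price slopes are per euro, while the differential slope is per 10 percent, so the predictor change has to be expressed in those units before multiplying.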
So, they say cigarette price increases across the 23 European countries between 2004 and 2014 were associated with fewer infant deaths, and they use this to estimate the number of deaths that could have been avoided if the cost differential between median and minimum price had shrunk to zero, so that there was less variation and hence less availability of cheap cigarettes in the country.
Let's look at one more example, from Health Affairs. Just to read the abstract here, "Consolidation of physician practices has intensified concerns that providers with greater market power may be able to charge higher prices without having to deliver better care compared to providers with less market power. Providers have argued that higher prices cover the cost of delivering higher-quality care. We examined the relationship between physician practice prices for outpatient services and practice quality and efficiency of care. Using commercial claims data, we classified practices as high or low price. We used national data from the Consumer Assessment of Healthcare Providers and Systems survey and linked claims for Medicare beneficiaries to compare high and low price practices in the same geographic area in terms of quality, utilization, and spending." So, this is a lot of work managing data here to link up these different databases. "Compared with low price practices, high price practices were much larger and received 36 percent higher prices. Patients of high price practices reported significantly higher scores on some measures of care coordination and management, but did not differ meaningfully in their overall care ratings, other domains of patient experiences including physician ratings and access to care, receipt of preventive services, acute care use, or total Medicare spending. This suggests an overall weak relationship between practice prices and the quality and efficiency of care and calls into question claims that high price providers deliver substantially higher value care."
So, how do they do this? Let's talk about what they did. To compare the performance of high versus low price practices on measures of quality, utilization, and spending, they say, "We estimated a patient level linear regression model of each measure as a function of whether the patient was attributed to a high or low price practice." And then they looked at provider and patient level characteristics as well. So what they did was use multiple linear regression models to estimate the average difference in quality and utilization between high and low price practices located in the same area, controlling for area-level factors that affect quality, prices, and demand. To facilitate the interpretation of differences in experiences between patients of high and low price practices, they calculated effect sizes by dividing differences on the original survey scale by the standard deviation of between-area variation. So for things that don't have a straightforward physical interpretation, and it's all relative with these satisfaction scales, they standardized them so that they were comparable across different areas and different facilities.
What they showed here was not the full regression model results, because they looked at a bunch of different multiple regression models with the same set of predictors but with different outcomes. One of the outcomes was rating of health care, and this is the slope estimate for high versus low price: the average adjusted difference in rating of health care between high and low price facilities. This is expressed as an effect size, so standardized by the variability in this across different facilities, but what they found was that this difference was very close to zero, and when they put a confidence interval on it, of course it included zero. This vertical line up here is zero. So for the most part, if you look across here, they're just reporting the slope and a confidence interval for high versus low.
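The standardization step, and the way a confidence interval is read against zero, can be sketched in Python as follows. All the numbers are hypothetical, as are the function names. The only thing taken from the article's description is the definition of the effect size as the difference on the original survey scale divided by the standard deviation of between-area variation.

```python
def effect_size(adjusted_difference, between_area_sd):
    """Adjusted high-minus-low difference in between-area SD units."""
    if between_area_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return adjusted_difference / between_area_sd

def interval_includes_zero(lower, upper):
    """True when a confidence interval is consistent with no difference."""
    return lower <= 0 <= upper

# Hypothetical outcome: a 0.05 point adjusted difference on the survey
# scale, where area-level means vary with a standard deviation of 0.25.
standardized = effect_size(adjusted_difference=0.05, between_area_sd=0.25)
print(round(standardized, 2))  # 0.2 between-area standard deviations

# Hypothetical 95% confidence interval for that standardized difference.
print(interval_includes_zero(lower=-0.10, upper=0.50))  # True
```

Putting every outcome on this common scale is what lets the article display many models, each with a different outcome, on one plot with a single vertical reference line at zero.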
So the predictor is an X variable that was one for high price and zero for low, and they report that slope and its confidence interval from multiple regression models that include the aforementioned adjustment variables they talked about in the methods section. And they do this across models with different outcomes. Here's one, timely access to care, for example: the difference indicates slightly greater timely access in high-price facilities, but it's not statistically significant. So you can see some things do actually come out better for high-price facilities, like waiting time in the office of less than 15 minutes, but for many of these measures there's no substantial scientific or statistical difference. And here is some more explanation of how to interpret these differences given they've been standardized. For things like this it's always important to read the methods section and the fine print to get a sense of what scale they're reporting on. But I think the message here comes across consistently with what they started with: generally speaking, there were not meaningful or statistically significant differences in these outcomes between the high and low price facilities that were otherwise comparable on the adjustment factors.
So, hopefully these three examples have given you some more insight as to how linear regression is used in public health and science in general, and how to approach and think about how these results can be presented in journal articles as well.