Greetings. In this section, we'll look at creating confidence intervals for comparing means of continuous outcomes between two populations, based on the results for two samples from the respective populations. So in this lecture set, you will learn how to estimate and interpret 95% confidence intervals for a mean difference between two populations under two types of study designs: paired, which we'll define explicitly with some examples, but which is when the two samples drawn from the populations under study are linked systematically; and unpaired, when the two samples are drawn from two independent, unlinked populations. So let's start with the paired sample, and we will see that the conceptual approach to creating confidence intervals, whether we have a paired design or an unpaired one, is fundamentally exactly the same. It's just that the mechanics are slightly different. But this will give us a good starting point to think about what we are doing. So this was an example published in the Journal of Clinical Epidemiology back in the late 80's, when AIDS and HIV were becoming a big deal worldwide. And researchers were trying to figure out how to measure the progression of the disease, and how to diagnose it. At the time, the standard was to have a physician feel for the number of inflamed lymph nodes and make a diagnosis based on that. And so of course, the question with such an approach to diagnosing a disease or a syndrome is: is it replicable among different physicians? So this study was done to see whether it could be replicated or not. Two different physicians were used to assess the number of palpable lymph nodes in 65 randomly selected male sexual contacts of men who had been diagnosed with AIDS or an AIDS-related condition. And the researchers wanted to see what conclusions each of the doctors came to with regards to how many lymph nodes were palpable in these men.
So each doctor examined each of the 65 patients and reported how many lymph nodes they found. And here are the summary results for each of the two doctors. Doctor 1 reported an average of 7.91 nodes, but there was a fair amount of variability in his or her assessments. And Doctor 2 reported an average of 5.16 lymph nodes, with similar levels of variability. The average difference, if we go in the direction of Doctor 2 compared to Doctor 1, was -2.75 lymph nodes. So on average, on the same 65 patients, Doctor 2 found almost 3, well, 2.75, fewer lymph nodes than Doctor 1. And the variability in the differences between these two doctors was 2.83 nodes. So let's just think about the study design and data structure, because we pulled a little trick there. What we have here are two samples from a population of men, sexual partners of men with HIV or AIDS. The two populations they represent are this population as inspected by Doctor 1, and the same population as inspected by Doctor 2. We have 65 men, and each of them is examined by both doctors. So the first measurement, for male 1 by Doctor 1, we might call x1, and we call it y1 for the same person as measured by Doctor 2; then x2, y2, etc. Because each man is inspected twice, these samples are paired: for every male, there is a measurement from Doctor 1 and a corresponding measurement from Doctor 2. So immediately, we can reduce this set of data across two samples into a single sample of differences, because of that structure linking the two data sets. We could compute 65 differences, comparing the measurement from Doctor 2 to the one from Doctor 1. And so now we go from two samples of individual measurements from each of the doctors to a single sample of 65 differences comparing the measurements from the two doctors. And we can take the mean of those differences. I'll just call it y bar minus x bar, but we might recast this as simply the mean of the differences.
And we can take the standard deviation of those 65 differences in the measurements from Doctor 2 compared to Doctor 1. So what we end up with is a single mean difference. And we can estimate the standard error of that difference based on the standard deviation of the 65 per-patient differences for Doctor 2 compared to Doctor 1, which was 2.83, divided by the square root of the number of pairs in our sample, 65. And so we create a confidence interval for the truth: if these doctors examined the partners of all men diagnosed with AIDS, what would be the range for the true difference in means? Take that observed difference of -2.75, plus or minus two standard errors. The standard error turns out to be 0.35 nodes, and we get a confidence interval from -3.45 to -2.05. So let's think about what this means. One way to interpret this: had all such men, all male sexual partners of men diagnosed with AIDS, been examined by these two physicians, instead of just the 65 in the sample, the average difference in the lymph nodes discovered by the two physicians would be between -3.45 and -2.05. Notice that all the possibilities for the true mean difference are negative; 0 is not included in this interval. Now, this study isn't really about Doctor 1 and Doctor 2 per se, but it does set up a situation where these two physicians can't agree on the number of lymph nodes, as evidenced by the large difference in the counts they reported on average, and by the fact that the confidence interval for the conclusions they would reach, had they both examined all men in the population of interest, does not include zero. These two physicians disagree by relatively large amounts: even in the best-case scenario, they're discovering, on average, a difference of more than two lymph nodes.
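The paired-interval arithmetic just described can be sketched in a few lines of Python; the summary statistics (mean difference, standard deviation of the differences, number of pairs) are taken from the example above.

```python
import math

# 95% CI for the true mean difference, paired design (lymph node example)
mean_diff = -2.75    # mean difference in counts, Doctor 2 minus Doctor 1
sd_diff = 2.83       # standard deviation of the 65 within-patient differences
n_pairs = 65

se = sd_diff / math.sqrt(n_pairs)   # standard error of the mean difference
lower = mean_diff - 2 * se
upper = mean_diff + 2 * se
print(f"SE = {se:.2f} nodes; 95% CI: ({lower:.2f}, {upper:.2f})")
# prints: SE = 0.35 nodes; 95% CI: (-3.45, -2.05)
```

Note that the whole interval sits below zero, which is what drives the reproducibility conclusion discussed next.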
Then it means this method is not reproducible, and this was a big insight in terms of advancing the science in clinical medicine toward better diagnosis of HIV and AIDS. Note that with any comparison we make, the direction of the difference is arbitrary. Had we done the comparison of Doctor 1 compared to Doctor 2, instead of the mean difference of -2.75 it would be positive 2.75, because Doctor 1 found 2.75 more nodes on average than Doctor 2. And the confidence interval would still have the same absolute values at the endpoints, but with the opposite sign: positive 2.05 to 3.45. So we'd reach the same conclusion, just with the opposite values. The result, again, is not about these two doctors per se. It shows that this diagnostic approach is not reproducible across different examiners. And the resulting doctor-to-doctor differences are not fully explained by sampling variability: even after we accounted for the sampling variability via the standard error, we saw a clear difference. If the study had not found a clear difference, if 0 had been in the confidence interval, then the results would have been ambiguous, mainly because we couldn't prove the method reproducible just from these two doctors finding similar results. But because even these two doctors couldn't come to an agreement, if we were to repeat this across multiple doctors, we'd probably get more discrepancies, and that essentially rules this out as a proper diagnostic technique. So here's another study, a pre-post study that we looked at in the first lecture section. This was done on 10 non-pregnant, pre-menopausal women 16 to 49 years old who were beginning a regimen of oral contraceptive use. They had their blood pressures measured prior to starting oral contraceptives, which I'll abbreviate as OC, and then again after three months of consistent oral contraceptive use.
And the goal of this small study was to see what, if any, changes in average blood pressure were associated with oral contraceptive use in such women. The data on the following slide show the resulting pre- and post-oral contraceptive use systolic blood pressure measurements for the 10 women in the study. The first woman's pre-measurement was 115 millimeters of mercury. After three months of oral contraceptive use, it went up to 128 millimeters of mercury, so she experienced an increase of 13 millimeters of mercury. The second woman's measurement at baseline was 112 millimeters of mercury; after 3 months of oral contraceptive use it went up to 115, so she went up by 3 millimeters of mercury. The third woman had a slight decrease. So if you look across these ten measurements, you can see that there's variability in the magnitude of the changes, and that the majority of the women, 8 of the 10, experienced an increase. If we create this column of differences and take the average, any single woman increased on average by 4.8 millimeters of mercury, but there's a fair amount of variability in these 10 differences, as we can see visually. Measured numerically, the standard deviation of those 10 differences is 4.6 millimeters of mercury. So in order to get a 95% confidence interval for the true mean change in blood pressure after oral contraceptives compared to before, for the population of all such women given oral contraceptives, we'll use our sample results: the estimated mean change after oral contraceptive use, plus and minus a fixed number of estimated standard errors.
Because this is a smaller sample, only 10 women, we would technically have to go to the t distribution with 10 - 1 = 9 degrees of freedom to get the number of standard errors we'd need to add and subtract to get 95% confidence. It turns out to be slightly more than 2; it's 2.26 standard errors. Certainly, if we were doing this on the computer, it would take care of that detail. And the estimated standard error we have here is based on the sample of differences: the standard deviation of those 10 differences, 4.6 millimeters of mercury, divided by the square root of 10. And so if we do this out, we get a 95% confidence interval for the true average change in blood pressure of 1.4 millimeters of mercury up to 8.2. So all possibilities show an increase in the average blood pressure, but this interval is certainly wide, at least partially because we only have 10 women in this study. So it's a little hard to interpret scientifically on the lower end: a shift up of 1.4 millimeters of mercury on average may not be something to worry about clinically. But certainly on the upper end, 8.2 millimeters of mercury would be a pretty alarming increase. So all signs point to an increase, but it's not necessarily clear how clinically relevant the average increase would be. Certainly the result is statistically significant, as 0 is not included in this confidence interval. But another interpretation issue with such a study is that we only look at one group of women. All women get the exposure of oral contraceptives: they have their baseline measurement taken, then they all get oral contraceptives, and their blood pressure is measured afterwards. It's not completely clear from this study design whether or not oral contraceptives are the driving factor in this increase. For example, during that same three-month period, the seasons could change, and perhaps the women in the study were less likely to exercise as it got colder, and ergo their blood pressure tended to increase.
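As a sketch, the same computation with the t-based multiplier looks like this. Recomputing from the rounded summary statistics (mean 4.8, SD 4.6) gives an interval of roughly 1.5 to 8.1, slightly different from the 1.4 to 8.2 quoted above, which was presumably computed from the raw data.

```python
import math

# 95% CI for the mean blood pressure change, small paired sample (n = 10)
mean_change = 4.8   # mean systolic BP change (post minus pre), mm Hg
sd_change = 4.6     # standard deviation of the 10 differences
n = 10

t_crit = 2.262      # t quantile for 95% confidence, n - 1 = 9 degrees of freedom
se = sd_change / math.sqrt(n)
lower = mean_change - t_crit * se
upper = mean_change + t_crit * se
print(f"SE = {se:.2f}; 95% CI: ({lower:.1f}, {upper:.1f})")
# prints: SE = 1.45; 95% CI: (1.5, 8.1)
```

The only change from the large-sample version is swapping the multiplier 2 for the t quantile with n - 1 degrees of freedom; statistical software handles this lookup automatically.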
And the average change we're seeing would then be because of the lesser exercise over the three-month period. Or there could be some other reason, something that happened during that three-month period independent of oral contraceptives. So it would have made this a stronger study if we could have had a control group of women who were not given oral contraceptives, but whose blood pressures were measured at the same two time points as the women who got the oral contraceptives, the intervention group. Then not only would we be concerned about the change within each of these two groups, but the ultimate comparison of interest, what's called an unpaired comparison, which we'll get to in a moment, would be the average change for those who got the intervention, oral contraceptives, compared to that change for those who did not get the intervention, those in the control group. So let's look at an example of an unpaired situation. We'll see that conceptually, the approach is the same, but how we deal with the data and estimate our standard error is slightly different. Let's look at our length of stay example from the Heritage Health claims system. We've already explored whether there were observed differences in the average length of stay for those who were greater than 40 years old in our sample, versus those who were less than or equal to 40. And we saw before that there was a rather large mean difference: the mean length of stay for those who were over 40 years old at the time of the study was 4.9 days, compared to 2.7 days among the group that was less than or equal to 40. And we have the standard deviation measurements for the length of stay values in each of those groups as well. So how is this data structured? Well, roughly three quarters of the sample, over 9,000 persons, were greater than 40 years old, and another 3,000 plus were less than or equal to 40 years old.
So you can see right off the bat that the samples are not of the same size, which would be a requirement for them to be paired; we'd see the same number of observations in both samples. What we have are two samples: one with over 9,000 older persons, and another, smaller sample with about 3,000 younger persons. And there's no inherent, known connection between any individual in the older sample and any specific individual in the younger sample. Occasionally, you might get family members, etc., who have both been in the hospital in a given year, but generally speaking, these groups do not have anything to do with each other, and there's no way to connect any one person in the first group to any specific individual in the second. So these are completely independent samples, and that means we cannot boil the data down to a single sample of differences. By "older" here I only mean relative to the comparison; I'm not saying that over 40 is necessarily old, but relatively speaking it is compared to those who are less than or equal to 40. And again, I want to make it clear that these samples are not the same size, which would be perhaps the first condition for them to be paired.
And aside from occasional situations which we're not aware of from the data given, where maybe one of the older persons has a younger family member who was also hospitalized, generally speaking there is no connection between the 9,000 plus observations in the older sample and the 3,000 plus observations in the younger sample. So there is no way to pair these things up: I can't connect any one person with any specific person in the other sample, and as such, I can't boil this data down to a single sample of differences. It's not possible. So I'm going to have to work with the information collected separately on the two samples. The good news is that I can still compute a mean difference. (And I should note that with paired data, regardless of whether we compute the mean difference before or after creating the sample of differences, we get the same result.) Here we create the mean difference by taking the means of the two samples: the mean length of stay for the greater-than-40-year-old group was 4.9 days, and 2.7 days for the younger group, for a difference of 2.2 days. But in order to estimate the standard error here, we can't boil these data down to one sample of differences and just take the standard deviation of the differences divided by the square root of a common sample size. So the formula is slightly different: the standard error of the difference in means is a combination of the standard errors of each of the two independent sample means. We take the sample standard deviation of the values in the first sample, square it, and divide it by that sample size, and then add that to the square of the standard deviation of the values in the second sample, divided by its sample size. And if you look at this, you could re-express the first piece as s1 over the square root of n1, squared; that's just the standard error of the first sample mean, squared.
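This standard error combination, finished off by taking the square root of the summed terms, can be sketched as follows. The means and rough sample sizes are from the length-of-stay example; the two standard deviations are hypothetical stand-ins, since the lecture doesn't state them, so the resulting numbers are for illustration only.

```python
import math

# Unpaired 95% CI for a difference in means (length-of-stay example)
mean1, sd1, n1 = 4.9, 5.0, 9000   # age > 40 group; sd1 is a hypothetical value
mean2, sd2, n2 = 2.7, 3.0, 3000   # age <= 40 group; sd2 is a hypothetical value

# combine the two independent standard errors: sqrt(s1^2/n1 + s2^2/n2)
se_diff = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
diff = mean1 - mean2
lower = diff - 2 * se_diff
upper = diff + 2 * se_diff
print(f"diff = {diff:.1f} days; SE = {se_diff:.3f}; 95% CI: ({lower:.2f}, {upper:.2f})")
```

With these assumed standard deviations the standard error comes out near the 0.08 days discussed below, but the exact value depends on the actual sample standard deviations.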
And we add that to the standard error of the second sample mean, squared, and then take the square root of that sum. Something we can talk about further in one of the course discussion forums or live talks is: even though we're taking a difference, why is the uncertainty added? I want you to think about that. So if we apply this formula and go through the math, the standard error we ultimately get is about 0.08 days. So our margin of error on this mean difference is plus or minus 2 times 0.08 days: we can estimate the true mean difference, based on a study of this size, to within plus or minus 0.16 days. If we do that, we get a confidence interval for the difference of 2.04 days to 2.36 days. Clearly this does not include zero. So what we see is a statistically significantly larger length of stay among the older patients compared to the younger, and it's on the order of two-plus days. Even though there's a little bit of variability in these values, we can see that even on the lower end, the difference exceeds two days. But here's an interesting question: what if we wanted to compare more than two groups? What would we do in terms of confidence intervals? Well, the approach would be generally the same as what we've done with mean differences between two groups. We'd designate one group as the reference group, create mean differences for each of the other groups compared to that same reference, and put confidence limits on each. So this was a study from JAMA Psychiatry. The abstract says the importance of the study is that bright light therapy is an evidence-based treatment for seasonal depression, but there is limited evidence for its efficacy in nonseasonal major depressive disorder (MDD).
And the objective of this study was to determine the efficacy of light treatment, in monotherapy and in combination with fluoxetine hydrochloride, compared with a sham placebo condition in adults with nonseasonal major depressive disorder. So this is a four-armed study. It was a randomized, double-blind, placebo- and sham-controlled eight-week trial in adults with MDD of at least moderate severity in outpatient psychiatry clinics. Patients were randomly assigned to one of four groups: light monotherapy, antidepressant monotherapy, a combination of light and antidepressant, or a placebo. The article describes each of these parenthetically, but the point is that there are four groups, and of interest was the change on the Montgomery-Åsberg Depression Rating Scale from baseline to the eight-week endpoint. So if you think about the study design here, it is an unpaired study, but the outcome is measured on each patient before and after the respective treatment. So within each treatment group, we get a measure that's paired, in the sense that we compute, for each of the four treatment groups, the change after minus before. So we get four mean differences, one for each of the four treatment groups. But we're not interested per se in asking how any one treatment group changed. We're interested in comparing these changes across the four randomization groups. So our comparison of interest is ultimately unpaired, because we're comparing results across four groups that are not connected: the four randomized treatment groups. And I thought this was an interesting graphic. There's a lot going on, but they measured the depression scores intermittently over the eight-week period.
And this graphic shows the change from baseline over time for each of these four groups, and you can see that all four groups experienced a decrease in depression scores over time, but two of the groups had more of a decrease than the other two, and the group with the best outcomes was the combination group. What they put on here are not confidence intervals but standard error bars, so we can't make a back-of-the-envelope comparison by comparing confidence intervals for the resulting average depression scores across the four groups, because these aren't full confidence intervals. But we can report the resulting mean change in depression scores, and 95% confidence intervals for each treatment group compared to the same reference, the placebo. So the placebo group experienced a decrease on the order of 6.9 points on that depression scale. The fluoxetine group experienced a decrease of 8.8 points, a slightly larger decrease, so the difference in decreases for fluoxetine compared to placebo is -1.9, indicating a 1.9-point greater decrease for the fluoxetine group; they went down by more than the placebo. If we compare the light group to the placebo, they went down by even more, and the difference in the resulting changes was -6.5, indicating 6.5 more points of decrease on average for the light group than for the placebo group. And the combination of the two experienced the greatest decrease in general, -16.9 points on average for participants in that group, and the difference between their decrease and the decrease of the placebo group was 10 points; it is negative because they experienced more of a decrease. And we have the confidence intervals for each of these changes.
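To make the reference-group arithmetic concrete, here is a sketch of forming each treatment-versus-placebo difference in mean change. The mean changes are the ones quoted above; the standard error assumed for each difference is a hypothetical placeholder (the paper reports its own intervals), chosen only so the intervals mirror the qualitative pattern described next.

```python
# Mean change in depression score (baseline to week 8) per arm, from the text
mean_change = {
    "placebo": -6.9,
    "fluoxetine": -8.8,
    "light": -13.4,        # placebo change plus the -6.5 difference
    "combination": -16.9,
}
se_assumed = 1.5  # hypothetical SE for each difference (illustration only)

ref = mean_change["placebo"]
for group in ("fluoxetine", "light", "combination"):
    d = mean_change[group] - ref
    lower, upper = d - 2 * se_assumed, d + 2 * se_assumed
    verdict = "excludes 0" if (lower > 0 or upper < 0) else "includes 0"
    print(f"{group} vs placebo: {d:.1f} (95% CI {lower:.1f} to {upper:.1f}; {verdict})")
```

Each comparison uses the same placebo reference, which is exactly the reference-group strategy described for comparing more than two groups.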
And we can see that while the difference between the fluoxetine group and the placebo is not statistically significant, because that confidence interval includes zero, the other two groups show a statistically significantly greater decrease, if you will, in depression score than the placebo group. Since this last example on nonseasonal depression is a randomized trial, we can cleanly conclude that the resulting differences are because of the investigator-allocated interventions. Or, if there were no differences shown, we could walk away and conclude that the intervention was ineffective, and that the true difference was not being confounded or masked by other characteristics differing between the subjects in the four groups, since they were randomized. In non-randomized group comparisons, we can still do the same statistics, but the substantive interpretation will have to be done with the knowledge that other factors may also be confounding, or may be part of the reason for the association or non-association we find. So we'll keep that in mind moving forward. And again, just like we saw for single confidence intervals for single quantities, for smaller samples slight corrections will need to be made to the number of estimated standard errors added and subtracted to get 95% coverage, but a computer will handle that as well. So in general, 95% and other-level confidence intervals can relatively easily be estimated for mean differences between two populations, for both paired and unpaired study designs. For paired studies, the two samples can be transformed into a single sample of differences, and the standard error of the mean difference is a function of the standard deviation of the differences and the number of pairs. For unpaired studies, the two samples cannot be combined, and the standard error is computed slightly differently.
But in either case, paired or unpaired, the resulting 95% confidence interval is the difference in means between the two groups, which we could call x bar diff, plus or minus 2 estimated standard errors of that difference. Only the standard error computation depends on the study design. I'll debrief on this with an example in the additional examples part of this lecture set. The only difference we've seen between paired and unpaired studies is how we compute the standard error of the mean difference. And you might be thinking, well, what if I ignored the pairing in a paired study and just computed the standard error of the mean difference using the approach we used for unpaired comparisons? We could certainly do that mathematically. But I'll show you in the additional examples that if we do so, and ignore the pairing in our data, we may end up overestimating the standard error of our mean difference, because we are double counting some of the uncertainty that's shared between the paired measures.
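As a preview of that additional-examples point, here is a small simulation (all numbers invented) showing the effect: when paired measurements share a per-subject component, the correct paired standard error is noticeably smaller than the naive unpaired one.

```python
import math
import random

random.seed(1)
n = 500
# Each pair shares a subject-level component, inducing strong correlation
subject = [random.gauss(10, 3) for _ in range(n)]
x = [s + random.gauss(0, 1) for s in subject]        # first measurement
y = [s + 1 + random.gauss(0, 1) for s in subject]    # second measurement

def sd(v):
    """Sample standard deviation (n - 1 denominator)."""
    m = sum(v) / len(v)
    return math.sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))

diffs = [yi - xi for xi, yi in zip(x, y)]
se_paired = sd(diffs) / math.sqrt(n)                   # uses the pairing
se_unpaired = math.sqrt(sd(x)**2 / n + sd(y)**2 / n)   # ignores the pairing
print(f"paired SE = {se_paired:.3f}, naive unpaired SE = {se_unpaired:.3f}")
```

Taking differences within pairs cancels out the shared subject-level variability, which is why the paired standard error comes out smaller; the unpaired formula counts that shared variability twice.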