Now we'll look at comparing means between two populations in the unpaired situation, and we'll see it's exactly the same as the paired situation except for how we estimate the standard error for the difference in sample means; we've already seen that difference before when we looked at confidence intervals for mean differences. So in this lecture set, you will learn how to estimate and interpret a p-value for a hypothesis test of a mean difference between two populations for the unpaired, or two independent groups, study design. The method for getting the p-value is called the unpaired t-test or the two-sample t-test. It's called unpaired because of the study design, and it's called a t-test because sometimes the sampling distribution of the estimated mean difference is a t-distribution. Conceptually, the approach is exactly the same as with the paired test. Computationally, the only difference is how the standard error for the sample mean difference is computed. So, let's look at our example with length of stay by age at first claim from the Heritage Health data; we've already seen this both in terms of estimating the mean difference and putting confidence limits on it. When we broke that sample of patients out by whether they were greater than 40 years old or less than or equal to 40 years old in the year they were hospitalized, the mean length of stay was 4.9 days for those greater than 40 years old compared to 2.7 days for those 40 or younger. So, a mean difference of 2.2 days greater for the older group, and the 95 percent confidence interval for that difference went from 2.04 days to 2.36 days. The result was statistically significant, in that the confidence interval for the mean difference did not include the null value of zero. So, we already know in advance that the confidence interval did not include zero.
So, we're going to get a statistically significant result with the hypothesis test as well, meaning our p-value will come in at less than 0.05, but let's go through and see what the p-value is exactly. We're going to set up the two competing hypotheses just like we did before, and they're the same; it's just the study design that's different. In this context, the null hypothesis is that the mean length of stay for those greater than 40 years old is equal to the mean length of stay for those less than or equal to 40 years old in their year of visitation at the hospital. We can re-express this as: the mean difference in length of stay between the two populations defined by age is zero. We can shorthand that by saying mu diff is equal to zero, and the alternative for these comparisons is that the means are not equal and the mean difference is not zero. So, as before, we're going to assume the null is true, that the means are equal and the mean difference is zero, and figure out how far our observed mean difference of 2.2 days is from the expected mean difference of zero under the assumption that the null hypothesis is true, and we're going to measure that distance in standard errors. We're going to take our observed mean difference, figure out how far it is from zero, and divide that distance by the standard error, which is measured in days here. If we do this, we start with a difference of 2.2 days, as I said before. Recall that for unpaired samples we can't combine the two samples into a single sample of differences, so the standard error of the mean difference is a combination of the uncertainty in each of the two sample means. We computed this before and it turns out to be 0.08 days, a very small standard error because our sample sizes are relatively large.
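To make that arithmetic concrete, here is a minimal sketch in Python (standing in for the R used elsewhere in this lecture) of the distance-in-standard-errors computation. The `unpaired_se` helper just illustrates the general form of how the two samples' uncertainties combine; since the per-group standard deviations and sample sizes aren't restated here, the lecture's reported standard error of 0.08 days is plugged in directly.

```python
import math

# General form of the standard error of a mean difference for two
# independent samples: sqrt(s1^2/n1 + s2^2/n2), combining the
# uncertainty in each of the two sample means.
def unpaired_se(s1, n1, s2, n2):
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

# Heritage Health summary numbers from the lecture: an observed mean
# difference of 2.2 days and a reported standard error of 0.08 days
# (the per-group SDs and sample sizes are not restated here).
mean_diff = 2.2
se = 0.08
z = (mean_diff - 0) / se  # distance from the null value of 0, in SEs
print(round(z, 1))        # 27.5
```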
We get a result that is 27.5 standard errors above zero, the true mean difference we'd expect under the null hypothesis. So again, we get a result that's off the charts here. We'd expect most of the results we could get by chance, most of the estimates of this null difference of zero, to be close to zero, within plus or minus two standard errors, because the central limit theorem tells us the sampling distribution is approximately normal. So, if we were to replicate this study over and over again with the same number of patients in the under-40 and over-40 groups and look at the distribution of sample mean differences, we'd expect them to cluster closely around the truth. We're assuming the truth is zero, and we get a result that is 27.5 standard errors above what we'd expect under the null hypothesis. I can't even draw this; it's way out in the tails. So we know it's more than two standard errors away, meaning the p-value is going to be less than 0.05, but it's way less than 0.05. We can do this using the pnorm function in R. If I do pnorm(27.5), it will give me the proportion of observations that are less than 27.5 standard errors above zero, everything to the left. I want the proportion that are more than 27.5 standard errors above zero, this tail here. So, I'm going to take one minus pnorm(27.5), and because I want the cumulative probability in both tails, results that are as far or farther than 27.5 standard errors from zero in either direction, I'll multiply that by two, and I get something on the order of 1.76 times 10 to the negative 166 power. This is essentially zero. If I wanted to make this easier, remember, I'm looking at the chances of being as far or farther than 27.5 standard errors away in either direction. If I didn't want to subtract my pnorm result from one, I could put in pnorm(-27.5) to get the proportion that are 27.5 or more standard errors below zero, and multiply that by two to get the same result.
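One numerical caveat if you want to reproduce this tail calculation yourself: for a statistic this extreme, literally computing 1 - pnorm(27.5) in double precision rounds to zero, which is exactly why the pnorm(-27.5) form is the safer one. Here's a sketch of the same two-sided p-value in Python (a stand-in for R), using the standard library's erfc, which is a numerically stable route to far-tail normal probabilities:

```python
import math

def two_sided_p(z):
    # Equivalent to 2 * pnorm(-abs(z)) in R: the probability of landing
    # as far or farther than |z| standard errors from zero, in either
    # direction, under a standard normal sampling distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))

print(two_sided_p(27.5))   # on the order of 1.8e-166: essentially zero
print(two_sided_p(-27.5))  # identical: the direction of comparison is irrelevant
```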
So, the resulting p-value is very, very small. How do we interpret that? What does it measure? Remember, it's computed under the assumption that the null hypothesis is true. So, if there were no difference in the population mean length of stay between persons older than 40 years and persons 40 years or younger, then the chance of observing a mean difference of 2.2 days or something even more extreme (and remember, more extreme means less likely) is extremely small for a study of our size. Very low chance of getting that result. So, we have to make a decision; our resulting p-value is way less than 0.05. In terms of choosing between the null hypothesis and the alternative, we'd say our result is very inconsistent with what we'd expect to get under the null that we assumed to be true for calculating the p-value, so we reject the null in favor of the alternative. We have ruled out no difference in length of stay between the two age groups as a possibility for the underlying truth. Really, to be technical, we've ruled out no difference in average length of stay; but if we think of these distributions as having similar shapes, then if the averages were the same, the distributions of length of stay would be very similar at the individual level as well. Our decision here is totally consistent with our 95 percent confidence interval, which did not include the null value of zero. When we saw the confidence interval, we knew the result was statistically significant at the 0.05 level, but we just didn't know what the p-value was; all we could say from the confidence interval was that it's less than 0.05. Now, having gone through and done the hypothesis test, we can say it's a lot less than 0.05. Again, just as with paired differences, the same goes for unpaired differences: the p-value is completely invariant to the direction of comparison.
If we computed the mean difference in the opposite direction, comparing those less than or equal to 40 years old to those greater than 40 years old, the difference would be negative 2.2 days as opposed to positive 2.2. The standard error is still the same for this difference, 0.08 days. So, in this direction, our result is 27.5 standard errors below zero instead of above zero. Again, zero is what the population mean difference is assumed to be under the null hypothesis. Either way, we compute a p-value based on the result being as far or farther than 27.5 standard errors from zero in either direction. So it doesn't matter whether our result is 27.5 standard errors above zero in one direction or 27.5 standard errors below zero in the other; we'll still get the same p-value. Let's look at another example, a low-carbohydrate as compared with a low-fat diet in severe obesity, an example we know well. This is a study where 132 severely obese subjects were randomized to one of two diet groups and followed for a six-month period. According to the study authors (I'm pulling this from their manuscript), subjects on the low-carbohydrate diet lost more weight than those on the low-fat diet, with a 95 percent confidence interval for the difference in weight loss between groups of negative 6.2 to negative 1.6 kilograms, p less than 0.01. So we see that zero is not in the confidence interval, and we know from that that the corresponding p-value will be less than 0.05. Here, they're saying it's actually less than 0.01, even smaller than just less than 0.05. Their scientific question was: is diet type associated with weight change? Here are the summary statistics on these two samples. Those in the low-carb group, you may recall, lost 5.7 kilograms on average; those in the low-fat group lost 1.8 kilograms on average.
Both groups lost weight, at least in the samples, but the question is whether the difference in weight loss is real, or statistically significant. So, we set up our competing hypotheses. Mu here represents the mean weight change, and the null is that the mean weight change for those who got the low-carb diet is equal to the mean weight change for those who got the low-fat diet, or that the difference in means is zero. Again, we could reverse the direction: instead of mu low-carb minus mu low-fat, mu low-fat minus mu low-carb. If the means are equal, the difference is zero regardless of direction. The alternative is that the mean difference is not zero. So, we're again going to assume that at the population level the mean difference is zero: there's no association between weight change and diet type, and the mean weight change is the same regardless of diet type. We'll figure out how far our observed mean difference is from the expected mean difference of zero in terms of standard errors. Again, the distance measure, just like the z-scores we computed at the beginning of the course, is our observed mean difference divided by its standard error. You may recall the formula for the standard error of this mean difference; it works out to 1.17 kilograms. In this direction, we get a difference of negative 3.9 kilograms, meaning the low-carb group lost 3.9 kilograms more than the low-fat group, which is 3.3 standard errors below zero. And again, zero is the true mean difference under the null hypothesis, the true difference in average weight change. So, let's translate this into a p-value. With more than 60 subjects in each group, we can compare this to a normal curve, and we've got something that's more than three standard errors below the mean of zero.
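The diet-study arithmetic can be sketched the same way (again in Python as a stand-in for R; the inputs are the lecture's summary statistics). One small caveat: carrying the unrounded z = -3.9/1.17 ≈ -3.33, rather than the rounded -3.3, shifts the two-sided p-value slightly below the figure quoted in the lecture.

```python
import math

# Diet study summary numbers from the lecture: mean weight change of
# -5.7 kg (low-carb) vs -1.8 kg (low-fat), and a standard error for
# the difference of 1.17 kg.
mean_diff = -5.7 - (-1.8)   # -3.9 kg
se = 1.17
z = mean_diff / se
print(round(z, 1))          # -3.3: about 3.3 SEs below zero

# Two-sided p-value: the analogue of 2 * pnorm(-abs(z)) in R
p = math.erfc(abs(z) / math.sqrt(2.0))
print(p < 0.001)            # True: just under 1 in 1,000
```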
You already know that less than five percent of the observations under the sampling distribution would give a mean difference 3.3 standard errors or more away from the center of the distribution of all possible mean differences when the true mean difference is zero. So, we've translated this into standard errors, and our p-value is the probability of getting a result as far or farther than 3.3 standard errors from the mean of a standard normal curve with mean zero. If we do this using the pnorm command, it's 0.00096, or nearly 0.001. So, a small p-value, and again, we knew the p-value would be less than 0.05 given the confidence interval; here we quantify it more precisely than just saying it's less than 0.05. Interpretation-wise: if there were no difference in the population mean weight change for low-carbohydrate diets compared to low-fat diets, then the chance of observing a mean difference of negative 3.9 kilograms or something more extreme is less than 1 in 1,000. Because this p-value is less than our conventional cutoff of 0.05, or five percent, we'd reject the null hypothesis that the true population mean difference is zero in favor of the alternative that there is a difference. If that's all we knew, that the result was statistically significant with p less than 0.01, we wouldn't have a lot of information. We'd be concluding that the mean weight changes are different between the low-fat and low-carb groups at the population level, but we wouldn't know anything about the magnitude or even the direction of the difference. With the estimated difference of negative 3.9 kilograms and the confidence interval, we have a lot more information. But again, this decision to rule out zero as a possibility is consistent with our 95 percent confidence interval for the true mean difference, which also does not include zero.
Just a note regarding this two-sample t-test. The test I am showing you, with the formula for the standard error, etc., is formally called the two-sample t-test assuming unequal population variances. You may remember that the variance and the standard deviation of the individual values in our samples are related to each other, so another way to think of this is as the two-sample t-test assuming unequal population standard deviations: it assumes that the variability in individual values in the populations we're comparing is not the same. There's another t-test you could do that assumes the underlying population standard deviations are equal. If you were to take the traditional textbook approach, before deciding whether to use the two-sample t-test assuming equal or unequal population standard deviations, you'd have to do another hypothesis test comparing the standard deviations of your two populations based on your sample estimates. Again, we can only estimate the population standard deviations using our sample standard deviations, so we'd have to formally test, based on those results, whether the population-level standard deviations are equal versus not. You'd have to do a hypothesis test before doing a hypothesis test. Well, it turns out that, first of all, the test for comparing standard deviations or variances is not a very good test; it only works under certain conditions. But I'll show you in a minute that if we use the approach for unequal variances, or unequal standard deviations, it's much more robust, and we can use it as our go-to test without having to worry about choosing between the two. The two tests are slightly different.
The test allowing for unequal variances, the one we're using, adjusts the degrees of freedom for the test (that's only an issue in smaller samples, and the computer will generally handle it anyway) and uses a slightly different standard error computation than the other test; the standard error computation is the one I gave you. If you want to be truly safe, I call it the desert-island choice of t-tests: if you can only pack one t-test in your suitcase, and you're going to be hanging out on a desert island for a while, you may as well pack the one for unequal variances. It's more conservative to use the test that allows for unequal variances when the underlying variances or standard deviations are actually the same, but we'll still get a valid result; in general, it makes little to no difference in large samples. So you might say, well, why should I take this approach? Here's what happens. If the underlying population-level standard deviations are equal, both approaches give valid confidence intervals and p-values, but the intervals from the approach assuming unequal standard deviations are slightly wider, and the p-values slightly larger, than they would be from the test that appropriately assumes equal population-level standard deviations. We give up a little bit of precision if the underlying truth is that the population-level standard deviations are the same. But if the underlying population-level standard deviations are not equal, then the approach assuming equal variances breaks down: it does not give valid confidence intervals, it can severely under-cover the goal of 95 percent, and the p-values are way off the mark. The p-values are wrong. So, in terms of robustness, it's not a very robust test in situations where the assumption of equal population standard deviations is not correct.
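To make "adjusted degrees of freedom" a little less mysterious, here is a sketch of the unequal-variance standard error alongside the Welch-Satterthwaite degrees-of-freedom adjustment that the unequal-variance test uses. The sample SDs and sizes below are made up purely for illustration; in practice the software computes all of this for you.

```python
import math

# Hypothetical summary statistics for two independent samples
# (illustration only; not numbers from either study in the lecture).
s1, n1 = 10.0, 25   # sample SD and size, group 1
s2, n2 = 20.0, 25   # sample SD and size, group 2

# Unequal-variance ("desert island") standard error, the one used in
# this lecture.
se_unequal = math.sqrt(s1**2 / n1 + s2**2 / n2)

# Welch-Satterthwaite adjusted degrees of freedom; it always lands
# between min(n1, n2) - 1 and n1 + n2 - 2.
v1, v2 = s1**2 / n1, s2**2 / n2
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(round(se_unequal, 2))  # 4.47
print(round(df, 1))          # 35.3, between 24 and 48 here
```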
So, for a one-size-fits-all, relatively robust test, I suggest using the one we've been using, which assumes unequal variances or standard deviations at the population level. Again, how do we do this? It's exactly the same approach as the paired situation. We set up two competing hypotheses about the unknown population means, and these can be expressed in any one of three ways; all three sets are saying the same thing. Then we assume the null is true and compute how far our observed sample estimate of the difference is from the expected difference of zero under the null. We measure that distance in terms of standard errors, translate it into a p-value, and make a decision. The p-value will be based on the normal distribution, or the t-distribution with smaller samples. Generally, you won't want to do these things by hand, and by hand I mean using R to compute a p-value from summary statistics; generally, you can feed the data to the computer and it will handle the details. The p-value measures the chance of getting the study results, or something even less likely or more extreme, when the samples are assumed to come from populations with the same mean. The p-value from a two-sample t-test (this is also called the two-sided p-value, as was the case in the paired situation; we'll debrief on that in the next lecture section) is invariant to the direction of comparison. The only difference between the paired and the unpaired, or two-sample, t-test is how the standard error of the sample mean difference is computed. So in the next section of this lecture set, we'll debrief a little bit on the p-value, talk about what it is and what it isn't. Then in the next lecture set, we'll look at hypothesis tests for comparing proportions and incidence rates between populations.
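To gather the whole recipe from this section in one place, here is an end-to-end sketch from summary statistics to a two-sided p-value, using the large-sample normal approximation (with small samples you'd swap in a t reference distribution, as noted above). The group SDs and sample sizes passed in at the bottom are hypothetical, paired with the lecture's observed means of 4.9 and 2.7 days, just to exercise the function.

```python
import math

def unpaired_test(mean1, s1, n1, mean2, s2, n2):
    """Unpaired test of a mean difference from summary statistics:
    returns (difference, standard error, z, two-sided p).
    Large-sample sketch using the normal approximation."""
    diff = mean1 - mean2
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # unequal-variance SE
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # 2 * pnorm(-abs(z)) in R
    return diff, se, z, p

# Hypothetical SDs (2.0, 1.5) and sizes (2500 each) with the lecture's
# observed group means:
diff, se, z, p = unpaired_test(4.9, 2.0, 2500, 2.7, 1.5, 2500)
print(round(diff, 1), p < 0.05)  # 2.2-day difference; significant
```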