A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

From the lesson

Module 4B: Making Group Comparisons: The Hypothesis Testing Approach

Module 4B extends the hypothesis tests for two-population comparisons to "omnibus" tests for comparing means, proportions, or incidence rates between more than two populations with one test.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So, in the next sections, we'll look at the roles that sample size, power, and the detectable difference of interest have on each other when designing a study. And I want you to understand the relationships between these, and what affects what in which direction. I'll be showing you results from the computer software I use, the statistical package Stata, to estimate sample sizes and power under a variety of conditions. And I don't expect you to be able to replicate these results with Stata or by hand, but I want you to appreciate the influence each factor has on the others. However, I am going to point you in the direction of a free application that you can download, which is a sample size and power calculator. And it may be of interest to you to download this and play around with the inputs, just so you can get further reinforcement of the relationship between these different quantities and the roles they have on each other.

So, in this last lecture set for Statistical Reasoning 1, we're going to talk about designing studies to have a desired power, and the ideas are similar to what we did in Lecture 12. But this approach is more commonly used for studies that are designed to compare populations.

So in this set of lectures, the relationship between sample sizes and precision will be re-expressed through the window of study power.

It is more common to design a study to have a certain level of power, 80 or 90 percent, than for a desired margin of error, especially when we're comparing populations, but the approach is analogous. In these lecture sets, power and its influences will be explored, and some examples of designing a study to achieve a certain power level will be given.

So, this has a very sexy title, Power and Its Influences. But I'm only [LAUGH] allowed, or qualified perhaps, to talk about statistical power and its influences, not power in general. So let me give you an example of a study with low power, and we'll refocus on the idea of power in this lecture. So consider the following results from a small study done on 29 women, all between the ages of 35 and 39 years old. So a random sample of 29 women was taken from a clinical population, and then the women were classified as to whether they were currently using oral contraceptives or not at the time. And so eight of the 29 women were using oral contraceptives at the time, as compared to 21 who were not. And the researchers measured their blood pressures, and then wanted to make a comparison between the blood pressures of those on oral contraceptives at the time and those who were not using oral contraceptives. And so here are summary statistics on each of the two samples. The average blood pressure among the oral contraceptive users was 132.8 millimeters of mercury, as compared to 127.4 in the other group, and there are estimates of the standard deviations based on the sample results.

So, what the researchers were particularly interested in looking at with the study is whether oral contraceptive use is associated with higher blood pressure.

So, statistically speaking, the researchers were interested in testing the null hypothesis that the underlying population-level mean blood pressures for women using oral contraceptives and women not using them are equal, versus the alternative that they're not. Or, as we like to express things in terms of differences, the null is that the mean difference at the population level is zero, versus the alternative that it's different from zero.

And so again here are the study results, and ultimately what came out of this was that the sample mean difference in blood pressures is 5.4 millimeters of mercury: 5.4 millimeters of mercury higher, on average, for the women who are on oral contraceptives. That's a sizable difference, but, of course, this is based on very small samples, and if you look at the 95% confidence interval, you see it goes from negative 8.9 millimeters of mercury all the way up to 19.7, so that's very wide and inconclusive. It includes the null value zero, and the corresponding p-value is 0.43. So the decision here, in hypothesis testing, would be the ambiguous fail to reject the null, and it's especially ambiguous because this study is small.
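
To see how these reported numbers hang together, here is a minimal Python sketch that backs the standard error out of the reported confidence interval and recomputes an approximate p-value. It assumes the interval was built as estimate plus or minus a t multiplier times the standard error with 8 + 21 - 2 = 27 degrees of freedom (multiplier about 2.052), which the lecture does not state, and it uses a normal approximation for the p-value, so the result should only come out close to the reported 0.43, not identical to it.

```python
from statistics import NormalDist

# Values reported in the lecture
diff = 5.4                    # sample mean difference (mmHg)
ci_low, ci_high = -8.9, 19.7  # reported 95% confidence interval (mmHg)

# Assumption (not stated in the lecture): interval = estimate +/- t * SE
# with 27 degrees of freedom, where the 97.5th percentile of t is about 2.052.
T_CRIT_27DF = 2.052

se = (ci_high - ci_low) / (2 * T_CRIT_27DF)  # implied standard error
z = diff / se                                # how many SEs the estimate is from zero

# Normal approximation to the two-sided p-value
p_approx = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"implied SE = {se:.2f} mmHg, z = {z:.2f}, approximate p = {p_approx:.2f}")
```

The estimate sits less than one standard error from zero, which is why the interval comfortably covers the null value.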

And it's not clear whether this is because the null is true, or because there was so much uncertainty in the data that we couldn't see differences based on such a small sample size.

So suppose you, as a researcher, were concerned about detecting a population-level difference of this magnitude, on the order of five millimeters of mercury on average, if it truly existed. Well, this particular study of 29 women had low power to detect a difference of such magnitude. Its chances of detecting a difference of at least 5.4 millimeters of mercury, were that really the truth at the population level, were low.

So just to remind us what power is, and I just sort of alluded to it in the last slide. Recall the table comparing the underlying truth that we can't observe to the decisions made by hypothesis testing, so we're very familiar with this first situation. If the null is true, but we end up deciding to reject the null, we've done the wrong thing. We've made a Type 1 error, and the chance of doing that is the alpha level of the test, the level we are willing to tolerate for that.

However, if we reject the null when the alternative is true, that's a good thing, and the chance of doing that for some alternative, for a given study, is called the power of the study. For a study of a given sample size, what are the chances of finding a significant difference for some specified alternative difference value? So power is a measure of doing the right thing when the alternative hypothesis is the truth that generated our samples. And certainly, for a study, higher power is better, but it comes at a cost. So why is higher power better? Well, a higher-powered study, going into it, has been designed such that if there is a real difference at the population level, a study of that size has a good opportunity to see it. And if we end up failing to reject the null with a high-powered study, it's much clearer to go with the idea of the null being the underlying truth, because there's not that ambiguity about our ability to find a difference if it really existed.

So when a study with low power finds a non-statistically significant result, it is hard to interpret this result. As I said before, it's ambiguous: we don't know whether we failed to reject the null because the null is true, or because we just had so much noise or uncertainty in our data that we couldn't see differences. When a study has high power, a non-statistically significant result can be interpreted more confidently as no association, which is an important finding in research. So, just to give you an example of the study power for the one we just looked at, the oral contraceptive blood pressure study has a power of 13% to detect a difference in blood pressure of 5.4 millimeters of mercury or more between the oral contraceptive users and the non-users, if the difference truly exists in the population of women that were sampled. So, in other words, of studies based on only 29 women from this population, only a little more than one in ten, or 13%, would actually pick up a difference at the population level of 5.4 or more if that's really the truth.
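
A rough check of that figure can be sketched in Python with a normal approximation. The standard error is not given in the lecture, so the sketch assumes the value implied by the reported 95% interval under a t multiplier of about 2.052. The function name `approx_power` is made up for illustration, and Stata's calculation may differ in detail, so expect a number a little above one in ten, not an exact match to 13%.

```python
from statistics import NormalDist

def approx_power(true_diff, se, alpha=0.05):
    """Approximate two-sided power: the chance that an estimate drawn from a
    normal sampling distribution centered at true_diff (standard error se)
    falls in the rejection region of a level-alpha test of 'difference = 0'."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    shift = true_diff / se                # true difference in SE units
    upper = 1 - std.cdf(z_crit - shift)   # reject because estimate is far above 0
    lower = std.cdf(-z_crit - shift)      # reject because estimate is far below 0
    return upper + lower

# Assumption: SE implied by the reported interval (-8.9, 19.7) with t = 2.052
se_implied = (19.7 - (-8.9)) / (2 * 2.052)

power = approx_power(5.4, se_implied)
print(f"approximate power to detect 5.4 mmHg: {power:.2f}")
```

When the true difference is set to zero, the same function returns the alpha level, which is a useful sanity check on the logic.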

So this means that our study had very low opportunity to see a difference of perhaps substantive interest, if it were really the case at the population level. So where does power come from? How do we actually compute it?

Well, I'll give you the idea behind it. So recall something we worked on a lot in this course: the sampling behavior of estimates comparing two samples (mean differences or risk differences), or of the log of estimates comparing two samples when they are ratios, is normally distributed in large samples, with the sampling distribution centered at the true difference of interest.

So whether it be a mean difference, a difference in proportions, et cetera, our null hypothesis with regards to these differences is that the difference is zero. If the null is the truth, then this sampling distribution curve is centered at the truth of zero.

So for designing a study to have a certain power, or estimating the power of a completed study, we have to be specific about the value of our alternative, and this is where it gets a little trickier. When we do hypothesis testing, our alternative is very vague: it's just that the difference of interest is not zero. But in order to actually compute power, or design a study to have a certain power, we have to be more specific and talk about alternatives that we'd be interested in seeing, and put a lower bound on the difference that would be substantively interesting.

So if we're doing hypothesis testing comparing two groups, as we know, the null and alternative are no difference between the populations from which the two samples came, versus the very vague there is a difference. So in order to actually look at the power of an existing study, or design a study to have a certain power, we have to actually get specific about a minimum difference that we are interested in seeing as a researcher. So for example, in the blood pressure oral contraceptives study, I may not be interested if the difference between the blood pressure of oral contraceptive users and that of non-users is on the order of half a millimeter of mercury, or one millimeter of mercury, because that isn't very clinically significant. It's a small shift up. I may only be interested in seeing differences if they are at least on the order of four or five millimeters of mercury. So I don't want to actually spend the resources to see smaller differences. There's a minimum idea of what would be scientifically interesting. And I have to specify that in order to look at the power of a study or design a new study to have a certain power.

So let's just talk, briefly. I'm going to draw some cartoons, and then I'll animate them, so that you don't have to suffer through my drawing skills throughout this entire lecture. So let's look at what we know about sampling distributions. We know that for most of the differences we can look at, like a mean difference or a difference in proportions or the log of a relative risk, et cetera, if we looked at them across multiple studies of the same size, the estimates, if we plotted them in a histogram, would be normally distributed and centered at the truth. And if our samples come from populations with equal measures, where the difference is zero, if the null is the truth, then this sampling distribution will be centered at zero. So I'm just going to redraw that: under the null, the sampling distribution of our estimates will have some variability, but it'll be around zero. However, suppose there's another truth out there: the null is not actually the truth, the alternative is true that there's some difference, and now we're going to specify what some difference could mean. We'll say there's some difference, and we'll call it d. d could be one millimeter of mercury, or 10%, or some specific number. Then if that were the case, what we'd really have, behind the scenes, is that the alternative is true, and the sampling behavior of our estimate would still be normally distributed. It'd be the same curve here, but it would be centered at d.

So, what does this mean for us? We are going to make a decision, to reject the null or not, based on this first curve. We're assuming the null is true. This black curve describes the sampling behavior of our estimates under the null, and we're going to make a decision to reject at the five percent level if our estimate from our study, comparing the two samples from the two populations, is more than two standard errors from our null value of zero.

So that, if it's not more than two standard errors away we will not reject, but if it is more than two standard errors away we will reject.

So, what is power? Well, suppose our data actually come from populations with a difference at least as large as d, one we'd specify in advance, and we'll get to where that comes from. Then our power is the probability of rejecting based on this black curve, the distribution under the null, when in fact this curve describes our sampling behavior because the alternative is true. So this area in here is the probability of getting a result that's more than two standard errors away from zero when the samples come from populations where the actual difference in the quantities being compared is this alternative value d.
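
The "area under the alternative curve" idea can also be seen by brute force. The following Python sketch imagines running the same study many times, draws each study's estimate from a normal sampling distribution centered at the true difference, and counts how often the estimate lands beyond the rejection cutoff set by the null curve. The standard error of 7.0 mmHg is a hypothetical value chosen for illustration, as are the function name and the number of simulated studies.

```python
import random
from statistics import NormalDist

def simulated_power(true_diff, se, alpha=0.05, n_studies=100_000, seed=1):
    """Fraction of simulated studies whose estimate falls in the rejection
    region defined under the null (more than z_crit standard errors from 0)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96, "two standard errors"
    cutoff = z_crit * se
    rejections = 0
    for _ in range(n_studies):
        estimate = rng.gauss(true_diff, se)  # one study's estimated difference
        if abs(estimate) > cutoff:           # decision rule uses the null curve only
            rejections += 1
    return rejections / n_studies

SE = 7.0  # hypothetical standard error in mmHg, for illustration only

print("truth = 0    :", simulated_power(0.0, SE))   # rejection rate near alpha
print("truth = 5.4  :", simulated_power(5.4, SE))   # low power
print("truth = 20.0 :", simulated_power(20.0, SE))  # high power
```

When the truth is zero, the rejection rate is just the Type 1 error rate. As the true difference moves away from zero, more of the alternative curve sits in the rejection region and the rejection rate, which is the power, climbs.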

Suppose we have designed a study, looked at what the power was under that design, and decided we want the power to be larger. What could the researcher do to make the power larger? In other words, what could the researcher do to increase this area?

Well, one thing they could do, and I'll just click this slide here to try and animate it, is actually make the expected difference larger, that is, make the alternative hypothesis value bigger. Make it further from zero, which makes it more likely,

Another thing the researcher could do, if they didn't want to mess with the difference and make it larger, because they already had it about as large as it could be and making it any larger would risk missing some differences of interest, is increase the sample size in each group. And what effect would that have? Well, that would reduce the uncertainty in the estimates, and it would make those curves tighter, and so they'd be easier to distinguish between, and the blue area, or the power, would increase because of the decreased standard error around our estimates.
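
Both of these levers, a larger difference and a larger sample, can be tried out in a short Python sketch. It uses a normal approximation for comparing two means with equal group sizes, where the standard error is sigma times the square root of 2/n. The common standard deviation of 17 mmHg is a hypothetical value (the lecture does not state one), and `power_two_groups` is a name made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_two_groups(true_diff, sigma, n_per_group, alpha=0.05):
    """Approximate two-sided power for comparing two means with equal group
    sizes and a common standard deviation sigma (normal approximation)."""
    std = NormalDist()
    se = sigma * sqrt(2 / n_per_group)    # SE of the difference in sample means
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = true_diff / se
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)

SIGMA = 17.0  # hypothetical common SD in mmHg, for illustration only

print("Larger samples, same difference (5.4 mmHg):")
for n in (15, 50, 100, 200):
    print(f"  n per group = {n:3d}  power = {power_two_groups(5.4, SIGMA, n):.2f}")

print("Larger difference, same sample size (50 per group):")
for d in (2.0, 5.4, 10.0):
    print(f"  difference = {d:4.1f}  power = {power_two_groups(d, SIGMA, 50):.2f}")
```

More subjects shrink the standard error and tighten both curves; a larger difference slides the alternative curve further from zero. Either way, more of the alternative curve ends up in the rejection region.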

The last thing that a researcher could do is make it easier to reject: increase the alpha level of the hypothesis test. So here is our picture, with the five percent rejection level. But if we increase that, what we're going to do is increase the region under the black curve where we would reject the null. And that's going to, in turn, increase the proportion, or chances, of doing so under this blue curve. Now practically speaking, what do you think is acceptable versus not in the world of research? It's okay to toy around with the difference. It's okay to play around with the sample sizes. But changing the alpha level is not practical, because most funding agencies would not accept a study designed to have power with a rejection level of greater than five percent. And consequently, most journals will not be happy with papers that are submitted under that design, based on a rejection level of greater than five percent.
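
Purely to illustrate the mechanics, and not as a recommended design choice, this Python sketch holds the true difference and the standard error fixed and varies only the alpha level. The standard error of 7.0 mmHg is a hypothetical value, and `power_at_alpha` is a name made up for illustration.

```python
from statistics import NormalDist

def power_at_alpha(true_diff, se, alpha):
    """Approximate two-sided power (normal approximation) at a given alpha."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)  # a larger alpha gives a smaller cutoff
    shift = true_diff / se
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)

SE = 7.0  # hypothetical standard error in mmHg, for illustration only

for alpha in (0.01, 0.05, 0.10, 0.20):
    print(f"alpha = {alpha:.2f}  power = {power_at_alpha(5.4, SE, alpha):.2f}")
```

The power gained this way is paid for with a higher chance of a Type 1 error, which is why this lever is the one that funders and journals will not accept.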

So for example, power is sometimes computed after the fact for smaller studies, to try and understand why a non-statistically significant difference was found, and to see whether low power is an issue, which might open up an opportunity for someone to build on the research and do a larger study. But power can only be computed for specific alternative hypotheses.

For example, with population mean differences: this study had X percent power to detect a difference in population means of Y or greater. So in order to compute the power of a study that's been done, one would have to specify the minimum difference, in the measure of interest between the two populations, that the study was trying to detect. And so you'll sometimes see this presented as an excuse for non-statistically significant findings if the power is low. So the lack of a statistically significant association between A and B could be because of low power. Maybe the study had less than 15% power to detect a mean difference of Y or greater, or a difference in proportions, or whatever measure is used.

It can also be presented to corroborate a non-statistically significant result. In other words, to try and understand what the reasons for that may be.

The industry standard for power going forward and designing a study is 80% or greater, so sometimes if a smaller study is published with low power another researcher will say, well, the results look interesting in the small sample study. I'd like to design a bigger study with power of 80% or 90% to look at the same comparison and answer the question.
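
To give a feel for what designing for 80% or 90% power involves, here is a minimal Python sketch of a common textbook approximation for comparing two means with equal group sizes: n per group is about 2 times (sigma times (z for alpha plus z for power) divided by the minimum difference), squared. This is not necessarily the calculation Stata performs. The function name `n_per_group` is made up for illustration, and the standard deviation of 17 mmHg is a hypothetical value that the lecture does not state.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(min_diff, sigma, power=0.80, alpha=0.05):
    """Approximate sample size per group to detect a difference in means of
    at least min_diff with the requested power (two-sided test, equal group
    sizes, common SD sigma, normal approximation)."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = std.inv_cdf(power)          # about 0.84 for 80% power
    return ceil(2 * (sigma * (z_alpha + z_power) / min_diff) ** 2)

SIGMA = 17.0  # hypothetical common SD in mmHg, for illustration only

for target in (0.80, 0.90):
    n = n_per_group(5.4, SIGMA, power=target)
    print(f"power {target:.0%}: about {n} women per group to detect 5.4 mmHg")

# Doubling the minimum difference of interest cuts the requirement to roughly a quarter
print("to detect 10.8 mmHg at 80% power:", n_per_group(10.8, SIGMA))
```

Because the minimum difference is squared in the denominator, small changes in what counts as an interesting difference have a large effect on the required sample size.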

So what we're going to explore in the next couple of lecture sets is that many times, in the study design, a required sample size is computed to achieve a certain preset power level to detect a clinically or scientifically minimal important difference in means, proportions, or incidence rates, or ratios. And again, the industry standard for power is 80% or greater. And we'll see that going into this, this is a little bit of a game,

because the power of the study to detect the difference between populations, on the appropriate measure of interest, is a function of the size of the study samples and the minimal detectable difference of interest. So when designing a study in advance, researchers need to incorporate these elements into the design while recognizing practical considerations such as budget and personnel. So if the first attempt at designing a study to have power to find a certain difference yields really large necessary sample sizes that are out of the funding range, the researcher needs to go back to the drawing board, perhaps consider making the minimal detectable difference of interest larger, to increase the power without
