A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

From the lesson

Module 3A: Sampling Variability and Confidence Intervals

Understanding sampling variability is the key to defining the uncertainty in any given sample-based estimate from a single study. In this module, sampling variability is explicitly defined and explored through simulations. The resulting patterns from these simulations will give rise to a mathematical result that is the underpinning of all statistical interval estimation and inference: the central limit theorem. This result will be used to create 95% confidence intervals for population means, proportions and rates from the results of a single random sample.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this section, I will show the results for some computer simulations. And these will help us understand the idea of the sampling distribution. These demonstrations will show the resulting distributions of sample means across multiple random samples of the same size taken from the same theoretical population. These simulations are a tool to empirically demonstrate the difficult concept of a theoretical sampling distribution of a sample statistic. And this will get us started on that idea.

Okay, now let's build on some of what we did before. We're going to look at some examples of the sampling distribution of a sample mean, using computer simulations, and I'll explain what I mean by that shortly.

So upon completion of this lecture section, you should be able to describe the sampling distribution of a sample mean in terms of its composition. We've already defined a sampling distribution, but hopefully this will reinforce what it means with regard to sample means. You should also be able to comment on, or list, some characteristics of the sampling distributions of sample means that we've demonstrated empirically by the simulations in this lecture. These include the general shape of the distribution of sample means; where these things are centered, that is, the average of the sample means in a sampling distribution; and the variability of the distributions and its relationship to the size of the samples each mean in the distribution is based upon.

So let's look at the example here. We have a theoretical population we want to sample from. I created this with the computer. It's height measurements for adults greater than or equal to 18 years. Pretend we're doing research and we can only take a sample to try and understand what's going on. Well, I know the truth here, so for simulation purposes I took two samples, one of size 50 and one of size 100. So let's look at the observations in these.

In the sample of size 50 here, this is the distribution of individual heights amongst these 50 people. It's only 50 points, but we get some evidence that the population values we're sampling from are perhaps somewhat symmetric and maybe a little bit bell shaped. That might be a stretch with this. And we have a sample mean of 166.9 centimeters. So that would be our best guess for the true mean height of all adults greater than or equal to 18 years, based simply on this sample of 50.

But since I have the population behind the scenes, we can take another random sample, this time of 100 people. Here is the distribution of the 100 heights, and it's a little more fleshed out than the one with 50. We get a little more empirical evidence of a roughly symmetric, perhaps somewhat bell shaped, distribution of heights. The mean of this sample is 161.1 centimeters, and it differs slightly from the estimate we had from the other sample. So with these two samples, which we never have the luxury of having in real life research, we get some sense that the population distribution of heights is perhaps somewhat symmetric, and centered around a mean of somewhere on the order of 160 something. That's all we've got.

So now, I used the computer to repeatedly draw samples from this population of adults, compute the mean for each sample, and then plot them in a histogram to estimate the sampling distribution. And I did this for samples of different sizes.

So, let's look. With this first simulation here, what I did was draw 1000 samples, each with 20 observations, from this behind-the-scenes population, and I computed the mean for each of these 1000 samples. In this histogram here, these are not individual people's measurements; each point in this histogram is a sample mean from a sample of size 20. So this histogram has 1000 sample means estimated from 1000 random samples of size 20. This is an estimate of the theoretical sampling distribution for sample means from samples of size 20 from this population of adults.
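The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lecture does not say how the behind-the-scenes population was generated, so the population below (`population`, simulated heights centered at 165 cm with a spread of 3 cm) and the helper name `simulate_sample_means` are hypothetical stand-ins.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in population: 100,000 simulated adult heights in cm.
# The center (165) and spread (3) are arbitrary illustration values.
population = [rng.gauss(165.0, 3.0) for _ in range(100_000)]


def simulate_sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples random samples and return the mean of each one."""
    means = []
    for _ in range(n_samples):
        # One random sample of individuals, drawn without replacement.
        sample = rng.sample(population, sample_size)
        # Keep only the sample's mean; the individual heights are discarded.
        means.append(statistics.fmean(sample))
    return means


means_n20 = simulate_sample_means(population, sample_size=20, n_samples=1000, rng=rng)

print(len(means_n20))                  # one mean per sample drawn
print(min(means_n20), max(means_n20))  # range of the simulated sample means
```

Plotting `means_n20` as a histogram would give a picture of the kind shown on the slide: 1000 points, each one a sample mean rather than an individual height.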

On this next slide, I've done the same thing, but I've increased the number of people in each sample that I've taken. So I've taken 1000 samples. Each sample contains 50 persons. And for each sample, I computed the sample mean. So sample one had 50 people; I computed its mean, and then I plotted this mean in the histogram. Sample two had 50 people, but I didn't plot anything to do with the 50 individual measurements in there. I just plotted the sample mean for that sample and put that in this histogram. So this histogram has 1000 x bars, essentially x bars of height, each based on 50 observations.

Finally, I did this one more time, but each sample I took now had 150 people in it. So sample one had 150 people in it, and the only information I'm presenting about it in this graphic is its mean.

So, sorry, I can't seem to write that well. There we go: n equals 150, sample one. I'm not showing you the individual heights of the people in the sample. I summarized it with the mean, and the only information about this single sample that appears in the histogram here is, in fact, its mean. And I did that 1000 times. So we've got 1000 sample means in this histogram, each based on 150 people.

So we have 1000 sample means, each based on 150 people, in this histogram. You probably noticed something going on, and now I want to put these distributions of sample means side by side in box plots to look at what the patterns are. So what do you notice in this picture? Well, you probably got a sense before, watching those histograms go by. Here's the box plot of the 1000 sample means where each mean is based on 20 observations. Here's the box plot of the distribution of 1000 sample means when each is based on 50. And here's the box plot when each is based on 150. So what do we notice here? The spread of the sample means gets smaller as the number of observations each mean is based on gets larger.

And that probably makes some sense to you, common sense wise. If I ask you, would you prefer a mean estimated from a random sample of 20 people or from 150 people, your intuition would probably say 150. And why do you think that is?

Well, think about it carefully, but we've talked about the influence of individual points on a sample mean value and what happens when the sample size increases. Each individual point has less of an influence, and that tends to make the mean more stable across different samples.
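To make the point about individual influence concrete, here is a small sketch that adds one unusually large value to samples of different sizes and reports how far the sample mean moves. Everything in it is an assumption chosen for illustration: the simulated heights (center 165 cm, spread 3 cm), the extreme value `EXTREME_HEIGHT`, and the helper name `shift_from_one_extreme_value` do not come from the lecture.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed for reproducibility

EXTREME_HEIGHT = 190.0  # hypothetical unusually tall person, in cm


def shift_from_one_extreme_value(sample_size):
    """Return how far the sample mean moves when one extreme value joins the sample."""
    # Illustrative heights only: center 165 cm and spread 3 cm are arbitrary choices.
    sample = [rng.gauss(165.0, 3.0) for _ in range(sample_size)]
    mean_before = statistics.fmean(sample)
    mean_after = statistics.fmean(sample + [EXTREME_HEIGHT])
    return mean_after - mean_before


shifts = {n: shift_from_one_extreme_value(n) for n in (20, 50, 150)}
for n, shift in shifts.items():
    print(f"n = {n:3d}: one extreme value moves the mean by {shift:.2f} cm")
```

The shift is the gap between the extreme value and the old mean divided by the new sample size, so the same extreme person moves a mean of 150 people far less than a mean of 20.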

What else do you notice here? Well, look at where the centers of these distributions are, at least the medians.

The medians seem to be pretty much lined up. I can't draw a straight line here, but the medians of these distributions are, in fact, lined up. So these distributions have the same or very similar centers as measured by the median, and the distributions look roughly symmetric, so the median is close to the mean. They have similar centers, but the variation in the estimated means decreases the larger the sample size. The average value, the mean value, is the same across the different distributions.

So, let me tell you the punch line now. I actually simulated these data. These samples were taken from a population distribution of heights where the true mean was 167 centimeters and the standard deviation of the individual height measurements was 2.5 centimeters. So let's look at some numerical summaries of those pictures we just saw. Take the 1000 sample means based on samples of size 20 each: the mean of those 1000 sample means is 167, which is actually equal to the true mean. For the sample means based on 50, where we took 1000 samples each of 50 people, the mean of those estimates is 167. And the mean of the sample means each based on 150 people is 167. So what do I mean by mean of means? Well, we saw there was a distribution, there was variability in those sample mean estimates, but on average those sample means came in at 167, which happens to be the true mean.

If we look at the variation of the sample means, we can see that in all three scenarios it's less than the variability in the individual measurements, the individual height values from our population. And as we saw visually, it decreases the more information is in each sample. So what are we showing empirically, to tie this up? We're showing that the sample means, on average, equal the true mean of the population from which the samples are taken. But there's some variation in the estimates around that truth, and that variation decreases the larger the sample each mean is based upon.
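A rough way to reproduce these numerical summaries is sketched below. The true mean (167 cm), the standard deviation (2.5 cm), the three sample sizes, and the 1000 samples per size come from the lecture; drawing the heights from a normal distribution is an assumption of this sketch, since the lecture only describes the population as roughly symmetric. The names `sample_mean` and `summary` are illustrative.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed for reproducibility

TRUE_MEAN = 167.0  # population mean height in cm (from the lecture)
TRUE_SD = 2.5      # SD of individual heights in cm (from the lecture)
N_SAMPLES = 1000   # samples drawn per sample size (from the lecture)


def sample_mean(sample_size):
    """Mean of one random sample; normally distributed heights are an assumption."""
    return statistics.fmean(rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(sample_size))


summary = {}
for n in (20, 50, 150):
    means = [sample_mean(n) for _ in range(N_SAMPLES)]
    # Mean of the sample means, and how much the sample means vary.
    summary[n] = (statistics.fmean(means), statistics.stdev(means))

for n, (mean_of_means, sd_of_means) in summary.items():
    print(f"n = {n:3d}: mean of means = {mean_of_means:.2f}, SD of means = {sd_of_means:.2f}")
```

Under these assumptions the mean of the sample means should land close to 167 for every sample size, while the standard deviation of the sample means should stay well below 2.5 and shrink as the sample size grows.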

Just FYI, this simulation is a great way to illustrate a principle and help us understand this definition of a sampling distribution. But it's not something that we can do in real life. In real life we're generally only going to be able to take one sample from each of the populations we're interested in studying. The variation in the sample means that I've shown you depends on the size of each sample and not on the number of samples that I've drawn in the simulation. Just to illustrate this, I could have done the same thing taking 5000 samples each of size 20, 5000 samples each of size 50, and 5000 samples each of size 150, instead of 1000 each time. And if you look at the distributions of the sample means across these 5000 samples with each sample size here, they are very similar to those that we saw with 1000 means. So the size of the simulation, the number of times I actually sample, does not systematically affect these distributions. What's fueling the differences in variability that we're seeing in the graphics is the size of the sample that each mean is based upon.
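The comparison between 1000 and 5000 simulated samples can be checked with a sketch like the one below. It assumes, as an illustration only, that heights are drawn from a normal distribution with the lecture's mean of 167 cm and standard deviation of 2.5 cm; the helper name `sd_of_sample_means` is hypothetical.

```python
import random
import statistics

rng = random.Random(4)  # fixed seed for reproducibility


def sd_of_sample_means(sample_size, n_samples):
    """Spread of the sample means across n_samples simulated samples."""
    means = [
        # Assumption: individual heights are normal with mean 167 cm and SD 2.5 cm.
        statistics.fmean(rng.gauss(167.0, 2.5) for _ in range(sample_size))
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)


# Same three sample sizes, simulated with 1000 and then with 5000 samples.
spread = {
    (n, reps): sd_of_sample_means(n, reps)
    for n in (20, 50, 150)
    for reps in (1000, 5000)
}

for n in (20, 50, 150):
    print(f"n = {n:3d}: SD of means with 1000 samples = {spread[(n, 1000)]:.3f}, "
          f"with 5000 samples = {spread[(n, 5000)]:.3f}")
```

For a given sample size the two spreads should be nearly the same; what changes the spread is moving from 20 to 50 to 150 observations per sample.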

So this is important to note. In real life research, researchers will only be taking one sample from each population under study. As such, if it were the number of samples that determined the variability in sample means, this would make research impossible.

So let's look at another example to try and flesh this idea out more. Here's another population: hospitals in the US in 2011, and their discharges for kidney and urinary infections. This is actually based on a database. I'm using a large database behind the scenes to be my population, and I'm taking some samples from it to illustrate this principle. So let's just say I was a researcher and I could only afford to study 50 hospitals. I got a random sample from CMS, the Centers for Medicare and Medicaid Services in the US, and this is what I got. Here are my sample discharge counts for the 50 hospitals randomly sampled. You can see this distribution seems to be somewhat right skewed, and the mean in this sample tells us that the average hospital, at least among these 50, discharged 69.1 persons for kidney and urinary infections in 2011.

If you look at sample B, which is based on 250 hospitals, I suppose another researcher could do a bigger study. It has the same characteristics, but more fleshed out than the distribution for sample A. Again, in these graphics, this is the distribution of the counts for the 250 individual hospitals in my sample. Each point in here represents the number of patients discharged from one hospital for urinary and kidney infections. And the mean amongst these 250 that I've sampled is 71.7 discharges. So now we have some sense from looking at these two samples, again a luxury we wouldn't normally have, that the true distribution we're sampling from is right skewed and has an average somewhere on the order of the high 60s to low 70s. That's all we can ascertain right now.

Remember, the distribution of individual values in any single sample from a population should imperfectly mimic the distribution of individual values in the population regardless of the sample size.

So now I'm going to repeat the exercise of sampling repeatedly for samples of different sizes and looking at the distribution of the resulting sample means. This graphic here shows the estimated sampling distribution for sample means from random samples of size 50 from this hospital discharge population. So, again, this histogram no longer contains individual hospital measurements; it contains means. Each point in here is a mean from a sample of 50 hospitals. In this case I got a little more adventurous and decided to repeat the simulation 2000 times. So we have 2000 x bars, each from a sample of 50 hospitals.

Here we're going to do this again, but we're going to now do this where our samples contain 250 hospitals each.

And so again, we've got 1000 x bars, excuse me, 2000 x bars in this histogram, and each x bar is based on 250 hospitals. So we have 2000 summary measures, each summarizing the distribution of 250 hospital discharge counts. Finally, we do this one more time. Here we have the estimated sampling distribution of sample means where the random samples now each contain 400 hospitals.

So, again, in this distribution, each x bar is the mean number of discharges for a sample of 400 hospitals. And we have 2000 means in this picture.

So now let's put these all on one graphic and summarize the results. What do we see here? This looks very similar to what we saw before.

These are the means based on samples of size 50, and there are 2000 of them. These are the means based on samples of size 250, and these are the means based on samples of 400 hospitals each. So, what do we see here? Well, we again see that the variation in our sample mean estimates decreases the more information each sample mean is based upon. We also see that these distributions of sample mean estimates have some outliers, but on the whole look pretty symmetric. They seemed that way in the histogram presentation as well. And finally, we see that the centers of these distributions, and again I'm not doing a good job of drawing a straight line here, tend to line up.

So the results show us what? That the distribution of the sample means is, I'll say, somewhat normal, even though the distribution of the individual values in any one sample was right skewed. What else did we see? That the average, roughly the average and the median because these are roughly symmetric distributions, of the 2000 sample mean values was consistent across the three sample size scenarios: 50, 250, and 400. And then finally, we saw what we saw before with the height data: the variability in the 2000 sample means went down when the size of the sample each mean was based on increased.

So now, we'll come clean about what the data looked like, the population that this came from. The true mean number of discharges in this population was 69.2 discharges, and the standard deviation of these discharge counts was 58.4. So there was a lot of variation, and the population distribution was right skewed. But let's look at the results, some numerical summaries of the pictures we just looked at. Regardless of which sampling distribution estimate we look at, whether its 2000 means are based on 50 hospitals at a time, 250 hospitals, or 400, notice that the mean of our sample means was consistently very close or equal to that underlying population truth.

Further, notice that the variation in these sample means, the 2000 means we had in each estimated distribution, was in all three cases substantially smaller than the variation in the individual hospital-to-hospital counts. Variation in the means was less than the variation in the individual values. And it decreases with increasing sample size, which we already noted.
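The same pattern can be explored for a right-skewed population with the sketch below. The mean (69.2), standard deviation (58.4), sample sizes, and 2000 samples per size come from the lecture. The real population was a hospital database, which is not available here, so this sketch swaps in a gamma distribution with a matching mean and standard deviation as a stand-in; that choice, the continuous (non-integer) values it produces, and the names `draw_count` and `skewness` are all assumptions.

```python
import random
import statistics

rng = random.Random(5)  # fixed seed for reproducibility

TRUE_MEAN = 69.2   # mean discharges per hospital (from the lecture)
TRUE_SD = 58.4     # SD of discharge counts (from the lecture)
N_SAMPLES = 2000   # samples drawn per sample size (from the lecture)

# Assumption: a gamma distribution stands in for the right-skewed population.
# Shape and scale are chosen so its mean and SD match the lecture's values.
SHAPE = (TRUE_MEAN / TRUE_SD) ** 2
SCALE = TRUE_SD ** 2 / TRUE_MEAN


def draw_count():
    """One simulated hospital's discharge total (continuous stand-in for a count)."""
    return rng.gammavariate(SHAPE, SCALE)


def skewness(values):
    """Average cubed z-score: near 0 for symmetric data, positive for right skew."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


individual = [draw_count() for _ in range(20_000)]
individual_skew = skewness(individual)
print(f"individual hospitals: skewness = {individual_skew:.2f}")

results = {}
for n in (50, 250, 400):
    means = [statistics.fmean(draw_count() for _ in range(n)) for _ in range(N_SAMPLES)]
    results[n] = (statistics.fmean(means), statistics.stdev(means), skewness(means))
    print(f"n = {n:3d}: mean of means = {results[n][0]:.1f}, "
          f"SD of means = {results[n][1]:.2f}, skewness of means = {results[n][2]:.2f}")
```

Under these assumptions the individual values should show clear right skew, while the sample means should be much closer to symmetric, centered near 69.2, and less variable as the sample size grows.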

Theoretical sampling distributions for sample means, across random samples of the same size from the same population, can be estimated by computer simulation. That's what we've done here, and we'll do it in the next lecture set. Simulation is a very useful tool for exploring the properties of a sampling distribution and, basically, for drawing it. If I tried to do this by hand, it would take forever.

Some properties observed with the two examples in this lecture, which will shortly be generalized to all such cases, include the things we just noted. The variation in sample means from sample to sample decreases as the size of the sample each mean is based on increases.

On average, regardless of sample size, the overall mean of these sample means is close to, very close to really, the true thing that our sample means are estimating, whether it be the mean height for everyone in the population or the mean discharge count for all hospitals in the population. It's kind of a weird thing, but these are just numbers, so we can average them even though each one represents the average of a sample.

And finally, we saw in both cases, with very different shapes for the individual data in any one sample (roughly symmetric for the heights in the first case, right skewed for the individual hospital discharge counts in the second), that the distribution of the averages across samples is, and I'll put it in quotes, "normal", to mean approximately normal: roughly symmetric and bell-shaped.

So what we're going to see is that ultimately we can't do these simulations. We can only take one sample in real life. So ultimately, estimating the characteristics of a sampling distribution will be done using the results from a single random sample from a population. In lecture section D, these properties that we've been demonstrating empirically via the simulations in this lecture set will be generalized. We'll see that we don't have to take multiple samples, either with the computer or by hand, to understand how our statistic would behave across multiple random samples of the same size. There's some machinery that will just formalize the patterns we've seen thus far.