Sampling variability and the central limit theorem

should not be new concepts to you anymore.

However, in this unit we're shifting the focus away

from numerical variables and focusing on categorical variables only.

So, in this video, we're going to start by

talking about the sampling distribution for a sample proportion,

because remember, when we're dealing with categorical variables, the parameter

of interest is no longer a mean but a proportion.

And we're also going to define the central limit theorem for

proportions, which is very similar to what we've seen before

but with a different measure of the standard error, as expected.

And we're going to walk through the conditions for

that central limit theorem to hold as well.

Let's revisit quickly what we mean by a sampling distribution.

Say you have a population of interest and you take a random sample from it.

And based on that random sample, you calculate a sample statistic.

If in that sample the variable of interest is a categorical

variable, the sample statistic is going to be a sample proportion.

Then we take another sample, and also calculate the sample proportion from that.

And then another one, and then another one.

And this goes on for a long time, because we

want to think about taking as many samples as we can.

The distributions of the observations within

each sample are called sample distributions.

However, when we look

at the distribution of the sample statistics,

this is what we call our sampling distribution.

And remember that these two are not the same thing at all.

In the sample distributions, the observations are individuals,

let's say people or cases, whatever it is that you're

sampling, versus in a sampling distribution the observations are sample statistics.
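
To make that distinction concrete, here is a minimal Python sketch. It is only an illustration: the population proportion (`TRUE_P`), the sample size, and the number of samples are all made-up values, and `draw_sample` is a hypothetical helper, not something from the course.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_P = 0.2        # assumed population proportion (made up for illustration)
SAMPLE_SIZE = 1000  # observations per sample (assumed)
NUM_SAMPLES = 500   # how many samples we pretend to take (assumed)


def draw_sample(p, n):
    """Return n individual observations: True means 'yes', False means 'no'."""
    return [random.random() < p for _ in range(n)]


# A sample distribution: the observations are individual cases.
one_sample = draw_sample(TRUE_P, SAMPLE_SIZE)

# The sampling distribution: the observations are sample proportions,
# one per sample.
sample_proportions = [
    sum(draw_sample(TRUE_P, SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(NUM_SAMPLES)
]

print(len(one_sample), "individual yes/no observations in one sample")
print(len(sample_proportions), "sample proportions in the sampling distribution")
```

The first list holds individual yes/no answers, while the second holds one number per sample, which is exactly the difference between a sample distribution and a sampling distribution.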

Let's give a slightly more concrete example. Say we want to estimate

the proportion of smokers in the world.

So our population is our world population, and capital N is going

to be our population size, so this is everybody in the world.

And our parameter of interest is p, the

proportion, the true proportion of smokers in the world.

If we actually had data from the entire population, we could calculate this

p as the number of smokers in the world divided by the total

population size.
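
Written out as a formula, using the p and capital N defined above (the wording of the numerator is just a label for this example):

```latex
p = \frac{\text{number of smokers in the world}}{N}
```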

But we don't have data from every single person in the world, so

let's say that you're taking many samples from this population.

So the idea here is not necessarily

a realistic situation where you're doing data

analysis per se, but we're trying to

illustrate what we mean by a sampling distribution.

So you start with the first country on the roster, Afghanistan, and you sample 1000

people from Afghanistan.

And you ask each individual, are you a smoker or

not, and record a yes or a no for each individual person.

Then so on and so forth, you go to many countries.

Let's say you take another random sample of 1000 from

the U.S., again asking each person, are you a smoker or not?

And recording a yes or a no for them.
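
The survey procedure just described could be sketched in Python roughly as follows. The smoking rates in `ASSUMED_RATES` are invented placeholders, not real data, and `survey` is a hypothetical helper that stands in for actually asking people.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical smoking rates, invented for this sketch; not real data.
ASSUMED_RATES = {"Afghanistan": 0.25, "U.S.": 0.15}
SAMPLE_SIZE = 1000  # people sampled per country, as in the example


def survey(rate, n):
    """Ask n people 'are you a smoker?' and record 'yes' or 'no' for each."""
    return ["yes" if random.random() < rate else "no" for _ in range(n)]


sample_proportions = {}
for country, rate in ASSUMED_RATES.items():
    answers = survey(rate, SAMPLE_SIZE)
    # The sample statistic: proportion of 'yes' answers in this sample.
    sample_proportions[country] = answers.count("yes") / SAMPLE_SIZE

for country, p_hat in sample_proportions.items():
    print(f"{country}: sample proportion of smokers = {p_hat:.3f}")
```

Each country contributes one sample proportion, and the collection of those proportions across many samples is what the sampling distribution describes.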