A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

From the lesson

Introduction and Module 1

This module, consisting of one lecture set, is intended to whet your appetite for the course, and examine the role of biostatistics in public health and medical research. Topics covered include study design types, data types, and data summarization.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

In much of public health research the

interest is in studying or comparing populations.

But frequently we can't observe all elements in a population

and have to resort to taking some sort of sample.

In this section we will discuss the idea of populations and samples.

And then talk about ways to obtain

hopefully representative samples from the larger populations.

We will briefly discuss random sampling.

And some other often used

methods for getting as representative

a sample as possible from populations under study.

Okay, in this section, we're going to

define the idea of samples versus populations.

Something that will permeate everything we do in statistics.

So upon completing this lecture section, you should be able

to explain the difference between a population and a sample,

at least so far as the terms are used in research;

give examples of populations and of a corresponding sample from a population;

explain that the characteristics of a randomly selected data sample

should imperfectly mimic the characteristics of the population

from which the sample was taken;

and then explain how non-random samples may potentially differ

systematically from the populations from which they were taken.

So let's get started by operationally defining a sample and a population.

In research, a sample is a subset or portion of a larger group,

a population, from which information is

collected to learn about this larger group.

A population is the entire group about which information is wanted.

So for example, we may want to study all 18-year-old

male college students in the United States.

That may be our population.

But a sample would be some subset, for example 25

of these 18-year-old male college students in the United States.

We want to know something about the entire population, but

we can only view it through this sample of 25.

So, for studies it is optimal if the sample which

provides the data is representative of the population under study.

That's an ideal, but it's certainly not always possible.

But for this term we'll generally make the

assumption that our sample

is representative, unless otherwise specified.

How would you go about getting a representative sample from a population?

Well, one way of

doing this is called simple random sampling.

It's a sampling scheme in which every possible

sub-sample of a given size from a population

is equally likely to be selected.

So call the size n; n could be anything. Let's say it's 25.

For example, if we took a

simple random sample from the population of all 18-year-old

male college students in the US, our scheme would be such that every

subset of 25 students from that population has the same opportunity of being chosen.

If a sample's randomly selected from a population,

the characteristics of the sample should mimic, albeit imperfectly,

those of the population the sample was taken from.

So how could we take a random sample?

Well, it sounds easy to talk about, but it's easier said than done.

First, a master list of the population must be enumerated;

it must be drawn up.

And in many cases, this is incredibly difficult, if not impossible.

And then once this master list has been drawn up, using a

computer, a random subset of any size can be drawn from the population.

So essentially the computer acts like the proverbial names in

a hat situation. It's as if we were to actually take

all the elements of a master list, put them on pieces of paper, put the

separate pieces of paper into a hat, shake up the hat, and draw out a certain number.

That's what the computer does.

It's easier to employ with the computer, because it doesn't require headwear

and reduces the risk of paper cuts.
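
As a rough sketch of the names-in-a-hat idea, here is how a computer might draw a simple random sample in Python. The master list of 5,000 student IDs and the function name `simple_random_sample` are made up for illustration; they are not from the lecture.

```python
import random

def simple_random_sample(master_list, n, seed=None):
    """Draw n distinct elements so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, like pulling names out of a hat
    return rng.sample(master_list, n)

# Hypothetical master list: ID labels for an enumerated population of 5,000 students
master_list = [f"student_{i:04d}" for i in range(5000)]

# Draw a simple random sample of size n = 25
sample = simple_random_sample(master_list, n=25, seed=1)
print(len(sample), sample[:3])
```

The hard part in practice is not this draw, it is building the master list in the first place.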

So generally speaking, with research, we

want to learn about truths in the population,

but can only estimate these from

an imperfect sample of observations from the population.

If this is a random sample, however, it may be imperfect, but its

characteristics should imperfectly mimic those of the population from which we sampled.

So for example, just suppose for a moment,

we have a theoretical population we want to study.

And some of its characteristics include that 20% of the entire

population is composed of males who are less than 30 years old.

Another 15% of the population is males who are greater than or equal to 30 years old.

Another 26% of the population is females who are less than 30 years old.

And the remaining 39% consists of females

who are greater than or equal to 30 years old.

So this population is majority female, and majority greater than or equal to 30 years old.

If we took a random sample of whatever size from this population

and looked at the sex and age characteristics of our sample, we'd expect

them to be similar to, but not perfectly match, those of the population we sampled from.

So we'd expect maybe about 20% of the sample,

maybe it would be 18%, maybe it would be 23%,

to be male and less than 30 years old.

We'd expect another 15% or so,

it could be above or below that, to be males greater than or equal to 30

years old, then another about 26% to be females

less than 30 years old,

and then finally, another 39% or so to be females who

are greater than or equal to 30 years old.

So again, these won't necessarily match perfectly.

But the basic sex and age makeup of the sample should be similar

to that of the population from which it's drawn if the sample is random.
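
To see this "similar but not identical" behavior, here is a small simulation sketch in Python. The four population percentages come from the example above; the sample size of 1,000, the seed, and the function name `sample_shares` are assumptions chosen only for illustration.

```python
import random
from collections import Counter

# Population composition from the example (shares sum to 1)
population_shares = {
    "male, <30": 0.20,
    "male, >=30": 0.15,
    "female, <30": 0.26,
    "female, >=30": 0.39,
}

def sample_shares(shares, n, seed=None):
    """Draw n people at random and return the share of each group in the sample."""
    rng = random.Random(seed)
    groups = list(shares)
    weights = [shares[g] for g in groups]
    # Each draw lands in a group with probability equal to its population share,
    # which approximates random sampling from a very large population
    draws = rng.choices(groups, weights=weights, k=n)
    counts = Counter(draws)
    return {g: counts[g] / n for g in groups}

result = sample_shares(population_shares, n=1000, seed=7)
for group, share in result.items():
    print(f"{group:13s} population {population_shares[group]:.0%}  sample {share:.1%}")
```

Changing the seed or the sample size gives slightly different sample percentages each time, which is exactly the imperfect mimicry being described.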

So, let's give an example.

Researchers wanted to learn about the pulmonary health in a population of men.

They were able to sample 113 men from this population, and measure

the systolic blood pressure in each of the males in the sample.

And for this study, their sample was essentially a random

sample. So here's what they found,

and we'll define all these terms and how they're computed shortly in the course.

What they found is that, across the 113 men, the

mean blood pressure, the average blood pressure, was 123.6 millimeters of mercury.

That being said, there was a fair

amount of variation in the individual blood pressure

[INAUDIBLE]

measurements from male to male.

And we measure that by a quantity called the standard

deviation, which we'll formally define in the next set of lectures.

One way of getting a picture of this, to see how the

values vary among these men, is to make what's called a histogram.

It's sort of like a bar chart that looks at

the percentage of observations, here blood pressures, that fall

within certain ranges.

And you can see this gives some sense of

the variation in those individual values and where the center

of that distribution is.

So if this were a random sample, we'd say that

well, we don't know the true population mean blood pressure.

But we certainly think this 123.6 is a good estimate.

And we think, were we to sample the entire population and do

a histogram of the individual blood

pressures for all males in this population,

it would be a nicer,

more cleaned up, smoother version of what we see in this sample.

So we'd expect what we see in these samples to be an imperfect representation

of the underlying true population

characteristics, which we can't directly observe.

So much of what we'll do in this class is

trying to go from these imperfect representations back to the population.

While recognizing the potential

uncertainty in our sample characteristics, given

that our sample is an imperfect subset.

So another study.

Researchers wanted to characterize the risk of mother

to infant HIV transmission within 18 months of birth.

The researchers studied 183 births to 183 randomly sampled HIV-positive women.

So, they had a random

sample of 183 HIV-positive women and they

followed them after they gave birth to the children.

They followed the children for 18 months, to see whether the children contracted HIV.

What they found is that among these 183 births, 40

of the children tested positive for HIV within 18 months,

for a transmission percentage of 22%. So if we

were trying to estimate the true risk, or percentage, of

children who contract HIV from HIV-positive mothers,

our best estimate based on this sample would be 22%.

And given that the sample was random, we'd expect this to be a reflective,

but maybe not a perfect, characterization of

the risk or proportion who would transmit

HIV to their children in the population at large,

amongst all HIV-positive pregnant mothers.

So we would use this as our starting point,

and what we'll see in this course, again, is that

we'll take this 22% and then add in some uncertainty to it

to reflect the fact that, while we think our sample is representative,

it's still an imperfect representation of the larger population under study.
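
The arithmetic behind the 22% is simply 40 divided by 183, which is about 0.219. The Python sketch below computes that estimate and then, purely as an illustration, repeats a study of the same size many times on a hypothetical population. The true risk of 22% used in the simulation is an assumption (in a real study it is unknown), and `simulate_estimates` is a made-up helper name.

```python
import random

births = 183
infected = 40

# Sample estimate of the transmission risk: 40 / 183
estimate = infected / births
print(f"Estimated transmission risk: {estimate:.1%}")

def simulate_estimates(true_risk, n, repeats, seed=None):
    """Repeat a study of size n many times and collect the estimated risks."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        # Each birth results in transmission with probability true_risk
        cases = sum(rng.random() < true_risk for _ in range(n))
        estimates.append(cases / n)
    return estimates

# Assumption: the true population risk is 22%. In practice this is unknown.
estimates = simulate_estimates(true_risk=0.22, n=births, repeats=1000, seed=3)
print(f"Simulated estimates range from {min(estimates):.1%} to {max(estimates):.1%}")
```

The spread of the simulated estimates gives a feel for why a single sample percentage is treated as a starting point rather than the exact population value.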

>> So random sampling is

actually an ideal, and even with random

sampling there can be dropout or non-participation

that leaves us with a sample that

may differ systematically from the population of interest.

This systematic difference is sometimes called bias.

But there are many situations where random sampling isn't even possible.

And so we'll consider some other types of

sampling strategies that may increase the potential for bias,

but are at least feasible to do

when random sampling can't be employed.

Certainly in many cases it's either really

difficult or impossible to do random sampling.

Now, other types of sampling may be necessary, but may also result in

samples whose elements do not reflect the makeup of the populations of interest.

And this systematic discrepancy between what the characteristics in our sample

are, versus what they are in the population, is called bias.

And bias can creep

into a sample if the sampling

procedure does not necessarily yield a representative sample.

So, if we were trying to sample voters,

not registered voters, but those who actually vote

in a US presidential election, but actually took the sample from registered voters,

then we may get a sample that differs from

our population of interest, those who will actually vote.

If we're trying

to study intravenous drug users in Chennai, India,

well, there's no master list of such persons.

For intravenous drug users anywhere, there's no master

list, so trying to get a representative sample

will be a difficult process, and it

certainly can't be done through random sampling.

If we're trying to sample patients with a certain disease,

and there's no nationwide registry, it may be

hard to obtain a representative or random sample.

If we're trying to get a representative sample of homeless persons in Baltimore.

Well certainly there's no master list we can enumerate and choose from.

And so our sampling techniques will not

necessarily be that of simple random sampling.

If we're trying to sample men who have sex with men in Malawi, well again,

there's no master list that we can choose from.

So we'll have to think about how we could sample and

do our best to get a representative sample of such persons.

So what happens when non-random samples are taken?

In many situations we can't take a random sample, so this

is always something to think about when interpreting the results of a study:

how was the sample, or samples, under study

taken from their respective populations?

So if we were again studying this proverbial

population, whose composition in terms of sex and age looks like this,

and we took a non-random sample,

we may get a sample whose characteristics systematically differ

from those of the population of interest. So we might get 40% of

our sample being males less than 30 years old, and another 40%,

this is extreme, males greater than or equal to 30 years old,

and the remaining 20% would be split between

the female younger group and the female older group.

The unfortunate thing is, if we don't know the underlying population,

we won't know how this systematically differs.

But we may have an insight that it could

systematically differ, because we aren't taking a random sample.
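
Here is a sketch in Python of how a non-random sampling process can distort the makeup of a sample. The population percentages are those of the example population. The recruitment propensities (males four times as likely as females to end up in the sample) are entirely hypothetical, as is the function name `biased_sample_shares`; they are chosen only to produce a visibly skewed sample and will not reproduce the 40%, 40%, 20% split exactly.

```python
import random
from collections import Counter

# Population composition from the example (shares sum to 1)
population_shares = {
    "male, <30": 0.20,
    "male, >=30": 0.15,
    "female, <30": 0.26,
    "female, >=30": 0.39,
}

# Hypothetical assumption: males are four times as likely to be recruited
relative_propensity = {
    "male, <30": 4.0,
    "male, >=30": 4.0,
    "female, <30": 1.0,
    "female, >=30": 1.0,
}

def biased_sample_shares(shares, propensity, n, seed=None):
    """Draw n people when some groups are more likely to be recruited than others."""
    rng = random.Random(seed)
    groups = list(shares)
    # The chance a draw lands in a group is proportional to share * propensity
    weights = [shares[g] * propensity[g] for g in groups]
    draws = rng.choices(groups, weights=weights, k=n)
    counts = Counter(draws)
    return {g: counts[g] / n for g in groups}

biased = biased_sample_shares(population_shares, relative_propensity, n=1000, seed=11)
male_share_population = population_shares["male, <30"] + population_shares["male, >=30"]
male_share_sample = biased["male, <30"] + biased["male, >=30"]
print(f"Males in population: {male_share_population:.0%}")
print(f"Males in non-random sample: {male_share_sample:.1%}")
```

In the simulation we can compare against the known population. In a real study we usually cannot, which is why the sampling process itself has to be scrutinized.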

So let's think about non-random sampling strategies, and we won't

spend a lot of time on this in this class.

But I just want to throw the ideas out

to you, because you'll be seeing these in publications.

So how might we get our hands on

voters who actually vote in the US presidential election?

Well, one thing we could do is

random digit dialing, phone dialing, among registered

voters. We could then ask each person we contact,

do you plan to vote? And

if the answer is yes, we'd ask them,

who will you vote for?

And if the answer's no, we'd say thank you very much, okay.

So what biases could creep into this process?

Well, start with random digit dialing. This is less of an issue

than it was perhaps 20 years ago and certainly 50 years ago,

but in the US, not everybody who is a registered voter has

a phone, so there could be some bias, because we're only

getting people with phones.

And then once we've reached registered voters,

we ask them, do you plan to vote or not?

Well, what kind of bias could creep in here?

Well, maybe they do not answer correctly. Maybe there's

an attempt to give the answer they think will be socially acceptable

and say, yes, I do plan to vote, when in

fact they don't.

And that could lead to some bias in who we ultimately end up with here.

How about situations where we're trying to sample

from populations that aren't well-defined, or are socially marginalized, etcetera,

such that there's no easy-to-get master list, like IV drug users, homeless people,

men who have sex with men? Well, there are a couple of procedures you can use.

You can use something called convenience sampling,

in which we might set up camp at, say, different

clinics, a needle exchange program, different clubs,

etcetera, and enroll people who agree to enroll in our study.

So we take subjects who are willing to be in our study.

And this

certainly can potentially introduce bias to the process.

Those who are willing to be in

the study may be systematically different in

many ways from those who would not appear

in the study, or maybe they're very similar.

Unfortunately we won't know.

How about something called respondent-driven sampling,

which is where we ask those who initially agree to participate

in our study to bring us more persons willing to participate.

So the sampling process is driven by those who we initially get in the study.

Well, there are a couple of biases that can potentially creep in here.

There's selection bias, as it's called, in that

there's a certain type of person who would initially agree to be in the study.

And that's going to permeate the recruitment process, because again

they're going to be asking people they know to participate.

So it will be biased in the sense that we're

not going to necessarily get access to everyone in the population.

We'll reach only those who know those who initially agreed

to be in the study, and then among those that

they ask, some will say no. So it's tricky, but these two ways might

be the only ways we can get samples from populations such as the ones listed here.

What kinds of other sampling strategies could be employed if we were trying

to get patients with a certain disease and there was no national registry?

Well, we might take a random sample, but we

might take it from a clinical or a hospital population.

So even though our sampling process is random from a list that we can enumerate,

the list we can enumerate may be a biased subset

of the population we want to study.

See, a clinical or hospital population is not necessarily the

entire population of people with a certain disease.

It's people with a certain disease who have access to healthcare.

And so even though we're randomly sampling from this group, the group we're sampling

from may not be the entire population of interest.
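
The point that a random draw from a biased list is still biased can be sketched in Python as follows. Everything about this population is hypothetical: the 10,000 patients, the 30% rural share, and the assumption that rural patients are less likely to appear on a clinic list (25% versus 75%) are invented here only to illustrate the idea.

```python
import random

rng = random.Random(5)

# Hypothetical population of 10,000 people with the disease.
# Assumption: 30% are rural, and rural patients are less likely to be on a clinic list.
population = []
for i in range(10_000):
    rural = rng.random() < 0.30
    on_clinic_list = rng.random() < (0.25 if rural else 0.75)
    population.append({"id": i, "rural": rural, "on_clinic_list": on_clinic_list})

# The only list we can enumerate: patients known to a clinic or hospital
clinic_list = [p for p in population if p["on_clinic_list"]]

# A genuinely random sample, but drawn from the clinic list, not the full population
sample = rng.sample(clinic_list, 500)

def rural_share(people):
    """Fraction of the given people who are rural."""
    return sum(p["rural"] for p in people) / len(people)

print(f"Rural share, full population: {rural_share(population):.1%}")
print(f"Rural share, clinic list:     {rural_share(clinic_list):.1%}")
print(f"Rural share, random sample:   {rural_share(sample):.1%}")
```

The sample resembles the clinic list it was drawn from, not the full population of people with the disease, even though the draw itself was random.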

So, generally speaking,

with regards to public health and medical research,

not all elements of a population can be studied.

As such, a sample is taken from the population of interest.

As we've laid out here, random sampling is the best sampling strategy

for getting a sample whose characteristics

imperfectly mimic the population of interest.

However, random sampling's not always feasible.

In many cases, it's not.

Other approaches can be used, and the sampling procedure needs to

be considered when applying the results from the sample to the population.

All of the

methods we employ in this course will be predicated upon a representative sample.

And we can apply them under that assumption.

But when it comes to interpreting the results

in the context of what they mean scientifically,

we'll want to consider how the samples

we're using were taken from their respective populations.
