0:10

I'm sure that some of you are still

wondering why you need to learn statistics, or why so many organizations

are looking for people who understand how to use statistical methods.

Just recently, Career Cast, a web-based employment service, listed the best jobs

of 2016, and once again we see statistics at the top of the list.

This speaks to the enormous need for scientists who will be slicing and

dicing the data companies have so they can improve their decision making.

So for you, as someone who's interested in leadership roles, this is also important.

If you don't ask the right questions, then the analysis

done by the most talented statisticians will be of very little use.

You need to be able to understand statistical analysis,

ask the right questions and shape the future of the inquiries.

In this module, we are now ready to begin the process of making inferences, so

let's get started.

2:58

Using only one value for estimating the population mean or

any other population parameter leaves much room for error.

It is much better to provide a range; this range is called an interval estimate.

This is still an estimate, and

we can't be certain that it actually contains the true population parameter.

The probability that the interval actually contains

the true population parameter is called the level of confidence.

3:40

I think we have all suffered from headaches, and

when we take a pill we want to get relief as fast as possible.

So consider this: two drugs are being tested for headache relief.

We want to know the time it takes to experience relief and

the testing is done on a group of size 100.

One group takes Drug A and the other one takes Drug B.

For Drug A, the average time was 38 minutes before they felt relief.

For Drug B, the average time elapsed was 43 minutes before they felt relief.

Based on the study, the average time for Drug A is five minutes less, but

in the big picture, can you conclude that Drug A really acts faster?

Could it be the makeup of the people who were in group

A that made them feel pain relief faster?

What if the conclusion of the study showed the following?

Now Drug A resulted in 20 minutes faster relief.

Are you more likely to think that Drug A is more effective than Drug B?

Of course, there can still be other explanations for why the Drug A group

reported faster relief, but given this larger difference, the other possible

explanations may be less likely to be the reason for the difference we are seeing.

Now would be a good time to explain the concept of margin of error.

When estimating the population mean, begin with the best point estimate,

the sample mean, and then add and subtract the margin of error.

I mentioned that you might have seen the phrase margin of

error in news programs, business reports, and so on.

Here's an example for margin of error.

This image is from Gallup Daily Economic Confidence Indexes from April 15, 2016.

Look closely and you will see the sample size used and the margin of error.

By the way, what exactly is the margin of error?

Well, mathematically, the margin of error is made up of several things.

One is the size of your sample.

Remember, you should know by now at a gut level that the larger the sample size,

the more reliable your estimation will be.

That means a larger sample size will reduce the margin of error.

Then there is the natural variability that you have in your sample,

which is represented by the sample's standard deviation.

Finally, there is the confidence level, which describes the uncertainty

of the sampling method.

In other words, how confident are you that you have studied a

sample which contains the true population parameter?

The notation for this is one minus alpha.

Alpha is known as the significance level.

So if you want a 95% confidence interval, the most commonly used confidence

level, then you're essentially saying that there is a 95%

chance that the sample you took will allow you to make a correct inference

about the true population parameter and 5% chance of missing the mark.
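
One way to make that 95% concrete is to simulate it. The sketch below is an illustration, not something shown in the lecture. It assumes a made-up normal population (mean 40, standard deviation 10), draws many samples of size 50, builds a 95% interval around each sample mean using z times sigma over the square root of n as the margin of error, and counts how often the interval captures the true mean. The function name `coverage` and all of its default values are hypothetical.

```python
import random
from statistics import NormalDist, mean

def coverage(trials=2000, n=50, mu=40.0, sigma=10.0, confidence=0.95, seed=1):
    """Fraction of simulated samples whose interval contains the true mean mu."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    alpha = 1 - confidence             # significance level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / n ** 0.5      # margin of error with known sigma
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = mean(sample)
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / trials

print(f"share of intervals containing the true mean: {coverage():.3f}")
```

The printed share should land close to 0.95, with the remaining samples being the ones that miss the mark.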

Let's explore the concept of the confidence interval a little further.

6:52

This is a distribution of sample means.

The central limit theorem shows that if you take many samples and

then plot each sample mean, you will end up with approximately a normal distribution.

Thus, we expect about 68% of all sample means

to lie within one standard error of the population mean,

95.5% of the sample means to be within two standard errors

of the population mean, and 99.7% of the sample means to

lie within three standard errors of the population mean.

So if you are considering a 95% confidence level,

then you are implying that the sample that you took will have a mean

which would be roughly about plus or minus two standard errors from

the mean of this distribution, 1.96 to be exact.
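
These percentages can be checked against the standard normal distribution. The following is a minimal sketch using Python's standard library; the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist

def share_within(k):
    """Share of a normal distribution lying within k standard errors of its mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

# 1, 2 and 3 standard errors, plus the 1.96 used for a 95% confidence level
for k in (1, 2, 3, 1.96):
    print(f"within {k} standard errors: {share_within(k):.4f}")
```

The last line of output is the reason 1.96, rather than a round 2, goes with a 95% confidence level.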

8:18

So then the equation for the confidence interval for estimating the mean of

a population is the sample mean plus or minus the margin of error.

And the margin of error is calculated by finding the z

score of the desired confidence level and multiplying it by the standard error.

The standard error is calculated by taking the standard deviation of

the population and dividing it by the square root of the sample size.

9:05

It's well understood that we do sample studies because we don't

know something of interest about our population.

Thus, we don't know what the population mean or standard deviation is.

But the statistical equations use different notations and

distributions when we know the population's standard deviation, and

when we don't know this value.

To be completely precise, if we know the population standard deviation,

that is sigma, then we use the actual sigma and the normal distribution,

and we find the z score corresponding to our desired confidence level

in order to calculate the confidence interval.

However, if we don't know the population's standard deviation,

then we use the sample standard deviation s as an estimate

of sigma, the actual population standard deviation,

and another distribution, known as the

t distribution, in order to calculate the confidence interval.

10:19

The t distribution is similar to the standard normal distribution.

Both are symmetrical and bell-shaped.

However, the t distribution is more spread out than the standard

normal distribution.

That is, it has more area in its tails and less in its center.

The amount of spread of the t distribution is determined by the degrees

of freedom, which is n minus one, that is, the sample size minus one.

Now, let me analyze this graph. The most peaked and

narrow curve in this graph is the standard normal curve.

The most spread-out curve shows the t distribution when

the sample size used was three, a very small sample.

The other curves are t distributions plotted

as the sample size is increased to four and then ten.

One thing you notice is that as we increase the sample size,

the t distribution becomes more peaked and narrower, and

approaches a true normal distribution.

Thus, if you have a fairly large sample size,

then the two curves become more or less identical.

Actually, past a sample size of 30, these two distributions become very similar.

Since in this class, we only plan to work with large sample sizes,

I will always use a normal distribution, and thus a z score, for

calculating the margin of error when illustrating a problem in my lectures.

When solving problems using software like Excel,

it is just as easy to use the t score and the t distribution.

So in a sense, the t score and the z score become almost synonymous and

interchangeable when we have large data sets.

12:07

As I said, to be perfectly precise, we should be using a t distribution.

But for large sample sizes where the population standard deviation is

not known,

using just a standard normal distribution is an extremely good approximation.

Just to show you that numerically, consider the following example.

There is a population which you don't know anything about and

this is why we are taking samples from it.

We take random samples of varying sizes from this population.

I started with 30, because at 30 the t distribution and the standard

normal distribution become very close, and they get closer as the sample size increases.

For each of the samples taken, we calculate the sample's standard deviation,

which will be used as a point estimate of the population's standard deviation.

12:59

So now let's see what the t score and

z score would be if we wanted a 95% confidence interval.

This table shows the t score for the different sample sizes I have here.

As we increase the sample size the t-score starts

getting closer to 1.96, the actual z-score.
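
A comparison of this kind can be reproduced with a short script. The sketch below is only an illustration: the sample sizes are assumed rather than taken from the lecture's table, and the t critical value is found by numerically integrating the t density and bisecting, which keeps the code within the standard library. In practice a statistics package or a spreadsheet function would normally be used instead. The names `t_pdf`, `t_cdf`, and `t_critical` are hypothetical.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: 0.5 plus Simpson's-rule area under the density on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence, df):
    """Two-sided critical value, found by bisection on the CDF."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 50.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(0.975)  # z score for 95% confidence
for n in (30, 50, 100, 500, 1000):  # assumed sample sizes
    t = t_critical(0.95, n - 1)     # degrees of freedom = n - 1
    print(f"n={n:5d}  t={t:.3f}  z={z:.3f}")
```

The t column stays slightly above the z column and drifts toward it as n grows.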

So why do I say when I'm demonstrating concepts to you I will only use z-score?

Because z-score stays the same regardless of the sample size.

So for the most commonly used confidence interval of 95%,

you know that value is 1.96, approximately two standard errors from the mean.

So you can focus on calculating the standard error.

The t-score, on the other hand, is based on the t distribution and depends on

the sample size, which makes it impossible to memorize a single t-score; it is not unique.

And that would prohibit you from doing any quick calculations or mental math.

14:28

There are three confidence intervals that are used most often.

95% is the most often used; whenever the confidence

level is not mentioned, you can assume that it is 95%.

This value is the default in most statistical software as well.

A good approximation of the t-score using a normal distribution is 1.96.

The other two are the 90% confidence interval, in which case the z-score is 1.645, and

the 99% confidence interval, for which the z-score is 2.576.

You can pretty much memorize these for

quick references as you look at business reports and analysis presented to you.
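
If you ever forget these three values, they can be recovered from the standard normal distribution. This is a minimal sketch using Python's standard library; the helper name `z_score` is an assumption made for illustration.

```python
from statistics import NormalDist

def z_score(confidence):
    """Two-sided z score for a given confidence level (one minus alpha)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_score(level):.3f}")
```

The split of alpha into two halves reflects that the interval leaves an equal share of the missed cases in each tail.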