Hi, my name is Brian Caffo. This is Mathematical Biostatistics

Bootcamp, lecture ten on T Confidence intervals.

So, in this lecture, we're going to go through group T intervals, whereas in the last

lecture we did T intervals for a single mean, or you could do those intervals for a

group where the observations were paired. But now we're going to talk about instances

where we have two independent groups. We'll briefly talk about a method that

constructs a likelihood, and then we'll talk about what you do if you have unequal

variances. Let me motivate the problem a little bit.

Suppose that we want to compare the mean blood pressure between two groups in a

randomized trial, those who received treatment to those who received placebo.

Unlike last week, where people would have had to have been matched, say, comparing

the same person before and after receiving a treatment, these groups are entirely

independent. The group that received the treatment and

the group that received the placebo. So we can't use the same procedure.

We can't take pairwise differences between measurements.

In fact, they might have different sample sizes in the two groups, and then we

definitely couldn't do it. So in this lecture, we're going to talk about

ways for investigating the differences in the population means between groups when

we have independent samples. But we'll see that the methodology works

out to be very similar to what we did last week; the motivating ideas will be nearly

identical. So let's go through some assumptions that

we're going to use for our first variation of the t interval.

So our first assumption is that X1 to Xnx is a collection of IID normal random variables.

They have some mean and they have some variance.

And Y1 to Yny are IID normal. And they have a different mean but the

same variance. Right now, we're going to assume the

variance between the two groups is the same.

So we might think of X as the treated group and Y as the control group.

Or X is one group, and Y is another group. So let's let X bar, Y bar, Sx and Sy be

the means and standard deviations for the two groups.
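To make the setup concrete, here's a small simulation sketch of two independent groups with a common variance (all numbers here are hypothetical, assuming Python's standard library; this is not code from the course):

```python
import random
import statistics

random.seed(42)

# Two independent groups with a different mean each but a common variance
# (hypothetical values: mu_x = 120, mu_y = 125, sigma = 10, unequal n's).
nx, ny = 8, 21
mu_x, mu_y, sigma = 120.0, 125.0, 10.0
x = [random.gauss(mu_x, sigma) for _ in range(nx)]
y = [random.gauss(mu_y, sigma) for _ in range(ny)]

# The summary statistics the lecture works with: X bar, Y bar, Sx, Sy.
xbar = statistics.mean(x)
ybar = statistics.mean(y)
sx = statistics.stdev(x)  # sample standard deviation (divides by nx - 1)
sy = statistics.stdev(y)
print(xbar, ybar, sx, sy)
```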

And our goal is to estimate, say, the difference

mu X - mu Y, or of course you could do mu Y - mu X and look at the negative of the

answer. We would like to estimate that.

But we'd like to have a confidence interval to quantify our uncertainty in

estimating that parameter. So the obvious estimator of, say, mu Y -

mu X is Y bar - X bar. I think everyone would agree that the

interval needs to be centered at that point, or that point has to be central in

the construction of the interval. But we also need to figure out some way to

create a confidence interval to incorporate our uncertainty.

Well, let's think: can we do something that's along the lines of estimate plus or

minus a T quantile times a standard error?

Well, we want a standard error of this estimator, Y bar - X bar.

If you turn to the calculations, and I would hope that everyone in this class

could do this calculation at this point, under the assumptions that we've made, the

variance of Y bar - X bar works out to be sigma squared times (one / nx + one /

ny). And there's a really good estimator of

that entity in this setting. In fact it's a maximum likelihood

estimator, or close to it. And that's the pooled variance estimate,

Sp^2. And that works out to be ((nx - one) Sx^2 +

(ny - one) Sy^2) / (nx + ny - two), and this works out to be a good estimator of sigma

squared. Let's talk about this estimator really

quickly. If you take nx - one and you divide it by

nx + ny - two you get a number that's between zero and one.

And if you take ny - one and you divide it by nx + ny - two you get one minus that

number. You can check the calculations to make

sure that I'm right about that but I'm right.

So, this estimator, Sp^2, is nothing other than a weighted average of the two

group variances, right? So, it's a weighted average of the

variance for group X and the variance for group Y.

If nx and ny are equal, if you have the same sample size in both groups, then you

can calculate that nx - one over nx + ny - two works out to be 0.5, in which case the

pooled variance estimate works out to be the arithmetic average of the two

variances. On the other hand, suppose group X contains

a lot more data. Right?

Nx - one is a lot larger than ny - one. Then nx -

one over this denominator is going to be much bigger, and you'll get a much bigger

weight on Sx^2 than on Sy^2. And in that case, the weighted average

does exactly what you would hope: it takes whichever of the two groups has

more measurements associated with it and weights the variance estimate from that

group more heavily, which is exactly what you would hope.

There is more data, so that group's variance is going to be

estimated a little bit better.

So it makes sense that a good estimator would place more weight on it.

And so that's basically what this pooled variance estimate is: it's nothing other than an

average. It's just a weighted

average rather than an arithmetic average. Okay, so just to reiterate some of these

points. The pooled estimator is a mixture of the

group variances placing bigger weight on whichever one has the larger sample sizes.

If the sample sizes are the same, it's really easy.

All you have to do is average the two variances.
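In code, the pooled variance is just this weighted average. Here's a quick sketch (hypothetical numbers, assuming Python) that also checks that the equal-sample-size case reduces to the arithmetic average:

```python
def pooled_variance(sx2, sy2, nx, ny):
    """Weighted average of the two group variances,
    with weights proportional to (nx - 1) and (ny - 1)."""
    return ((nx - 1) * sx2 + (ny - 1) * sy2) / (nx + ny - 2)

# Unequal sample sizes: the bigger group's variance gets more weight,
# so the result sits much closer to 4.0 than to 9.0.
print(pooled_variance(sx2=4.0, sy2=9.0, nx=101, ny=11))

# Equal sample sizes: the pooled estimate is the plain average.
print(pooled_variance(4.0, 9.0, nx=11, ny=11))  # 6.5
```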

And then the pooled estimate is unbiased. We can show that really quickly.

If you take the expected value of Sp^2, you just use the fact that both of the

individual group variance estimators are unbiased, and then you wind up with the

result. I'm not going to show this, because it's kind

of complicated to do. But the pooled variance estimate turns out

to be independent of Y bar - X bar. The reason is, if you stomach this

fact that I didn't show before, that X bar is independent of Sx and Y bar is

independent of Sy. Well, then X bar - Y bar is going to be

independent of Sx and Sy, because all of the collections of things are independent.

Then, because of that, it should be independent of any function of Sx and Sy,

and Sp^2 is a function of Sx and Sy. So I'm not going to dwell on this point,

but take it as given that Y bar - X bar is independent of the pooled variance estimate.

And hopefully you can kind of get a sense where I might be going with this

calculation, what I'd like to do is create a T confidence interval.

And remember, what did I need to create a T confidence interval?

I needed to figure out a way to get a standard normal and divide it by the

square root of a Chi-squared divided by its degrees of freedom, an independent

Chi-squared. So, I'm hoping that some function of the

pooled variance will be Chi-squared. And I just stated without proof that it'll

be independent of the difference in sample means.

Well, it turns out, you know, another fact that I'm not going to prove, but one that

you can certainly take to the bank, is that the sum of independent Chi-squared

random variables is again Chi-squared, and the degrees of freedom just add up.

So let's take nx + ny - two times the pooled variance divided by sigma squared.

Well, that works out to just be nx - one times the X group variance, divided by

sigma squared, plus ny - one times the Y group variance, divided by sigma squared,

and we know from before that this first term is Chi-squared with nx - one degrees

of freedom. The second term is Chi-squared with ny -

one degrees of freedom. And so if you believe my fact above, that

the sum of two independent Chi-squared is again Chi-squared with the degrees of

freedom added, that would mean that when we add this

Chi-squared with nx - one degrees of freedom and this Chi-squared with ny - one

degrees of freedom, we get a Chi-squared with nx + ny - two

degrees of freedom. And of course we're happy assuming that

the two Chi-squareds are independent, because the entire presumption of

everything we're talking about is that the two groups we're looking at are

independent. This is sort of independent group

analysis. We're assuming that group X and group Y

are independent. Okay.
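To see this Chi-squared fact in action, here's a small Monte Carlo sketch (hypothetical parameters, assuming Python's standard library; not course code): simulate many independent two-group datasets and check that (nx + ny - two) times Sp^2 over sigma squared behaves like a Chi-squared with nx + ny - two degrees of freedom, which has mean equal to its degrees of freedom and variance equal to twice that.

```python
import random
import statistics

random.seed(1)

# Hypothetical settings: nx = 6, ny = 9, common sigma = 2.
nx, ny, sigma = 6, 9, 2.0
df = nx + ny - 2  # 13 degrees of freedom

sims = []
for _ in range(20000):
    x = [random.gauss(0.0, sigma) for _ in range(nx)]
    y = [random.gauss(5.0, sigma) for _ in range(ny)]
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / df
    sims.append(df * sp2 / sigma ** 2)

# Should be close to df = 13 (the mean of a Chi-squared is its df)
print(statistics.mean(sims))
# Should be close to 2 * df = 26 (its variance is twice its df)
print(statistics.variance(sims))
```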

So now we can construct our T

statistic. So we take Y bar - X bar, subtract off its

mean, mu Y - mu X, and divide by its

standard error, sigma times the square root of (one / nx + one / ny).

And then divide the whole thing by the square root of: (nx + ny

- two) times Sp^2 over sigma squared, which is a Chi-squared, divided by

its degrees of freedom, nx + ny - two. So if you look at that, that top part is a

standard normal: the original data for the two groups are Gaussian, so that we

know that the sample means are Gaussian, so that we know the difference in the

sample means is Gaussian. And if we take a Gaussian, and subtract

off its mean and divide by its standard

normal. So the top is a standard normal.

We're stating that the top is independent of the bottom.

And then the bottom we know is the square root of a Chi-squared divided by its

degrees of freedom. So the whole thing has to be a T random

variable with nx + ny - two degrees of freedom.

And then if you collect terms and work with the arithmetic a little bit,

you see that this left-hand side works out to be Y bar - X bar - (mu Y - mu X), the whole

thing divided by Sp times the square root of (one / nx + one / ny), which is basically

just the statistic we'd like to use, which is the observed difference in means minus

the population difference in means divided by the standard error; but with sigma

replaced with our data estimate of sigma, so sigma replaced by Sp.
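Putting it all together, here's a minimal sketch of the resulting pooled T interval for mu Y - mu X (made-up blood-pressure numbers and a T quantile looked up by hand; assuming Python's standard library, not code from the course):

```python
import math
import statistics

# Hypothetical data for two independent groups (illustration only).
x = [118, 127, 125, 132, 121, 124, 129]       # treated, nx = 7
y = [130, 126, 135, 128, 133, 138, 124, 131]  # placebo, ny = 8

nx, ny = len(x), len(y)
xbar, ybar = statistics.mean(x), statistics.mean(y)
sx2, sy2 = statistics.variance(x), statistics.variance(y)

# Pooled variance: weighted average of the group variances.
sp2 = ((nx - 1) * sx2 + (ny - 1) * sy2) / (nx + ny - 2)

# Standard error of Y bar - X bar, with sigma replaced by Sp.
se = math.sqrt(sp2) * math.sqrt(1 / nx + 1 / ny)

# 0.975 T quantile with nx + ny - 2 = 13 degrees of freedom,
# taken from a T table (about 2.160).
t_quantile = 2.160

lower = (ybar - xbar) - t_quantile * se
upper = (ybar - xbar) + t_quantile * se
print((lower, upper))  # 95% interval for mu_y - mu_x
```

Swapping the roles of X and Y just negates the interval endpoints, as the lecture notes.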