Hi, my name is Brian Caffo, and this is Mathematical Biostatistics Bootcamp,

Lecture six on Likelihood. In this lecture, we're going to define

what a likelihood is, which is a mathematical construct that is used to

relate data to a population. We are going to talk about how we

interpret likelihoods, talk about plotting them.

And then talk about maximum likelihood, which is a way of using likelihoods to

create estimates. And then we'll talk about likelihood

ratios and how to interpret them. Likelihoods arise from a probability

distribution. And a probability distribution is what

we're going to use to connect our data to a population.

So the idea behind this, and a lot (but not all) of statistics follows this rubric, is

to assume that the data come from a family of distributions.

And those distributions are indexed by an unknown parameter that represents a useful

summary of the distribution. To give you an example, imagine if you

assume that your data comes from a normal distribution, a so-called Gaussian

distribution, so a bell-shaped curve. To completely characterize a bell-shaped

curve, all you need is its mean and its variance.

So the probability distribution, the Gaussian distribution or the bell-shaped

curve, has two unknown parameters: the mean and the variance.

And then the goal is to use the data to infer the mean and the variance.

And the idea is that the mean and variance from the Gaussian distribution are unknown

population parameters, because the Gaussian distribution is our model for the

population. And the sample statistics are what

we are going to use to estimate the unknown parameters.

So the nice part about this approach, compared with quite a few other directions

in statistics, is that the sample mean and the sample variance are

estimators: with a population model, you actually have estimands.

The sample mean is actually estimating something.

It's not just a statement about the data. It's an estimate of the population, and

that's what we're going to be talking about today. We're going to talk about a

particular way of approaching estimation and summarizing evidence in the data,

when you assume a probability distribution, using the likelihood.

The likelihood is a mathematical function with a particular definition:

it is the joint density of the data, evaluated as a function of the

parameters with the data fixed. We'll go through an example.

Before we go through our example, I want to talk about what it is that likelihoods

are attempting to accomplish, and how we might interpret them.

So I'm going to put forward a particular theory of how likelihoods can be

interpreted and how they can be used and I guess I should stipulate that maybe not

everyone agrees with this theory. But the theory I'm going to put forward is

that, ratios of likelihood values measure relative evidence of one value of an

unknown parameter relative to another. So if you evaluate the likelihood at one

specific parameter value you get a number, and if you evaluate it at another

value you get a different number.

If the ratio of the two is bigger than one, it supports the hypothesized value of the

parameter in the numerator. If it's less than one, it supports

the hypothesized value of the parameter in the denominator.

So this is a somewhat controversial interpretation of likelihoods, but it's

the one I'm going to put forward. The second point is similarly

controversial, though there is a mathematically correct proof that at least

motivates it, even if it doesn't strictly prove it. The statement I'm making

is that given a statistical model, so given a probability model and observed data,

there is a theorem called the likelihood principle that says all of the relevant

information contained in the data regarding the unknown parameter is

contained in the likelihood. Now, the likelihood principle has a

mathematically correct proof but not everyone technically agrees on its

applicability and its interpretation, but nonetheless I'm going to put this forward

as the way that in this class we're going to interpret likelihoods: once you

collect the data, if you assume a statistical model, then the likelihood is

going to contain all of the relevant information.

It's interesting that this point two has very far-reaching consequences for the

field of statistics if you believe it. Things like P-values, and much of

hypothesis testing, and other staples of statistics become questionable if you take

point two as being true. So, you know for today's lecture, we're

going to take it as being true. And we'll talk a little bit about, maybe,

some of the controversy associated with it.

Probably much more practical is point three, which says, and we

already know this, but let's state it in terms of likelihood.

So when we have a bunch of independent data points, Xi, the joint density is

the product of the individual densities. So, equivalently, since we said that the

likelihood is nothing other than the density evaluated as a function of

the parameter, it's also true that the likelihoods multiply: independence

makes things multiply. It makes the joint density multiply, and it

makes the likelihood multiply. I summarize that here in the

statement that the likelihood of the parameter given all of the Xs is

simply the product of the individual likelihoods.

The last point I'd like to make on this slide concerns these interpretations of

likelihoods, especially points one and two.

One negative aspect of them is that you have to actually have the statistical

model specified correctly, and of course we never really know the statistical

model. If we assume that our data is Gaussian,

that's an assumption. It's not generally something we know.

Maybe in some rare cases, like, for example, radioactive decay,

there is some physical theory that suggests the data is Poisson.

But in most cases, we don't actually know

that the statistical family is a correct representation of the mechanism that would

generate data, if we were to draw from the population.

So, I think the way in which people still rationalize using likelihood based

inference in these cases is that they say, well given that we assume this is the

statistical model, then we will adhere to the use of the likelihood to summarize the

evidence in the data. Let's go through a specific example.

One of the more important examples, and it's very illustrative, so let's do it.

Consider just flipping a coin, but let's say it's an oddly shaped

coin. Maybe it's a little bent or something like

that. So you don't actually know what the

probability of a head is. Let's label that probability of a head as

theta. And then recall that the mass function for

an individual coin flip is theta to the x times one minus theta to the one minus x,

that is, theta^x (1 - theta)^(1 - x). Here, theta has to be between zero and one.

So if X is zero it's a tail, and if X is one, it's a head.

So if we flip the coin and the result is a head, then the likelihood is simply the

mass function with the one plugged in, right?

So in this case we get theta to the one times one minus theta to the one minus one, which

works out to be theta. So the likelihood function is the line

L(theta) = theta, where theta takes values between zero and one.

And if you accept our laws of likelihood, the likelihood principle, and the

interpretation of likelihoods that I outlined on the previous page, then

consider two hypotheses: the hypothesis that the coin's true

success probability is 50%, .5, versus the hypothesis that the coin's true success

probability is .25. In the light of the data, right?

The one head that we flipped and obtained, the question is: what is the relative

evidence supporting the hypothesis that the coin is fair, .5, over the coin being unfair

with the specific success probability of .25? We would take the likelihood ratio,

which is then .5 divided by .25, which works out to be two.
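
A minimal sketch of that calculation in Python (the function name is my own, not from the lecture):

```python
def bernoulli_likelihood(theta, x):
    """Likelihood of theta for one coin flip: x = 1 for a head, 0 for a tail."""
    return theta ** x * (1 - theta) ** (1 - x)

# One observed head: the likelihood is L(theta) = theta, so the ratio of
# the fair-coin value to the theta = .25 value is .5 / .25.
ratio = bernoulli_likelihood(0.5, 1) / bernoulli_likelihood(0.25, 1)
print(ratio)  # 2.0
```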

So if you accept our interpretation of likelihoods, this would say there is twice

as much evidence supporting the hypothesis that theta equals .5 to the hypothesis

that theta equals .25. So that is the idea behind using

likelihoods for the analysis of data. Now let's just extend this example.

So, suppose we flip our coin from the previous example, but instead of flipping

it just once we flip it four times and get the sequence one, zero, one, one.

I have kind of a funny notation here: I am going to write script L as the

likelihood, and L is a function of theta. But it depends on the data that we

actually observe, one, zero, one, one, and we're assuming our coin flips are

independent. And so what happens with a likelihood like this?

You take the product. So, here I have the first coin flip: theta

to the one times one minus theta to the one minus one.

Here I have the second coin flip: theta to the zero times one minus theta to the one minus

zero, and so on. So, I take the product of all of those and

you get theta cubed times one minus theta raised to the first power, theta^3 (1 - theta).

And that's the likelihood for this particular configuration of ones and

zeroes from four coin flips. Notice, however: does the order of the 1s

and the 0s matter? Regardless of the order, as long as we got

three heads and one tail, the likelihood was going to be equivalent.

It was going to give you theta to the three times one minus theta to the one.
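
To illustrate that order invariance (a Python sketch with names of my own choosing, not the lecture's code): every ordering of three heads and one tail yields the same likelihood value.

```python
from itertools import permutations

def likelihood(theta, flips):
    """Product of Bernoulli terms theta^x * (1 - theta)^(1 - x) over independent flips."""
    value = 1.0
    for x in flips:
        value *= theta ** x * (1 - theta) ** (1 - x)
    return value

theta = 0.6  # an arbitrary trial value of the parameter

# The observed sequence 1, 0, 1, 1 gives theta^3 * (1 - theta).
print(likelihood(theta, [1, 0, 1, 1]))  # equals 0.6**3 * 0.4, up to rounding

# Every reordering of three heads and one tail gives the same value.
values = {round(likelihood(theta, p), 12) for p in permutations([1, 0, 1, 1])}
print(len(values))  # 1
```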

So, that is a property of likelihoods. It's illustrating that, if you have a

coin, the particular configuration of zeros and ones doesn't matter.

All of the relevant information about the parameter is contained only in the

fact that we got a specific number of heads and a specific number of tails.

It doesn't depend on the order whatsoever. And in this case, because we know how many

coin flips we have, all we need to know is the specific number of heads.

So instead of writing likelihood of theta depending on 1,0,1,1, we might write it as

likelihood of theta depending on getting one tail and three heads because it's the

same thing. The order is irrelevant.

This, by the way, raises the idea of so-called sufficiency. In this case,

the number of heads, given the total number of coin flips, is sufficient for making inferences

about theta. You don't actually need to know the data;

all you need to know is the total number of heads and the number of coin flips.

So that total number of heads, conditioning on the fact that we

know the total number of coin flips, is called a sufficient statistic.

It's saying that there's a reduction of the data: to make inferences about the

parameter, you only need to know a summary of it, a function of it.

And in this case the function we need to know is the sum, the total number of heads.
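
In code (a sketch of my own; the helper name is an assumption), the likelihood written in terms of the sufficient statistic is just theta raised to the number of heads times one minus theta raised to the number of tails, and the fair-versus-.25 ratio for three heads and one tail follows directly:

```python
def likelihood(theta, heads, tails):
    """Bernoulli likelihood in terms of the sufficient statistic (heads, tails)."""
    return theta ** heads * (1 - theta) ** tails

# Three heads and one tail: compare theta = .5 (fair) to theta = .25.
ratio = likelihood(0.5, 3, 1) / likelihood(0.25, 3, 1)
print(round(ratio, 2))  # 5.33
```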

So let's do a likelihood calculation again. Let's take the likelihood supporting

that the coin is fair, that theta is .5, and divide it by the likelihood assuming that

the coin is unfair, specifically with a 25 percent chance of heads, and we get a

ratio of 5.33. So in other words, there's over five times

as much evidence supporting the hypothesis that theta is .5 over the

hypothesis that theta is .25. Now, relative values of likelihoods

measure evidence. Well, that's useful, but we're not

particularly interested in, say, .25; I mean, .5 is kind of interesting because the

coin would be fair, but for most other points,

we're not interested in .25 any more than we're interested in

.24 and so on. So we'd like a way to consider likelihood

ratios across all values of the parameter theta.

And this is simply a likelihood plot, which plots theta against the likelihood

value. And remember that likelihoods are really

interpreted in terms of relative evidence. So it's the fact that the ratio of the

likelihood of .5 to the likelihood of .25 is over five that says we have over five

times as much evidence. So the absolute scale actually doesn't matter.

Constants that don't depend on theta don't matter in the likelihood, right?

Because when you take the ratio, if there's a constant that doesn't depend on

theta, and it's in both the numerator and the

denominator, it'll just cancel out. The likelihood and its interpretation

should be invariant to constants that are not a function of the parameter.

So because of that, the raw absolute value of the likelihood isn't altogether that

informative, so we need to pick a rule for normalizing it. Why don't

we just divide it by its maximum value so that its height is one?

And that seems to be a pretty reasonable rule, and it helps with interpretations, I

think. And again I just want to reiterate this
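
As a sketch of that normalization rule in Python (a simple grid of my own, not the lecture's code), take the four-flip likelihood theta^3 (1 - theta) and divide by its maximum:

```python
# Likelihood from the four-flip example, theta^3 * (1 - theta), on a grid.
thetas = [i / 1000 for i in range(1001)]
lik = [t ** 3 * (1 - t) for t in thetas]

# Divide by the maximum value so the curve's height is one.
peak = max(lik)
normalized = [v / peak for v in lik]

print(max(normalized))          # 1.0
print(thetas[lik.index(peak)])  # 0.75, where the likelihood peaks
```

The theta at which the curve peaks, 0.75 here (three heads in four flips), is the maximum likelihood estimate mentioned at the start of the lecture.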

last point. Because, if you're

going to buy into this sort of likelihood paradigm of interpreting likelihoods,

everyone agrees that they measure relative evidence rather than absolute evidence.

So dividing the curve by its maximum value, or any value, it

doesn't change its interpretation. It's actually an interesting question, I

might add, whether someone could create an absolute measure of

evidence in statistics; I'm not aware of any.

So we'll have to stick to relative measures for now.