Hi, my name is Brian Caffo, and this is Mathematical Biostatistics Bootcamp Lecture Six, on likelihood. In this lecture, we're going to define what a likelihood is: a mathematical construct that is used to relate data to a population. We're going to talk about how we interpret likelihoods and how we plot them. Then we'll talk about maximum likelihood, which is a way of using likelihoods to create estimates, and finally about likelihood ratios and how to interpret them. Likelihoods arise from a probability distribution, and a probability distribution is what we're going to use to connect our data to a population. The idea behind this, and a lot (but not all) of statistics follows this rubric, is to assume that the data come from a family of distributions indexed by an unknown parameter that represents a useful summary of the distribution. To give you an example, imagine you assume that your data come from a normal distribution, a so-called Gaussian distribution, a bell-shaped curve. To completely characterize a bell-shaped curve, all you need is its mean and its variance. So the Gaussian distribution has two unknown parameters, the mean and the variance, and the goal is to use the data to infer them. The idea is that the mean and variance of the Gaussian distribution are unknown population parameters, because the Gaussian distribution is our model for the population, and the data, or sample statistics, are what we're going to use to estimate those unknown parameters. The nice part about this approach, compared to quite a few other directions in statistics, is that estimators like the sample mean and the sample variance now have estimands: with a population model, the sample mean is actually estimating something. It's not just a statement about the data.
It's an estimate of the population, and that's what we're going to be talking about today: a particular way of approaching estimation, and summarizing the evidence in the data when you assume a probability distribution, using likelihood. Likelihood is a mathematical function with a particular definition: it's just the joint density of the data, evaluated as a function of the parameters with the data fixed, and we'll go through an example. Before we do, I want to talk about what likelihoods are attempting to accomplish and how we might interpret them. So I'm going to put forward a particular theory of how likelihoods can be interpreted and used, and I should stipulate that maybe not everyone agrees with this theory. The first point is that ratios of likelihood values measure the relative evidence for one value of an unknown parameter relative to another. If you evaluate the likelihood at one specific value of the parameter you get a number, and if you evaluate it at another value you get a different number. If the ratio of those two numbers is bigger than one, it's supporting the hypothesized value of the parameter in the numerator; if it's less than one, it's supporting the hypothesized value in the denominator. This is a somewhat controversial interpretation of likelihoods, but it's the one I'm going to put forward. The second point is similarly controversial, though there is a mathematically correct result, the so-called likelihood principle, that at least motivates it (it doesn't actually prove it). Point two says that, given a statistical model, that is, a probability model, and observed data, all of the relevant information contained in the data regarding the unknown parameter is contained in the likelihood.
Now, the likelihood principle has a mathematically correct proof, but not everyone agrees on its applicability and interpretation. Nonetheless, I'm going to put this forward as the way we're going to interpret likelihoods in this class: once you collect the data, if you assume a statistical model, then the likelihood contains all of the relevant information. It's interesting that point two has very far-reaching consequences for the field of statistics if you believe it. Things like P values, much of hypothesis testing, and other staples of statistics become questionable if you take point two as being true. For today's lecture we're going to take it as true, and we'll talk a little bit about some of the controversy associated with it. Probably more practical is point three, which we already know, but let's state it in terms of likelihoods. When we have a bunch of independent data points Xi, the joint density is the product of the individual densities. Equivalently, since the likelihood is nothing other than the density evaluated as a function of the parameter, likelihoods multiply too: independence makes the joint density multiply, and it makes the likelihood multiply. I've summarized that here in the statement that the likelihood of the parameter given all of the Xs is simply the product of the individual likelihoods. The last point I'd like to make on this slide is that these interpretations of likelihoods, especially points one and two, have one negative aspect: you have to have the statistical model specified correctly, and of course we never really know the statistical model. If we assume that our data are Gaussian, that's an assumption; it's not generally something we know.
Maybe in some rare cases, like radioactive decay, there is physical theory suggesting that the data are Poisson, for example. But in most cases we don't actually know that the statistical family is a correct representation of the mechanism that would generate the data if we were to draw from the population. I think the way people rationalize using likelihood-based inference in these cases is to say: given that we assume this statistical model, we will adhere to the use of the likelihood to summarize the evidence in the data. Let's go through a specific example. It's one of the more important examples, and it's very illustrative, so let's do it. Consider flipping a coin, but an oddly shaped coin, maybe a little bent, so you don't actually know the probability of a head. Let's label that probability of a head theta. Recall that the mass function for an individual coin flip is theta to the x times one minus theta to the one minus x, where theta has to be between zero and one. If X is zero it's a tail, and if X is one it's a head. So if we flip the coin and the result is a head, then the likelihood is simply the mass function with one plugged in: theta to the one times one minus theta to the one minus one, which works out to be theta. So the likelihood function is the line theta, where theta takes values between zero and one. And if you accept the laws of likelihood and the likelihood principle and the interpretation of likelihoods that I outlined on the previous page, then consider two hypotheses: the hypothesis that the coin's true success probability is 50 percent, .5, versus the hypothesis that the coin's true success probability is .25, in the light of the data.
Given the one head that we flipped and obtained, the question is: what is the relative evidence supporting the hypothesis that the coin is fair, theta equals .5, over the hypothesis that the coin is unfair with the specific success probability .25? We take the likelihood ratio, which is .5 divided by .25, which works out to be two. So if you accept our interpretation of likelihoods, there is twice as much evidence supporting the hypothesis that theta equals .5 as the hypothesis that theta equals .25. That is the idea behind using likelihoods for the analysis of data. Now let's extend this example. Suppose we flip our coin from the previous example, but instead of flipping it just once we get the sequence one, zero, one, one. I have a slightly funny notation here: I write script L for the likelihood, and L is a function of theta, but it depends on the data that we actually observed, one, zero, one, one. We're assuming our coin flips are independent, so what happens with the likelihood? Well, you take the product. Here I have the first coin flip, theta to the one times one minus theta to the one minus one; here I have the second coin flip, theta to the zero times one minus theta to the one minus zero; and so on. Taking the product of all of those, you get theta cubed times one minus theta raised to the first power. That's the likelihood for this particular configuration of ones and zeros from four coin flips. Notice, however, that the order of the ones and zeros doesn't matter. Regardless of the order, as long as we got three heads and one tail, the likelihood was going to be the same: theta to the three times one minus theta to the one. That is a property of likelihoods. It's illustrating that, if you have a coin, the particular configuration of zeros and ones doesn't matter; all of the relevant information about the parameter is contained in the fact that we got a specific number of heads and a specific number of tails.
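Since independence makes likelihoods multiply, the four-flip likelihood and the order-invariance point are easy to check numerically. Here is a quick Python sketch of my own (the code and the `likelihood` function name are illustrative, not part of the lecture):

```python
import math

def likelihood(theta, flips):
    """Joint Bernoulli likelihood: product of theta^x * (1 - theta)^(1 - x)."""
    out = 1.0
    for x in flips:
        out *= theta ** x * (1 - theta) ** (1 - x)
    return out

theta = 0.3  # any value in (0, 1) works here
a = likelihood(theta, [1, 0, 1, 1])  # one ordering of three heads, one tail
b = likelihood(theta, [1, 1, 1, 0])  # another ordering of the same counts
print(math.isclose(a, b))                         # True: order doesn't matter
print(math.isclose(a, theta ** 3 * (1 - theta)))  # True: matches theta^3 * (1 - theta)
```

The `math.isclose` comparison sidesteps tiny floating-point differences from multiplying in different orders; the two likelihoods are mathematically identical.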
It doesn't depend on the order whatsoever. And in this case, because we know how many coin flips we have, all we need to know is the number of heads. So instead of writing the likelihood of theta depending on 1, 0, 1, 1, we might write it as the likelihood of theta depending on getting one tail and three heads, because it's the same thing; the order is irrelevant. This, by the way, raises the idea of so-called sufficiency: the number of heads in the total coin flips is sufficient for making inferences about theta. You don't actually need to know the data; all you need to know is the total number of heads and the number of coin flips. So that total number of heads, conditioning on the fact that we know the total number of coin flips, is called a sufficient statistic. It's saying that there's a reduction of the data: to make inferences about the parameter, you only need to know a summary of it, a function of it, and in this case the function you need to know is the sum, the total number of heads. Let's do a likelihood calculation again. Take the likelihood supporting the coin being fair, theta equals .5, and divide by the likelihood assuming the coin is unfair, specifically with a 25 percent chance of heads, and we get a ratio of 5.33. In other words, there's over five times as much evidence supporting the hypothesis that theta is .5 over the hypothesis that theta is .25. Now, relative values of likelihoods measure evidence. That's useful, but we're not particularly interested in .25 specifically. The .5 is kind of interesting because that's a fair coin, but for most other points, we're not interested in .25 any more than we're interested in .24, and so on. So we'd like a way to consider likelihood ratios for all values of the parameter theta. And that is simply a likelihood plot, which plots theta against the likelihood value.
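The 5.33 ratio is just two evaluations of the likelihood theta cubed times one minus theta. A quick sketch in Python (mine, not from the slides):

```python
def likelihood(theta):
    """Likelihood for three heads and one tail: theta^3 * (1 - theta)."""
    return theta ** 3 * (1 - theta)

lr = likelihood(0.5) / likelihood(0.25)
print(round(lr, 2))  # 5.33: over five times the evidence for theta = 0.5 vs 0.25
```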
And remember that likelihoods are really interpreted in terms of relative evidence. It's the fact that the ratio of the likelihood at .5 to the likelihood at .25 is about five that says we have five times as much evidence. So constants that don't depend on theta don't matter in the likelihood, right? If there's a constant that doesn't depend on theta in both the numerator and the denominator, it just cancels out. The likelihood, in this interpretation, should be invariant to constants that are not a function of the parameter. Because of that, the raw absolute value of the likelihood isn't altogether that informative, so we need to pick a rule for normalizing it. Why not divide it by its maximum value, so that its height is one? That seems a pretty reasonable rule, and it helps with interpretation, I think. And I just want to reiterate this last point: if you're going to buy into this likelihood paradigm of interpreting likelihoods, everyone agrees that likelihoods measure relative rather than absolute evidence, so dividing the curve by its maximum value, or any value, doesn't change its interpretation. It's actually an interesting question, I might add, whether someone could create an absolute measure of evidence in statistics, and I'm not aware of any. So we'll have to stick to relative measures for now. Here on the next page is a likelihood plot. We have theta on the horizontal axis and the likelihood value, with a maximum of one, on the vertical axis. This is exactly the likelihood for the four coin flips that we saw. The peak value is one, and as the likelihood goes down, those values of theta are worse and worse supported. Now, the peak value is kind of interesting, right?
The likelihood value by which we divided is the best supported point given the data. So that's kind of interesting, right? Because that point has the highest likelihood value, no matter what you divide it by, you're always going to get a likelihood ratio bigger than one. So that point seems special, and in fact we give it a name: the maximum likelihood point. And maximum likelihood turns out to be a very useful technique; in fact, you might not know this, but the vast majority of statistical estimators either are maximum likelihood estimators or are very close to them. The way you would interpret this plot, for example, is: take any two points, say .4 and .6, look at the height of the curve at each, and the ratio between those heights is the relative evidence. And because we divided by the maximum, every value that we look at gives the relative evidence for that specific value of theta when compared to the point that is best supported by the data, the maximum likelihood point. So here, the value for a fair coin, .5, actually has, perhaps surprisingly, a normalized likelihood value of about .5. Which means that if you were to divide the likelihood at the fair-coin value by the likelihood at the maximum likelihood value, you would get a ratio of about .5: the relative evidence for .5 compared to the best supported point, which turns out to be .75. Now, we might draw a horizontal line, and let's say we drew it at one eighth; I think that's where this top line is. What does that mean? Every point that falls between the endpoints of this line is such that no other point is more than eight times better supported. So take the point where the curve meets this line. That's exactly one eighth. What does that mean?
That point is exactly eight times worse supported, given the data, than .75, the maximum likelihood value. And take any point in the interval that falls between the ends of this line: you can't find another point that's more than eight times better supported. Take, for example, .4. It has a normalized likelihood value of about .3 or so, above one eighth, so its ratio relative to the maximum is less than eight, and since everything else has a smaller likelihood than the maximum, you're not going to be able to find, for .4, another point anywhere on this curve that's more than eight times better supported than it. So that's the idea behind drawing a line at, say, one eighth: the collection of parameter values lying between the points where the horizontal line meets the likelihood curve are well supported. And of course, as you draw the line higher and higher, fewer points stay in the interval, to the point where, if you draw it high enough, only the maximum likelihood value survives the threshold. Just to reiterate some of the points we made on the previous slide: the value of theta where the curve reaches its maximum is the maximum likelihood estimate, and if we want to write it out mathematically, the MLE is the argument maximum over theta of the likelihood, having plugged in the data X. A nice interpretation of the MLE is that it's the value of the parameter that would make the data we observed most probable. In this case we have three heads and one tail, and the question is: what success probability for the coin would make the data we observed most probable? That's a nice interpretation of the MLE as well. Well, it turns out, and I think I've alluded to this because I kept saying the MLE in the previous example was .75: how did I get that? There were three heads out of four flips, so that's a proportion of heads of .75.
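The normalized likelihood plot and the one-eighth reference line can be reproduced numerically. Here's a sketch in Python (the grid, names, and endpoints printed are my own construction, assuming the three-heads-one-tail likelihood from the example):

```python
def likelihood(theta):
    """Likelihood for three heads, one tail: theta^3 * (1 - theta)."""
    return theta ** 3 * (1 - theta)

grid = [i / 1000 for i in range(1001)]   # theta values from 0 to 1
vals = [likelihood(t) for t in grid]
peak = max(vals)
mle = grid[vals.index(peak)]             # argmax: 0.75, the proportion of heads
normed = [v / peak for v in vals]        # divide by the maximum so the height is 1

# Parameter values above the 1/8 reference line: for each of these,
# no other point is more than eight times better supported.
supported = [t for t, v in zip(grid, normed) if v >= 1 / 8]
print(mle, supported[0], supported[-1])  # MLE and rough interval endpoints
```

On this grid the well-supported interval runs from roughly .26 to roughly .99, with the peak at .75, matching the plot described above.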
Well, it turns out that if you have independent, identically distributed coin flips, then the MLE for theta is always the proportion of heads that you get. And I think if anyone were asked to give a single point estimate for the success probability of that coin, they would all give the proportion of heads. So, to be honest, the fact that maximum likelihood yields that is not so much a booster for using the proportion of heads as an estimator; it's more that it motivates the use of the MLE in more complicated settings where we don't already have great intuition about what the logical estimator should be. That's the benefit of maximum likelihood: in a lot of the cases where we have a really good idea of what the right estimator should be, the MLE returns estimators that exactly mirror our intuition. That gives us some hope that it will be a useful thing to do in settings where we don't know what the best estimator is. In addition, there have been, I think it's fair to say, tomes of theory developed in support of MLEs as, for example, the number of data points goes to infinity. So let's actually prove the fact that if you have n Bernoulli coin flips, the maximum likelihood estimator is the proportion of heads. Let n be the number of trials, and let x be the number of heads. Remember that in this case the likelihood is theta to the x times one minus theta to the n minus x: theta to the number of heads, one minus theta to the number of tails. And if we want to find the argument maximum of this function, it turns out it's easier to maximize the log likelihood. This is almost a general principle in statistics: when you have a bunch of independent things and you want to maximize a likelihood, you're better off maximizing the log likelihood.
That's because if you maximize the log of a function, you've maximized the function, since the log is a monotonically increasing function. And in addition, the fact that you have a bunch of independent things means you've multiplied a bunch of things together to get the joint density or mass function. When you multiply things, things get raised to powers and so on, and these are all complicated to work with; addition is much easier. The log converts products into sums, and that's really quite useful. So x, which was a power, is no longer a power on the log scale: you get x log theta plus (n minus x) log(one minus theta), which is a much easier function to work with. In this case you can do it either way, no problem, but one of the reasons the log helps in general is that it takes care of the annoying products you get from independence, from multiplying a bunch of densities or mass functions together. If we take the derivative, we get x over theta minus (n minus x) over (one minus theta). To find the critical point, we set this equal to zero, and I'm not going to churn through the calculations. If you set it equal to zero and bring the two terms to either side, you get x times (one minus theta) equal to (n minus x) times theta, and it's pretty clear that theta equal to x over n solves that equation: plug in x over n and you get a valid equality. So the value of theta that makes the observed data most likely, in iid Bernoulli trials, is the proportion of heads, x over n. Oh, and below I checked the second derivative condition to make sure the log likelihood is concave, so this is indeed a maximum.
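The derivation above can be checked numerically, with a grid search over the log likelihood standing in for the calculus (a Python sketch of my own, under the same three-heads-in-four-flips setup):

```python
import math

def log_likelihood(theta, x, n):
    """Log likelihood for x heads in n Bernoulli trials:
    x * log(theta) + (n - x) * log(1 - theta)."""
    return x * math.log(theta) + (n - x) * math.log(1 - theta)

x, n = 3, 4
grid = [i / 10000 for i in range(1, 10000)]  # the open interval (0, 1)
mle = max(grid, key=lambda t: log_likelihood(t, x, n))
print(mle)  # 0.75, i.e. x / n, the proportion of heads
```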
So, technically, this doesn't handle the case where you got all failures or all successes, but maybe work those cases out on your own. Now, what constitutes strong evidence? If we're going to treat the likelihood as our arbiter of evidence and likelihood ratios as measures of evidence, we'd like to build up some intuition. A friend and faculty member here taught me this idea: why don't we use coin flipping as the mechanism for building up our intuition for strength of evidence? So imagine an experiment where a person is considering three possible hypotheses about a coin. The coin has tails on both sides, in other words theta equals zero; the coin is fair, theta equals .5; or the coin has heads on both sides, theta equals one. So here we have hypothesis one, hypothesis two, and hypothesis three, and I have a table of the possible outcomes. Suppose I flip the coin and it's a head. I've done this experiment; unfortunately it's difficult to do in this setting, but I've done it in class, and you'll just have to take my word for it. On one coin flip, pretty much no one is willing to ditch the hypothesis that the coin is fair. So on one coin flip, suppose you get a head. The probability of a head given the first hypothesis, that the coin has tails on both sides, is zero. The probability of a head given the hypothesis that the coin is fair is .5, and the probability of a head given hypothesis three, that the coin is two-headed, is one. So the likelihood ratio of hypothesis one to hypothesis two is zero, and the likelihood ratio of hypothesis three relative to hypothesis two is two. And of course, this is exactly what we would hope, right? A two-tailed coin can't produce heads, so if we get one head, the likelihood ratio supporting the two-tailed hypothesis should be zero.
Okay, and a ratio of two says there is twice as much evidence supporting the hypothesis that the coin is two-headed as the hypothesis that the coin is fair, given a single coin flip that is a head. It's clear that two is not terribly strong evidence either way: if you flip the coin once, something is going to happen, so a 50 percent probability of getting a head is not that compelling. Now suppose we get two heads in a row. At this point I'll quit talking about hypothesis one, because you can't get two heads in a row under hypothesis one. Here I've outlined all the different possibilities, head-head, head-tail, tail-head, and tail-tail, and I give the likelihood ratio for each of them. For two consecutive heads, the probability is .25 if the coin is fair, and 100 percent if the coin is two-headed, so the likelihood ratio is now four: four times as much evidence supporting the hypothesis that the coin is two-headed as the hypothesis that the coin is fair, if you get two consecutive heads. Now suppose we get three consecutive heads. The probability of getting three heads if the coin is fair is .125; the probability of three consecutive heads if the coin is two-headed is 100 percent. You get a likelihood ratio of eight, and in this case that means there's eight times as much evidence supporting the hypothesis that the coin is two-headed relative to the hypothesis that the coin is fair. So let me tell you what happens when I do this in a class. I have a two-headed coin, and I play this game. People are willing to keep considering the hypothesis that the coin is fair, I guess because most of the time people aren't aware that two-headed coins are easy to buy. Around three consecutive heads, a substantial fraction of the class has started to believe the coin is two-headed. Four consecutive heads, where the likelihood ratio would of course be sixteen; five consecutive heads, where it would be 32; and so on.
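The likelihood ratios in this coin game grow as two to the number of consecutive heads, which is easy to tabulate (a short Python sketch of my own):

```python
fair, two_headed = 0.5, 1.0
for n in range(1, 6):
    # Probability of n consecutive heads under each hypothesis,
    # and their ratio, which works out to 2**n:
    lr = two_headed ** n / fair ** n
    print(n, lr)  # 1 -> 2.0, 2 -> 4.0, 3 -> 8.0, 4 -> 16.0, 5 -> 32.0
```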
By four consecutive heads, the vast majority of the class believes it's two-headed, and by five consecutive heads, basically 100 percent of the class agrees. I've also done games where I have a fair coin and an unfair coin. I show the class that one of them is fair and one of them is unfair, then jumble them up in my hand so they don't know which one I'm flipping, so they know I'm not trying to trick them. Well, I am trying to trick them, but not in an obvious way. (Actually, you can kind of tell by the weight which one is fair and which one is not, so I always grab the unfair one.) These create useful benchmarks, right? The idea is to use coin flipping, an easy experiment that we can understand, to build up context for what likelihood ratios mean. So eight is sort of moderate evidence; it's like getting three consecutive heads against the coin being fair. Sixteen is strong evidence, like getting four consecutive heads, and 32 is quite strong evidence. Admittedly, the coin is just used for context, but these benchmarks are no more arbitrary than, say, the existing thresholds used for P values, where people arbitrarily pick five percent as their cutoff for Type I error rates, if you're aware of that sort of thing. At any rate, this is why, for example, I draw lines on likelihood plots at the value of one eighth: that way, parameter values above the one-eighth reference line are such that no other point is more than eight times better supported, given the data. That's the end of the technical component of this lecture. I wanted to spend a little time talking about the consequences of adopting this style of analysis.
So, pretty much every major paradigm in statistics, Bayesianism, frequentism, this likelihood paradigm, agrees that if you assume a probability model and act as if it's true, then the likelihood ratio is a central component of the theory. If you take enough mathematical statistics, you'll see this. The particular paradigm I'm discussing today goes beyond this relatively benign use of likelihood ratios that occurs in the other areas. What I'm saying today is not just that the likelihood ratio is useful, but that likelihood ratios measure relative evidence, and that given a statistical model and observed data, all of the relevant information is contained in the likelihood. And this has far-reaching consequences for the field of statistics. If you go beyond saying likelihoods are useful to saying they have these properties, it changes quite a bit of statistics. For example, much of statistics is devoted to things like hypothesis testing and P values and other variants of statistics whose interpretation involves potentially fictitious repetitions of an experiment. If you've ever heard of a confidence interval, its interpretation is quite confusing, but it's something along the lines of: if you were to use this technique over and over again, you would obtain intervals that contain the things they're trying to estimate, say, 95 percent of the time. Well, if you adopt this strong variant of interpreting likelihoods, then that interpretation can't be valid, because it involves potentially fictitious repetitions of the experiment which do not depend on the likelihood for the data at hand, so it cannot carry any additional evidence.
So some of the things that get disputed if you adopt this paradigm are P values, hypothesis testing, and multiple-comparison corrections; those are the big ones that come to the top of my head. This is very much disputed, because in many ways these techniques seem central to the idea of statistics. So, I really just wanted at this point to introduce people to these concepts and state the consequences of this theory. For the purposes of this class, what I would hope you know after this lecture is what the likelihood is; that regardless of what paradigm of statistics you're in, higher likelihoods generally refer to better supported values of the parameter; and that you understand the principle of maximum likelihood. Thank you for listening. This was Mathematical Biostatistics Bootcamp Lecture Six, and I look forward to seeing you at the next lecture.