Hi. My name is Brian Caffo, and this is Mathematical Biostatistics Boot Camp, lecture fourteen, on logs. In this lecture we're going to talk about what it means to log data and what impact that has when you do things like take arithmetic means of logged data and create confidence intervals. So we'll talk about logs, and we'll talk about the geometric mean, which is intrinsically related to taking logs of data and taking arithmetic means, and its relationship with the law of large numbers and the central limit theorem. Then we'll go through some of the techniques we've already covered, like creating t confidence intervals, and go over how they're interpreted with respect to logged data. And then we'll finish by talking about the log normal distribution. So, just to remind everyone a little bit about logs. Log base b of the number x is the number y such that b to the y equals x. Log base b of one is always zero, because b to the zero equals one. And log base b of x travels to minus infinity as x travels to zero. For this class, we've been writing just "log" when the base is e, Euler's number; some people write "ln" in that case. There are basically only three bases for logs that people ever use. Base e has a lot of nice mathematical properties. Base ten is nice because log base ten speaks to orders of magnitude: log base ten of ten is one, log base ten of 100 is two, log base ten of 1,000 is three, and so on. And log base two is often useful as well; because two is a smaller number than ten, you get lower powers. And just to remind everyone: log of AB is log A plus log B, log of A raised to the B power is B log A, and log of A divided by B is log A minus log B. In other words, log turns multiplication into addition, division into subtraction, and powers into multiplication.
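Those log rules are easy to check numerically. Here's a quick Python sketch, with arbitrary numbers, verifying each identity just mentioned:

```python
import math

a, b = 12.0, 5.0

# log turns multiplication into addition
assert math.isclose(math.log(a * b), math.log(a) + math.log(b))
# division into subtraction
assert math.isclose(math.log(a / b), math.log(a) - math.log(b))
# powers into multiplication
assert math.isclose(math.log(a ** b), b * math.log(a))

# log base b of x is the number y such that b**y == x
y = math.log(100, 10)          # log base 10 of 100
assert math.isclose(10 ** y, 100)

# log base b of 1 is zero for any base
assert math.log(1, 2) == 0.0
```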
So hopefully none of this is news to you. Those are the mathematical properties of the log. But statistically, why do we take logs of data? The most common reason is that the data are skewed high. Incomes are a great traditional example of something that tends to be skewed high: you have a lot of people making very little money and a handful of people making a lot of money, so the distribution looks like a hump towards zero that spreads out with a long tail towards high values. You might take logs of income data to try to make it look more bell-shaped. This occurs frequently in biostatistics, for example with health expenditures. A lot of people spend very little on healthcare until health becomes a problem, and then they spend a lot. So distributions like healthcare expenditures tend to be right skewed, especially because they're bounded from below by zero. In settings where errors are feasibly multiplicative, as when dealing with things like concentrations and rates, it's natural to take logs, because the log turns that multiplication into addition. Whenever you're considering ratios, it's useful to take logs, because then you have differences rather than ratios. And if you're not so concerned about the specific number but more concerned about orders of magnitude, say using log base ten, as when considering astronomical distances, then you might often take logs. Counts are also often logged, if your data are, say, the number of infections at a hospital.
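To see the skewness point concretely, here's a small simulation, with made-up lognormal draws standing in for something expenditure-like: before logging, the mean sits well above the median (right skew); after logging, they nearly coincide.

```python
import math
import random
import statistics

random.seed(42)

# hypothetical right-skewed "expenditure-like" data
data = [random.lognormvariate(3.0, 1.0) for _ in range(10000)]

# right skew: the sample mean is pulled above the median by the long tail
assert statistics.mean(data) > statistics.median(data)

logged = [math.log(x) for x in data]

# after logging, mean and median nearly coincide (roughly symmetric)
gap = abs(statistics.mean(logged) - statistics.median(logged))
assert gap < 0.05
```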
Notice that if you have several counts and one of them is zero, then you have a problem taking logs, and you have to come up with some solution for that. So, let me talk a little bit about the geometric mean. I say the sample geometric mean, just so we're using notation consistent with when we talked about the sample mean of data. The sample geometric mean of a data set X1 to Xn is the product of the observations, the product from i equals one to n of Xi, raised to the one over n-th power. Notice that if all the Xs are positive, which is generally the case when you're thinking about geometric means, then the log of the geometric mean is an arithmetic mean: one over n times the summation of log Xi. So it's the arithmetic mean of the logged observations. Let me just repeat that: the log of the geometric mean is the arithmetic mean of the log observations. Because of that, on the log scale the geometric mean has all the properties we already talked about for sample arithmetic means: the law of large numbers applies and the central limit theorem applies. I have a parenthesis here that says, under what assumptions — under whatever assumptions the arithmetic mean needs for the law of large numbers and the central limit theorem to hold. The geometric mean is always less than or equal to the sample arithmetic mean, as a general property. So let me give you a quick example of using geometric means. In some domains people use the geometric mean so frequently that when they talk about the mean, they're referring to the geometric mean, not the arithmetic mean. As an example, when what you're thinking about is inherently multiplicative, you would often think of the geometric mean. So suppose that in a population of interest, the prevalence of a disease rose 2% one year.
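Here's the two equivalent computations of the sample geometric mean in Python, on made-up positive data: the n-th root of the product, and the exponentiated arithmetic mean of the logs, plus the fact that the geometric mean never exceeds the arithmetic mean.

```python
import math
import statistics

x = [1.2, 3.4, 0.8, 5.0, 2.2]
n = len(x)

# geometric mean as the n-th root of the product
gm_direct = math.prod(x) ** (1 / n)

# equivalently: exponentiate the arithmetic mean of the logs
gm_via_logs = math.exp(statistics.mean(math.log(xi) for xi in x))

assert math.isclose(gm_direct, gm_via_logs)

# the geometric mean is always <= the arithmetic mean
assert gm_direct <= statistics.mean(x)
```

Python's standard library also ships `statistics.geometric_mean`, which computes the same thing.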
Then the next year it fell 1%, then the year after that it rose 2%, and then it rose 1% again. If you were thinking about the ending prevalence of the disease, you would inherently multiply the starting prevalence by 1.02, then by 0.99, then by 1.02, then by 1.01, and you would get the ending prevalence. So the geometric mean of this collection of increases and decreases is a relevant quantity to study, and that geometric mean is the product of the four factors raised to the one-fourth power. What's interesting about that is this: if you take the starting prevalence and multiply it by 1.02, 0.99, 1.02 and 1.01, you get the ending prevalence after the four years. If you take the geometric mean and multiply the starting prevalence by it four times, you get the same number. So that's what the geometric mean is, considered in the sense of the arithmetic mean. The arithmetic mean is the number you would have to add four times to get the same end result; the geometric mean is the number you have to multiply by four times to get the same end result. And that's why it's useful. So if you're thinking about things that are inherently multiplicative, like percent increases and decreases, it's common to take the geometric mean. If you work in certain financial sectors, for example, when people say "mean" they're referring to the geometric mean, because it's obviously the more natural thing to talk about. Okay, so just rehashing some of these points. The geometric mean of these four factors works out to about 1.01, and multiplying the initial prevalence by 1.01 to the fourth power — that is, multiplying by it four times — is the same thing as multiplying by the original four numbers in sequence. So 1.01 is the constant factor by which you would need to multiply the initial prevalence each year to achieve the same overall increase or decrease in prevalence over the four-year period.
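The prevalence example is easy to verify numerically; the starting prevalence of 10% below is made up for illustration.

```python
import math

factors = [1.02, 0.99, 1.02, 1.01]    # the four yearly changes
start = 0.10                          # hypothetical starting prevalence

end = start
for f in factors:
    end *= f                          # apply the changes year by year

# geometric mean of the factors: roughly 1.0099, i.e. about 1.01
gm = math.prod(factors) ** (1 / 4)

# multiplying the start by the geometric mean four times gives the same endpoint
assert math.isclose(start * gm ** 4, end)
```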
Take that in contrast to the arithmetic mean: that's the amount you would have to add each year to achieve the same total increase. And in this case, it's clear, to me at least, that the geometric mean makes a lot more sense than the arithmetic mean to talk about. On the next slide — I was thinking about how to explain this, so I googled the geometric mean and the arithmetic mean, and I found a great example on the University of Toronto's website with a really fun geometric interpretation of the two means. If A and B are the lengths of the sides of a rectangle, then the arithmetic mean, A plus B over two, is the length of the side of the square that has the same perimeter as the rectangle. The geometric mean, A times B raised to the one-half power, is the length of the side of the square that has the same area. So if you're interested in multiplicative things like areas, you want the geometric mean of the sides; if you're interested in additive things like perimeters, you want the arithmetic mean. I thought that was really cool when I read it. So, back to statistics. The log of the sample geometric mean is just an average, and so, provided the expected value of log X exists, that average has to converge, by the law of large numbers, to what I'm defining here as mu, equal to the expected value of log X. Remember, the log of the geometric mean is itself just an arithmetic mean, and the law of large numbers tells us what the arithmetic mean converges to: it converges to the population mean. Therefore the log of the geometric mean converges to the expected value of log X, where X is a draw from the original population on the natural scale, not the log scale. So if you want to know what the geometric mean itself converges to: the geometric mean is the exponential of the log of the geometric mean, of course, because e to the log x is x.
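The rectangle-and-square picture checks out arithmetically. A quick sketch with an arbitrary 3-by-12 rectangle:

```python
import math

a, b = 3.0, 12.0                  # sides of a rectangle

am = (a + b) / 2                  # arithmetic mean = 7.5
gm = math.sqrt(a * b)             # geometric mean = 6.0

# a square with side am has the same perimeter as the rectangle
assert math.isclose(4 * am, 2 * (a + b))
# a square with side gm has the same area as the rectangle
assert math.isclose(gm ** 2, a * b)
```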
It would be nice if that worked out to be the expected value of X, but it doesn't, because the exponent can't move inside the expected value: e to the expected value of log X is exactly e to the mu, and that is not the expected value of X. This quantity, e to the mu, the exponent of the expected value of log X, doesn't really have a name, but I like to call it the population geometric mean. Because, you know, the sample arithmetic mean converges to the population mean, the sample variance converges to the population variance, the sample median converges to the population median; by that logic, the sample geometric mean should converge to something called the population geometric mean. So I'm going to call it that. I don't see that too often in books, but what the heck, I'm going to do it. So, to reiterate: the exponent of the expected value of log X is not equal to the expected value of the exponent of log X, which is the expected value of X. So what I'm referring to as the population geometric mean is not equal to the population mean that we defined earlier. It is, however, interesting to note what happens if the distribution of log X is symmetric — and remember, one of the reasons we stated at the beginning of the lecture for taking logs of data is to turn skewed data into data that's more symmetric. If the distribution of log X is symmetric, then consider the median. The median is the point where the probability that log X is less than or equal to it is 0.5, and in this case, because log X is symmetric, mu, the mean on the log scale, is in fact also the median. So this first statement, 0.5 equals the probability that log X is less than or equal to mu, is just reiterating that, for a distribution that's symmetric on the log scale, the mean and the median on the log scale are equal.
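A simulation makes the gap between e to the mu and the expected value of X visible. Here I draw from a lognormal with (hypothetical) log-scale mean 0 and variance 1, so e to the mu is 1 while E[X] is e to the one-half, about 1.65:

```python
import math
import random
import statistics

random.seed(1)
mu, sigma = 0.0, 1.0
x = [random.lognormvariate(mu, sigma) for _ in range(50000)]

sample_mean = statistics.mean(x)     # estimates E[X] = e^(mu + sigma^2/2)
sample_gm = math.exp(statistics.mean(math.log(v) for v in x))  # estimates e^mu

# the geometric mean lands near e^0 = 1; the arithmetic mean near e^0.5 ~ 1.65
assert sample_gm < sample_mean
assert abs(sample_gm - math.exp(mu)) < 0.05
assert abs(sample_mean - math.exp(mu + sigma**2 / 2)) < 0.1
```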
Now, on the interior of this probability statement, because everything is positive and because the exponential function is monotonic, we can exponentiate both sides of the inequality and get that the probability that X on the natural scale — not the log scale — is less than or equal to e to the mu is also 50%. So the conclusion is that, for log-symmetric distributions, the geometric mean is estimating the median. So why am I saying all this? I'm making fairly simple ideas rather complicated. The idea is: you have data, you log it, and you do all the normal stuff you do with data, just on the logged data. What I'm trying to do is relate the quantities you get from doing that back to the natural scale — they have interpretations on the natural scale. You don't have to discard the natural scale units when you log data; you get a lot of interesting interpretations back on the natural scale. So, at any rate, if you use the central limit theorem to create a confidence interval for the log measurements, then your interval is estimating mu, the expected value of the log measurements, in log units. If you exponentiate the interval, then you're estimating e to the mu, the population geometric mean, as I'm calling it. And in the event that the distribution of the log data is itself symmetric, your exponentiated interval is also estimating the median. So this is kind of a backhanded way of getting a confidence interval for the median: if you're willing to assume that the population from which your data is drawn is symmetric on the log scale, then you take the log of the data, create the confidence interval, exponentiate the endpoints, and you wind up with a confidence interval for the median.
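The whole recipe — log the data, build the ordinary t interval, exponentiate the endpoints — can be sketched in a few lines. The measurements here are made up, and the t quantile (t with 19 degrees of freedom at 0.975, about 2.093) is taken from a standard table since Python's standard library has no t distribution:

```python
import math
import statistics

# hypothetical right-skewed measurements; n = 20
x = [2.3, 5.1, 1.8, 9.4, 3.2, 4.7, 12.0, 2.9, 6.5, 3.8,
     1.4, 7.7, 4.1, 2.2, 5.9, 3.3, 10.8, 2.6, 4.4, 6.1]
logs = [math.log(v) for v in x]
n = len(logs)

m = statistics.mean(logs)
s = statistics.stdev(logs)
t_crit = 2.093                        # t_{0.975, 19} from a t table

lo = m - t_crit * s / math.sqrt(n)    # interval for mu = E[log X], in log units
hi = m + t_crit * s / math.sqrt(n)

# exponentiated endpoints estimate e^mu, the population geometric mean
# (and the median, if the log-scale distribution is symmetric)
gm_lo, gm_hi = math.exp(lo), math.exp(hi)
assert gm_lo < math.exp(m) < gm_hi
```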
And remember, we also talked about getting a confidence interval for the median using bootstrapping, but this is a lot easier; it just uses the ordinary t confidence interval. This is especially useful for paired data when their ratio is of interest. So, let's quickly go through an example. Remember, I've quoted before this book by Rosner, Fundamentals of Biostatistics, which I like; it's very thorough and covers a huge chunk of biostatistical topics. At any rate, on page 298 of the version I have, which unfortunately I think is the previous version rather than the current one, it gives a paired design comparing systolic blood pressure for people taking oral contraceptives and matched controls. A paired design is where you have a person and a bunch of covariates you're concerned with when you want to compare, say, oral contraceptive use to controls. You're worried that the group of people who take oral contraceptives is different from the group of people who don't. So what you might do is take this list of things that you think might explain that difference, and match on them, so that a person taking oral contraceptives has a twin, in a sense, in the control group — someone who, at least insofar as the other variables you can measure, is very close. That's the idea of matching. Take matching to the extreme: you couldn't do it in this experiment, but imagine you were investigating aspirin. You would give a person aspirin and then, after a suitable washout period, give them a placebo. That person would be perfectly matched to themselves as their own control. That's the extreme version, but suppose you're in a circumstance like this, where you can't really randomize people to contraceptive use and you couldn't do a crossover experiment like that.
So you would match people as closely as you could on all the other things that you think might differentiate contraceptive users from controls. Anyway, that's a matched design. The point for our discussion is that person one in the oral contraceptive group and person one in the control group are tied together, and we want to use the information that they're similar. So what we might do is take the systolic blood pressure for person one in the oral contraceptive group and the systolic blood pressure for person one in the control group, and analyze their ratio. We might be interested in ratios because of the interpretation: what percent increase or decrease does a person in the contraceptive group have over their associated control? So imagine we took the ratios and then logged them. Well, the log of the ratio is just the difference of the logs of the two measurements. Then we could do an ordinary one-sample t confidence interval for the log ratios, computed matched pair by matched pair. In this case, the geometric mean of the ratios works out to be 1.04, which, given the order in which I was dividing, implies a 4% increase in systolic blood pressure for the oral contraceptive users. For the t interval on the log scale: I took the log of each oral contraceptive user's measurement and the log of the matched control's measurement, and took the difference, pair by pair. So I started with 2n total measurements in pairs and wound up with n measurements on the log scale. I computed an ordinary t interval and got 0.010 to 0.067. In this case the units are log millimeters of mercury. What we're interested in on the log scale is whether zero is in this interval or not — zero is the important value on the log scale. If we exponentiate the interval, we get 1.01 to 1.069.
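The paired version of the calculation looks like this; the blood pressure numbers below are invented (they are not Rosner's data), and the t quantile for 7 degrees of freedom, about 2.365, comes from a table:

```python
import math
import statistics

# hypothetical paired systolic blood pressures (OC user, matched control)
oc      = [132, 128, 141, 119, 137, 125, 130, 144]
control = [127, 126, 135, 118, 130, 124, 127, 139]

# log of the ratio = difference of the logs, pair by pair
log_ratios = [math.log(a) - math.log(b) for a, b in zip(oc, control)]
n = len(log_ratios)

m = statistics.mean(log_ratios)
s = statistics.stdev(log_ratios)
t_crit = 2.365                        # t_{0.975, 7} from a t table

lo = m - t_crit * s / math.sqrt(n)
hi = m + t_crit * s / math.sqrt(n)

# exponentiate: an interval for the geometric mean of the OC/control ratios;
# on this scale the question is whether 1 is inside the interval
ratio_lo, ratio_hi = math.exp(lo), math.exp(hi)
```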
So we estimate, via a 95% confidence interval, a 1% to 7% increase in systolic blood pressure for the oral contraceptive users relative to the controls. On the exponentiated scale we're interested in whether one is in the interval; on the log scale, we're interested in whether zero is in the interval. By the way, if your numbers are small, like 0.010 and 0.067 in this case, exponentiating is about the same as adding one. If you're a math person, take the Taylor expansion of e to the x and go out one term, and you'll see that it's pretty close to one plus x. So you can exponentiate things very quickly just by taking one plus the number. And if the number you're looking at is pretty close to one and you want to log it, take the number minus one — same idea, take the Taylor expansion of the log and go out one term. So if a number is close to zero and you want to exponentiate it, one plus the number works pretty well as an approximation; if a number is close to one and you want to log it, the number minus one does pretty well too. That's a trick that's very useful, for instance in logistic regression and settings like that, where you need to take exponents quickly. So let me talk about this example just a little bit more. This estimate, 1.01 to 1.07, this 1% to 7% estimated increase between the two groups, is a confidence interval for this sort of paired ratio of geometric means, and that's why it's useful: we're estimating a ratio here. Now let's go through the same exact exercise, but instead of paired observations, we have two independent groups. If you log the data from group one, log the data from group two, and create a confidence interval for the
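The one-term Taylor trick is easy to check. For a number x close to zero, e to the x is about one plus x; for a number r close to one, log of r is about r minus one:

```python
import math

x = 0.05                               # a number close to zero

# e^x ~ 1 + x (first-order Taylor expansion of the exponential)
assert abs(math.exp(x) - (1 + x)) < 0.002

r = 1.05                               # a number close to one

# log(r) ~ r - 1 (first-order Taylor expansion of the log at one)
assert abs(math.log(r) - (r - 1)) < 0.002
```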
difference in the group means on the log scale, and then exponentiate it, then that confidence interval is an estimate of e to the mu one divided by e to the mu two — exactly an estimate of the ratio of the population geometric means. Of course, on the log scale it's an estimate of the difference in the expected values, the means on the log scale; but when you exponentiate it, you get exactly an interval for the ratio of the population geometric means. And if you're willing to assume that the data are symmetric on the log scale, then this is also a ratio of the population medians. There's one distribution where taking logs yields a Gaussian, and it's so important that we give it a name: the log normal distribution. A random variable is log-normally distributed if its log is a normally distributed random variable. Note, it's not the log of a normal random variable, as its name kind of implies. You can't take the log of a normal random variable, because normal random variables can be negative, and you can't take logs of negative numbers. So if you want to remember what a log normal random variable is, remember this phrase: "I am log normal" means "take logs of me and I'll be normal" — then you'll remember the correct order. And also keep in mind, when you're assuming something is log normal, that if you're taking the log of something that's possibly negative, you're doing it wrong. Okay, so again, log normal random variables are not logs of normal random variables; as I say here, you can't even take the log of a normal random variable, because it can be negative. So formally, X is log normal with two parameters, mu and sigma squared, if log of X is normal with mean mu and variance sigma squared. And again, that mirrors what we're often doing with logs: we're trying to take logs of things so that, on the log scale, the data are symmetric — and then, hopefully, the population distribution is also symmetric.
And going the other way: if Y is normal with mean mu and variance sigma squared, then e to the Y is log normal. So you can generate a log normal random variable by generating a normal random variable and exponentiating it. I give you the log normal density here; it depends on mu and sigma squared. Its mean is e to the mu plus sigma squared over two, where mu and sigma squared are the mean and variance on the log scale. Its variance is e to the two mu plus sigma squared, times the quantity e to the sigma squared minus one. Its median is e to the mu, and of course what I'm calling its population geometric mean is e to the mu as well. So this gives you an exact example where the expected value of X and e to the expected value of log X are two different things: when X is log normal, the expected value of X is e to the mu plus sigma squared over two, while e to the expected value of log X is e to the mu. Okay. So if X1 to Xn are log normal with parameters mu and sigma squared, then log X1 to log Xn — call them Y1 up to Yn — are normally distributed with mean mu and variance sigma squared. So they satisfy the conditions to create a t confidence interval. And mu is the log of the median of the Xi, so e to the mu gives the median on the original scale; it also gives you the population geometric mean. And again, assuming log normality, exponentiating a t confidence interval for the difference of the means of two logged groups implies that your confidence interval is estimating a ratio of geometric means. So let's go through a quick example of doing this. I'm assuming you can do the arithmetic, because you already know how to create two-group t confidence intervals; all we're doing is logging the data and then doing something you already know how to do. So I just want to go through the interpretation real quick. So imagine you took gray matter volumes — I actually did this for some data that I have.
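These lognormal facts can be checked by simulation: generate normals with made-up parameters mu = 1 and sigma = 0.5, exponentiate, and compare the sample mean, median, and variance against the formulas just stated.

```python
import math
import random
import statistics

mu, sigma = 1.0, 0.5
random.seed(7)

# generate log-normal draws by exponentiating normal draws
x = [math.exp(random.gauss(mu, sigma)) for _ in range(100000)]

theo_mean   = math.exp(mu + sigma**2 / 2)                       # E[X]
theo_median = math.exp(mu)                                      # median = geometric mean
theo_var    = (math.exp(sigma**2) - 1) * math.exp(2*mu + sigma**2)

assert abs(statistics.mean(x)     - theo_mean)   < 0.05
assert abs(statistics.median(x)   - theo_median) < 0.05
assert abs(statistics.variance(x) - theo_var)    < 0.2
```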
I have brain gray matter volumes for a young and an old group, defined as younger than 65 and 65 or older. Of course this doesn't account for being young at heart or whatever — young and old as per my definition; if you're 65, rest assured, I don't think you're old, it's just the definition I'm using here. So we did two separate one-group intervals. For the old group I got 13.24 to 13.27, and for the younger group 13.29 to 13.31, both in units of log cubic millimeters. If you exponentiate those intervals, you get about 564 to 578 cubic centimeters for the old group, and about 592 to 606 cubic centimeters for the young group. Both of these intervals estimate the population geometric mean gray matter volume, among the older and younger groups respectively. If we're willing to assume that the populations of brain volumes are symmetric on the log scale, then both of these intervals also estimate the population median gray matter volume for old and young, respectively. Then, taking the two groups and doing a two-group t interval on the log measurements yields 0.032 to 0.066 in log units; exponentiate this and you get an interval of 1.032 to 1.068 — again, remember the trick: exponentiating a number close to zero is about the same as adding one. So you wind up with about a 3% to 7% higher geometric mean brain volume in the younger group than in the older group. Or, if we're talking about medians — if we're willing to assume the individual populations are symmetrically distributed on the log scale — then that's an estimated 3% to 7% higher median gray matter volume for the younger group. This, of course, is the case because as we age we start to lose a little bit of gray matter volume over time. Of course, you develop more neuronal connections, so you get wiser — maybe more neuronal connections, but a decrease in volume.
So, anyway, what I hope you learned from this is: when you take logs of measurements, create confidence intervals as we've discussed, and exponentiate the intervals, you know what the estimates are referring to. It's a common issue — people do this all the time, but I'm not sure people always understand exactly what they're doing. That's why I devoted an entire lecture to the subject of logging, which in practice is a trivial extension of what we've already done: take logs of your data, do what we already do, and then exponentiate the intervals. So there's no change in what we're doing, but I wanted everyone to understand exactly what the implications are, and why the log is special in the sense that it yields uniquely interpretable results, as opposed to other functions. You could, say, take the cube root of the data, create the confidence interval on the cube root scale, and then raise the interval endpoints to the third power — and you wouldn't get the same nice interpretations that you do with the log. Log is special that way. Alright, well, thanks, troops. This was our last lecture. I hope you enjoyed the class, and I hope you survived the intense biostatistical training. I hope you go on to do great things with this knowledge and all the other courses you take from Coursera.