
In this lecture, we're going to talk about what it means to log data, and what impact that has when you do things like take arithmetic means of logged data and create confidence intervals. So we'll talk about logs; we'll talk about the geometric mean, which is intrinsically related to taking logs of data and taking arithmetic means; and we'll talk about the geometric mean's relationship with the law of large numbers and the central limit theorem. Then we'll go back through some of the techniques we've already covered, like creating t confidence intervals, but go over how they're interpreted with respect to logged data. And we'll finish by talking about the log-normal distribution.

So, just to remind everyone a little bit about logs: log base b of a number x is the number y such that b to the y equals x. Log base b of one is always zero, because b to the zero equals one.

And log base b of x travels to minus infinity as x travels to zero. For this class, we've been writing just "log" when the base is e, Euler's number; people sometimes write "ln" in that case.

There are basically only three bases for logs that people ever use. Base e has a lot of nice mathematical properties. Base ten is nice because the log then speaks in orders of magnitude: log base ten of ten is one, log base ten of 100 is two, log base ten of 1,000 is three, and so on. And log base two is often very useful as well; because two is a smaller base than ten, you get lower powers, which is often convenient. And just to remind everyone: log of AB is log A plus log B, log of A raised to the B power is B log A, and log of A divided by B is log A minus log B. In other words, the log turns multiplication into addition, division into subtraction, and powers into multiplication.
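These identities are easy to check numerically; here's a quick sketch in Python (the particular numbers are arbitrary):

```python
import math

a, b = 8.0, 5.0

# log turns multiplication into addition
assert math.isclose(math.log(a * b), math.log(a) + math.log(b))
# division into subtraction
assert math.isclose(math.log(a / b), math.log(a) - math.log(b))
# powers into multiplication
assert math.isclose(math.log(a ** b), b * math.log(a))

# and base ten counts orders of magnitude
print(math.log10(10), math.log10(100), math.log10(1000))
```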

So hopefully none of this is news to you; those are the mathematical properties of the log. But statistically, why do we take logs of data? The most common reason is that the data are skewed high. Incomes are a great traditional example: you have a lot of people making very little money and a handful of people making a lot of money, so the distribution has a hump toward zero and spreads out with a long tail toward high values. People often take logs of income data to try to make it look more bell-shaped.

This occurs frequently in biostatistics; take health expenditures, for example. A lot of people spend very little on healthcare until it becomes a problem, and then they spend a lot. So distributions like healthcare expenditures tend to be right skewed, especially because they're bounded from below by zero. In settings where errors are plausibly multiplicative, as when dealing with concentrations or rates, it's natural to take logs, because the log turns that multiplication into addition. Whenever you're considering ratios, it's useful to take logs, because you then have differences rather than ratios. And if you're not so concerned about the specific number but more about orders of magnitude, say with log base ten when you're considering astronomical distances, then you might take logs as well. Finally, counts are often logged, if your data are, say, the number of infections at a hospital or something like that.
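As a small sketch of logging count data: one common workaround for zero counts, which would otherwise log to minus infinity, is to log x plus one instead (that choice is my assumption here, not something the lecture prescribes):

```python
import numpy as np

# hypothetical monthly infection counts at a hospital
counts = np.array([0, 1, 2, 3, 10, 250])

# np.log(counts) would give -inf for the zero count, so one common
# workaround is to take log(x + 1) instead:
logged = np.log1p(counts)  # log1p(x) = log(1 + x), accurate near zero
print(logged[0])           # 0.0 -- the zero count maps to zero
```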

Notice that if you're logging several counts and one of them is zero, you have a problem taking logs, and you have to come up with some solution for that.

So let me talk a little bit about the geometric mean. I say the sample geometric mean just so we're using the same notation as when we talked about the sample mean of data. The sample geometric mean of a data set X1 to Xn is the product of the observations, the product from i equals one to n of Xi, raised to the one over n power.

And notice that if all the X's are positive, which is generally the case when you're thinking about geometric means, the log of the geometric mean is an arithmetic mean: one over n times the summation of log Xi. So it's the arithmetic mean of the logged observations. Let me just repeat that, because it's the key fact: the log of the geometric mean is the arithmetic mean of the logged observations.
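As a quick numerical check on that fact, with a made-up positive data vector:

```python
import numpy as np

x = np.array([1.2, 3.5, 0.8, 2.0, 5.1])  # arbitrary positive data

# the definition: (product of the x_i) raised to the 1/n power
gm_direct = np.prod(x) ** (1 / len(x))

# equivalently: exponentiate the arithmetic mean of the logs
gm_via_logs = np.exp(np.mean(np.log(x)))

assert np.isclose(gm_direct, gm_via_logs)
# and the geometric mean never exceeds the arithmetic mean
assert gm_direct <= np.mean(x)
```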

On the log scale, then, the geometric mean has all the properties we already talked about for sample arithmetic means: the law of large numbers applies and the central limit theorem applies. I have a parenthesis on the slide that says "under what assumptions" — under whatever assumptions were needed for the arithmetic mean to satisfy the law of large numbers and the central limit theorem. Also, as a general property, the geometric mean is always less than or equal to the sample arithmetic mean.

So let me give you a quick example of using geometric means. In some domains, people use the geometric mean so frequently that when they talk about "the mean," they're referring to the geometric mean, not the arithmetic mean.

As an example, when what you're thinking about is inherently multiplicative, you would often think of the geometric mean. So suppose that, in a population of interest, the prevalence of a disease rose 2% one year, fell 1% the next year, rose 2% the year after that, and then rose 1% again. If you wanted the ending prevalence, you would inherently multiply: the starting prevalence times 1.02, times 0.99, times 1.02, times 1.01. So the geometric mean of this collection of increases and decreases would be a relevant quantity to study, and it's the product of the four factors raised to the one-fourth power.

And what's interesting is this: if you take the starting prevalence and multiply it by 1.02, 0.99, 1.02, and 1.01, you get the ending prevalence after the four years. If instead you take the geometric mean and multiply the starting prevalence by it four times, you get the same number. That's the sense in which the geometric mean parallels the arithmetic mean: the arithmetic mean is the number you would have to add four times to get the same end result; the geometric mean is the number you would have to multiply by four times to get the same end result. And that's why it's useful.
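Here's that bookkeeping verified numerically, using the lecture's four yearly factors (the starting prevalence is a hypothetical value I've made up for illustration):

```python
import numpy as np

factors = np.array([1.02, 0.99, 1.02, 1.01])  # the four yearly changes
start = 0.10                                   # hypothetical starting prevalence

gm = np.prod(factors) ** (1 / 4)

# multiplying by the four factors in sequence...
end_sequence = start * np.prod(factors)
# ...gives the same ending prevalence as multiplying by the
# geometric mean four times
end_gm = start * gm ** 4

assert np.isclose(end_sequence, end_gm)
print(gm)  # about 1.0099, i.e. roughly a 1% average yearly increase
```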

So if you're thinking about things that are inherently multiplicative, like percent increases and decreases, it's common to take the geometric mean. In certain financial sectors, for example, when people say "mean" they're referring to the geometric mean, because it's the more natural thing to talk about.

Okay, so just rehashing some of these points. Multiplying the initial prevalence by the geometric mean, about 1.01, four times is the same thing as multiplying by the original four numbers in sequence. So roughly 1.01 is the constant factor by which you would need to multiply the initial prevalence each year to achieve the same overall change in prevalence over the four-year period. Contrast that with the arithmetic mean, which is the constant amount you would have to add each year to achieve the same total increase. And in this case it's clear, to me at least, that the geometric mean makes a lot more sense than the arithmetic mean to talk about.

On the next slide: when I was thinking about how to explain this, I googled the geometric mean and the arithmetic mean, and I found a great example on the University of Toronto's website with a really fun geometric interpretation of the two. If A and B are the lengths of the sides of a rectangle, then the arithmetic mean, A plus B over two, is the side length of the square that has the same perimeter as the rectangle, while the geometric mean, the square root of A times B, is the side length of the square that has the same area. So if you're interested in multiplicative things like areas, you want the geometric mean of the sides; if you're interested in additive things like perimeters, you want the arithmetic mean. I thought that was really cool when I read that.

So, back to statistics. The log of the sample geometric mean is just an average. So, provided the expected value of log X exists, that average has to converge, just by the law of large numbers, to what I'm defining here as mu, the expected value of log X. Remember, the log of the geometric mean is itself just an arithmetic mean, and the law of large numbers tells us what an arithmetic mean converges to:

It converges to the population mean. Therefore the log of the geometric mean converges to the expected value of log X, where X is a draw from the original population on the natural scale, not the log scale. So if you want to know what the geometric mean itself converges to: the geometric mean is the exponential of the log of the geometric mean, of course, because e to the log x is x. It would be nice if that worked out to be the expected value of X, but it doesn't, because the exponential can't move inside the expected value: e to the expected value of log X is just e to the mu, and that is not the expected value of X. This quantity, e to the mu, the exponential of the expected value of log X, doesn't really have a name.

But I like to call it the population geometric mean, because, you know, the sample arithmetic mean converges to the population mean, the sample variance converges to the population variance, and the sample median converges to the population median. By that logic, the sample geometric mean should converge to something called the population geometric mean, so I'm going to call it that. I don't see the term too often in books, but what the heck, I'm going to use it. So, to reiterate: the exponential of the expected value of log X is not equal to the expected value of the exponential of log X, which would just be the expected value of X. What I'm referring to as the population geometric mean is not equal to the population mean we defined earlier.
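A small simulation makes the distinction concrete. Here I draw skewed positive data (log-normal draws, my choice for illustration) where the two quantities are known to differ:

```python
import numpy as np

rng = np.random.default_rng(0)
# right-skewed positive data: log X ~ N(0, 1), so
# E[X] = e^(0 + 1/2) ~ 1.65 while e^(E[log X]) = e^0 = 1
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

population_mean_estimate = x.mean()                   # near 1.65
population_geometric_mean = np.exp(np.log(x).mean())  # near 1.00

# the two estimands are genuinely different quantities
print(population_mean_estimate, population_geometric_mean)
```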

It is, however, interesting to note what happens if the distribution of log X is symmetric; remember, one of the reasons we gave at the beginning of the lecture for taking logs is to turn skewed data into data that's more symmetric. So suppose the distribution of log X is symmetric, and consider the median: the median is the point such that the probability that log X is less than or equal to it is 0.5. Because log X is symmetric, mu, the mean on the log scale, is in fact also the median on the log scale. So the statement that 0.5 equals the probability that log X is less than or equal to mu is just restating that, for a distribution that's symmetric on the log scale, the mean and the median on the log scale agree. But now, inside that probability statement, because everything is positive and the exponential function is monotonic, we can exponentiate both sides of the inequality and get that the probability that X, on the natural scale, is less than or equal to e to the mu is also 50%. So the conclusion is that, for log-symmetric distributions, the geometric mean is estimating the median.
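You can see that conclusion in a simulation. Here the data are, by construction, symmetric (in fact normal) on the log scale, so the sample geometric mean and the sample median should land in the same place:

```python
import numpy as np

rng = np.random.default_rng(1)
# data that are normal on the log scale, with mu = 2 there
x = np.exp(rng.normal(loc=2.0, scale=0.5, size=100_000))

sample_geometric_mean = np.exp(np.log(x).mean())
sample_median = np.median(x)

# both estimate e^mu = e^2 ~ 7.39
print(sample_geometric_mean, sample_median)
```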

So why am I saying all this? I'm making fairly simple ideas rather complicated. The idea is: you have data, you log it, and you do all the normal stuff you would do with your data, just on the logged data. What I'm trying to do is relate the quantities you get from doing that back to the natural scale. You don't have to discard the natural-scale units when you log data; you get a lot of interesting interpretations back on the natural scale.

So, at any rate: if you use the central limit theorem to create a confidence interval for the logged measurements, then your interval is estimating mu, the expected value of log X, in log units. If you exponentiate the interval, you're estimating e to the mu, the population geometric mean, as I'm calling it. And in the event that the distribution of the log data is itself symmetric, your exponentiated interval is also estimating the median. So this is kind of a backhanded way of getting a confidence interval for the median: if you're willing to assume that the population from which your data are drawn is symmetric on the log scale, then when you take the log of the data, create the t confidence interval, and exponentiate the endpoints, you wind up with a confidence interval for the median. And remember, we also talked about getting a confidence interval for the median using bootstrapping; this is a lot easier, since it just uses the ordinary t confidence interval. It's especially useful for paired data when the ratio is of interest.
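The whole recipe fits in a few lines. This is a sketch under my own naming (the function `exp_t_interval` is mine, not the lecture's), using scipy's t quantiles; for paired data you would pass in the pairwise ratios:

```python
import numpy as np
from scipy import stats

def exp_t_interval(x, conf=0.95):
    """t interval for E[log X], exponentiated back to the natural
    scale: it estimates the population geometric mean, and the
    median too if log X is symmetric."""
    logx = np.log(np.asarray(x, dtype=float))
    n = len(logx)
    se = logx.std(ddof=1) / np.sqrt(n)
    t = stats.t.ppf((1 + conf) / 2, df=n - 1)
    return np.exp(logx.mean() - t * se), np.exp(logx.mean() + t * se)

# hypothetical positive, right-skewed measurements
rng = np.random.default_rng(2)
x = rng.lognormal(mean=1.0, sigma=0.4, size=50)
lo, hi = exp_t_interval(x)
print(lo, hi)  # an interval for the geometric mean, near e^1 ~ 2.72
```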

So, let's quickly go through an example.

Remember, I've quoted before from this book by Rosner, Fundamentals of Biostatistics, which I like; it's very thorough and covers a huge chunk of biostatistical topics. At any rate, on page 298 of the version I have, which I think is the previous edition rather than the current one, it gives a paired design comparing systolic blood pressure for people taking oral contraceptives and matched controls. A paired design is where you have a person, and a bunch of covariates you're concerned about, and you want to compare, say, oral contraceptive users to controls. You're worried that the group of people who take oral contraceptives is different from the group of people who don't. So what you might do is take the list of things you think might explain that difference and match on them, so that each person taking oral contraceptives has a twin, in a sense, in the control group: someone who, at least insofar as the variables you can measure, is very close. That's the idea of matching. Now take matching to the extreme.

Â experiment, but imagine if you were investigating aspirin.

Â You would, say, give a person an aspirin and then after a suitable wash out period,

Â give them a placebo. And then that person would be perfectly

Â matched to themselves as their own control.

Â So that's the extreme version of this case, but let's suppose you're in a

Â circumstance like this where you can't really, randomize people to contraceptive

Â use. You couldn't do crossover experiment like

Â that. So you could match people as closely as

Â you could on all these other things that you think might differentiate

Â contraceptive users from controls. And match them as closely as possible.

Anyway, that's a matched design. The point for our discussion is that person one in the oral contraceptive group and person one in the control group are tied together, and we want to utilize the information that they're similar. So what we might do is take the systolic blood pressure for person one in the oral contraceptive group and the systolic blood pressure for person one in the control group, and analyze their ratio. We might be interested in ratios because of the interpretation: what percent increase or decrease does a person in the contraceptive group have relative to their associated control? And if we take the ratios and then log them, that's just the difference of the two logged measurements, so we can do an ordinary one-sample t confidence interval on the log ratios, computed matched pair by matched pair. In this case, the geometric mean of the ratios works out to be 1.04, which, given the order in which I was dividing, implies a 4% increase in systolic blood pressure for the oral contraceptive users. And I did the t interval on the log scale.

That is, for each pair I took the log of the oral contraceptive user's measurement minus the log of the control's measurement. So I started with 2n total measurements in pairs and wound up with n measurements on the log scale. I computed an ordinary t interval on those and got 0.010 to 0.067; the units here would be log millimeters of mercury. On the log scale, what we're interested in is whether zero is in the interval, right? Zero is the important number on the log scale. If we exponentiate the interval, we get 1.01 to 1.069: an estimated, via the 95% confidence interval, 1% to 7% increase in systolic blood pressure for the oral contraceptive users relative to the controls. So on the exponentiated scale we're interested in whether one is in the interval; on the log scale, whether zero is.

By the way, if your numbers are small, like 0.010 and 0.067 here, exponentiating is about the same as adding one: if you're a math person, take the Taylor expansion of e to the x, go out one term, and you'll see e to the x is pretty close to one plus x. So you can exponentiate small numbers very quickly by just taking one plus the number. In the other direction, if the number you're looking at is pretty close to one and you want to log it, subtract one: take the Taylor expansion of the log, go out one term, and the log of y is pretty close to y minus one. So: for a number close to zero that you want to exponentiate, one plus does pretty well; for a number close to one that you want to log, minus one does pretty well. That's a trick that's very useful when you do things like logistic regression, where you need to take exponents quickly.
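Here are both first-order approximations checked against the exact values, using the interval endpoints from this example:

```python
import math

# for x near zero, e^x ~ 1 + x (first-order Taylor expansion)
for x in (0.010, 0.067):
    print(math.exp(x), 1 + x)

# for y near one, log(y) ~ y - 1
for y in (1.010, 1.069):
    print(math.log(y), y - 1)
```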

Let me talk about this example just a little bit more. This estimate, 1.01 to 1.07, this 1% to 7% estimated increase between the two groups, is a confidence interval for a paired ratio of geometric means. That's why it's useful: we're estimating a ratio here.

Now let's go through the same exercise, but with two independent groups instead of paired observations. If you log the data from group one, log the data from group two, create a confidence interval for the difference in the group means on the log scale, and then exponentiate it, what you're estimating is e to the mu one divided by e to the mu two: the exponentiated confidence interval is exactly an estimate of the ratio of the population geometric means. On the log scale it's an estimate of the difference in the means of the logs; exponentiated, it's an interval for the ratio of the population geometric means. And if you're willing to assume that the data are symmetric on the log scale, then it's also an interval for the ratio of the population medians.

There's one distribution for which taking logs yields a Gaussian, and it's so important that we give it a name: the log-normal distribution. A random variable is log-normally distributed if its log is a normally distributed random variable.

Note that it's not the log of a normal random variable, as the name kind of implies; you can't take the log of a normal random variable, because normals can be negative. If you want to remember which way it goes, remember this phrase: "I am log-normal" means "take logs of me and then I'll be normal." Then you'll remember the correct order. And when you're assuming something is log-normal, if you find yourself taking the log of something that can be negative, you're doing it wrong. So again: log-normal random variables are not logs of normal random variables; as I say here, you can't even take the log of a normal random variable, because it can be negative.

So, formally: X is log-normal, with two parameters mu and sigma squared, if log of X is normal with mean mu and variance sigma squared. Again, that mirrors what we're often doing with logs: we're trying to take logs so that, on the log scale, the data are symmetric, and hopefully the population distribution on the log scale is symmetric too. And since log of X is normal when X is log-normal, if Y is normal with mean mu and variance sigma squared, then e to the Y is log-normal. So you can generate a log-normal random variable by generating a normal random variable and exponentiating it.

I give you the log-normal density on the slide if you want it; it depends on mu and sigma squared. The mean of a log-normal is e to the mu plus sigma squared over two, where mu and sigma squared are the mean and variance on the log scale. Its variance is e to the two mu plus sigma squared, times the quantity e to the sigma squared minus one. And its median is e to the mu; of course, what I'm calling its population geometric mean is e to the mu as well. So this gives you an exact example where the expected value of X and e to the expected value of log X are two different things: when X is log-normal, the expected value of X is e to the mu plus sigma squared over two, while e to the expected value of log X is e to the mu. Okay.
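All of these facts can be checked by simulation, generating the log-normals exactly as described, by exponentiating normals (the particular mu and sigma here are my choices):

```python
import numpy as np

mu, sigma = 1.0, 0.5
rng = np.random.default_rng(3)

# a log-normal is e to a normal
x = np.exp(rng.normal(mu, sigma, size=200_000))

print(x.mean())      # ~ e^(mu + sigma^2/2) = e^1.125 ~ 3.08
print(np.median(x))  # ~ e^mu = e ~ 2.72  (also the geometric mean)
print(x.var())       # ~ e^(2 mu + sigma^2) (e^(sigma^2) - 1) ~ 2.69
```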

So if X1 through Xn are log-normal with parameters mu and sigma squared, then log X1 through log Xn, which I'll call Y1 up to Yn, are normally distributed with mean mu and variance sigma squared, so they satisfy the conditions for a t confidence interval. And mu is the log of the median of the Xi, so e to the mu gives the median on the original scale; it also gives you the population geometric mean. And, again, assuming log-normality, exponentiating a t confidence interval for the difference of two log-scale means implies that your confidence interval is estimating a ratio of geometric means.

Let's go through a quick example of doing this. I'm assuming you can do the arithmetic, because you already know how to create two-group t confidence intervals; all we're doing is logging the data first. So I just want to go through the interpretation real quick. Imagine you took gray matter volumes; I actually did this for some data I have. I have brain gray matter volumes for a young and an old group, defined as younger than 65 and 65 or older. Of course this doesn't account for being young at heart or whatever; young and old are just per my definition here, and if you're 65, rest assured, I don't think you're old. So we did two separate group intervals.

For the old group we got 13.24 to 13.27, and for the younger group 13.29 to 13.31, both in units of log cubic centimeters. If you exponentiate those intervals, you get about 564 to 578 cubic centimeters for the old group and about 592 to 606 cubic centimeters for the young group. Both of these intervals estimate the population geometric mean gray matter volume for the older and younger groups, respectively. And if we're willing to assume that the populations of brain volumes are symmetric on the log scale, then both intervals estimate the population median gray matter volume for old and young, respectively. Then, taking the two groups together, a two-group t interval on the log measurements yields 0.032 to 0.066 log cubic centimeters. Exponentiate this and you get an interval of 1.032 to 1.068; again, remember the trick, you can just add one when you exponentiate something close to zero. So you wind up with about a 3% to 7% higher geometric mean brain volume in the younger group than in the older group, or, if we're willing to assume the individual populations are symmetrically distributed on the log scale, an estimated 3% to 7% higher median gray matter volume for the younger group.
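That last exponentiation step, with the lecture's log-scale endpoints, is a one-liner, and it also shows how close the add-one shortcut comes:

```python
import numpy as np

# the two-group t interval on the log scale, from the lecture
log_lo, log_hi = 0.032, 0.066

print(np.exp(log_lo), np.exp(log_hi))  # ~ 1.033 and ~ 1.068
# the quick approximation: just add one to each endpoint
print(1 + log_lo, 1 + log_hi)          # 1.032 and 1.066 -- very close
```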

This, of course, is the case because as we age we start to lose a little gray matter volume over time. Of course, you develop more neuronal connections, so you get wiser: maybe more connections, but less volume. So, anyway, what I hope you learned from this is that when you take logs of measurements, create confidence intervals the way we've talked about, and exponentiate the intervals, you know what those estimates are referring to.

It's a common situation; people do this all the time, but I'm not sure they always understand exactly what they're doing. That's why I devoted an entire lecture to the subject of logging, which in practice is a trivial extension of what we've already done: take logs of your data, do what we already do, and then exponentiate the intervals. So there's no change in what we're doing, but I wanted everyone to understand exactly what the implications are, and why the log is special in the sense that it yields uniquely interpretable results, as opposed to other transformations. You could, say, take the cube root of the data, create the confidence interval on the cube-root scale, and then raise the interval endpoints to the third power, but you wouldn't get the same nice interpretations as you do with the log.

The log is special that way. Alright, well, thanks, troops. This was our last lecture. I hope you enjoyed the class, survived the intense biostatistical training, and go on to do great things with this knowledge and all the other courses you take from Coursera.
