[MUSIC] Hi, in this module, I'm going to introduce you to correlation and regression. These are basic statistical techniques. You can take semester-long or even year-long courses introducing you to the basics as well as various extensions. All I can do here is give you a taste of what they are and some sense of how to interpret the results.

So correlation and regression are typically used to study relationships among continuous variables. When we looked at tabulation, we used it to look at relationships among what we referred to as categorical variables. For example, we looked at different categories of year and different categories of educational attainment, and then looked at how the population was sorted into the categories of education according to year. So tabulation works well for a wide range of variables that are inherently categorical, for example, race and ethnicity. These are things for which there is no quantitative measure; you have to divide people into groups and then tabulate. There are advanced regression-related methods for categorical variables, but we don't have time to get into them now.

Continuous variables are variables that we can measure as a quantity: things like income, height, and weight, things we think of as continuous. So correlation and regression are used to study relationships among these sorts of variables, things that we want to preserve in their original form as continuous variables, even though, as we saw in the previous example, sometimes we can turn them into categories.

Basically, correlation and regression measure the strength of the linear relationship between two variables. Linear refers to the idea that single-unit changes in one variable are associated with fixed changes in the other variable. So, for example, if we regress weight on height, a linear regression will tell us, for a one-inch increase in height, what is the average increase in weight in a population?

By themselves, and this is extremely important, correlation and regression have nothing to say about cause and effect. Saying anything about cause and effect, as we learned in previous lectures, really requires a proper experimental design, one that includes exogenous variation in the x variable that we are sure is not related to other variables that are also influencing our outcome or y variable.

So what does regression do? We often want to assess whether a unit change in one variable, which could be height, or education measured in years (five years, six years, seven years, eight years of education), is systematically associated with changes in some outcome variable or y variable. So here we have a hypothetical pair of variables: a y variable that we're trying to explain, and an x variable that we think might be driving or influencing y. What a regression does is measure, on average, if you increase x by one, as shown in the blue figure here, what is the average change in y? That's all that regression is; it's an attempt to come up with an estimate of the linear association between changes in x and changes in y. This is the same figure, blown up to make it a little clearer, so you can see that again we're trying to estimate what sort of change in y is associated with a one-unit change in x. That's what we're trying to measure.
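If you'd like to see that idea in code, here is a minimal sketch in Python (not from the lecture itself) that fits a least-squares line with numpy. The height and weight numbers are invented purely for illustration.

    import numpy as np

    # Made-up illustrative data: heights in inches, weights in pounds.
    height = np.array([63, 65, 66, 68, 70, 71, 73, 75])
    weight = np.array([120, 140, 140, 155, 165, 170, 185, 200])

    # np.polyfit with degree 1 fits the least-squares line y = slope*x + intercept.
    slope, intercept = np.polyfit(height, weight, 1)

    # The slope is the regression coefficient: the average change in weight
    # (pounds) associated with a one-inch change in height.
    print(slope)

Note that nothing in this calculation says anything about causation; the slope is just a summary of the linear association in the data.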
So as I just said, regression measures the average change in y associated with a one-unit change in x. That average change, in this case 1.3 (every unit increase in x is associated with a 1.3-unit increase in y), is what we refer to as the regression coefficient. That's what comes out of our estimation of a regression model.

Now I want to talk about correlation, which is actually just a special case of linear regression. Correlation measures the number of standard deviations of change in some outcome variable y associated with a one standard deviation change in x. Actually, in the case of correlation, what we choose as y and what we choose as x is arbitrary. We can reverse them, and we still get the same correlation.

So let's think about two variables. We'll just call them x and y, but we could flip them and we'd end up with the same answer. We have an x variable with a mean of 4.4, and a y variable with a mean of 10.77. The standard deviation of x is 3.16 and the standard deviation of y is 4.75. As you may remember, standard deviation is a measure of the amount of spread or variation within a distribution. I can't get into too much detail right now; it's basic statistics, and you may want to go back and check your textbooks if you need a refresher.

Now, if we think about the change in y associated with a change in x, correlation is based on the change in y that occurs when we move x by an entire standard deviation. So here, if we move x by 3.16, the bar that just turned solid, the one that runs between the horizontal solid blue line and the red regression line, shows the amount of change in y associated with a one standard deviation change in x. What we want for the correlation coefficient is that change in y, expressed as a fraction of the standard deviation of y. Whatever that number is, it's about four: we move from about 10 to about 14 when we move x up by 3.16. The correlation coefficient is just that shift, that four, divided by 4.75, the standard deviation of y, or roughly 0.84. So we can think about the correlation of x and y as the height of that solid bar that we've circled here, divided by the height of that dashed bar, again, that we've circled. That is, the change in y associated with a one standard deviation change in x, divided by the standard deviation of y.

Now, another way of thinking about the correlation coefficient is in terms of standardized scores. Standardization refers to taking a set of numbers, perhaps our x variable, and for every observation in x, subtracting the mean of x and then dividing by the standard deviation of x. This produces a standardized score that reflects the number of standard deviations that a particular observation is away from the mean of that variable. So a standardized score of two means that the value of x was two standard deviations above its mean.

An easy example is intelligence tests. Intelligence tests are actually constructed by design to have a mean of 100 and a standard deviation of 15. So for somebody with an IQ of 130, if we subtract 100 from 130, that gives us 30. Then we divide by 15, the standard deviation, and we get two. So a z score of two implies being two standard deviations above the mean, in this case an IQ of 130.
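To make these equivalent views concrete, here is a short Python sketch (again, not part of the lecture; the x and y values are invented) showing that the correlation coefficient equals the regression slope rescaled by the ratio of the standard deviations, and also equals the slope you get after standardizing both variables.

    import numpy as np

    # Invented data, purely for illustration.
    x = np.array([1.0, 2, 3, 5, 8, 4, 9, 3.2])
    y = np.array([5.0, 7, 10, 12, 16, 9, 18, 9.2])

    # Ordinary regression slope: average change in y per one-unit change in x.
    slope = np.polyfit(x, y, 1)[0]

    # Rescale the slope into standard-deviation units: r = slope * sd(x) / sd(y).
    r_from_slope = slope * x.std() / y.std()

    # Standardize both variables (subtract the mean, divide by the standard
    # deviation) and re-run the regression; the slope is now the correlation.
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    r_from_z = np.polyfit(zx, zy, 1)[0]

    # All three quantities agree.
    print(r_from_slope, r_from_z, np.corrcoef(x, y)[0, 1])

    # The IQ example from above: (130 - 100) / 15 gives a z score of 2.
    print((130 - 100) / 15)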
Now, standardized values are also called z scores; you may have heard of them in that context somewhere else. So the correlation coefficient, usually denoted with a lowercase italic r, is the average change in the z score of y, the standardized score of y, associated with a one-unit change in the standardized score of x, that is, a one standard deviation change in x.

Now, it's important to keep in mind that one of the special properties of correlation coefficients is that, because they reflect associations among standardized scores, they are unit-less. The standardized scores divide out the original units in which we measured whatever it is we're looking at. So suppose we're looking at the correlation between, perhaps, weight and height. If we ran a regression, we might get a coefficient reflecting the average increase in pounds associated with a one-inch change in height. And if we recalculated the data to use centimeters and kilograms, and then looked at the number of kilograms of change associated with a one-centimeter change in height, we'd get a different regression coefficient. The correlation coefficient for both sets of data, however, would be the same. Because in both cases, when we take our inches and pounds data and convert to standardized scores, subtracting the means and dividing by the standard deviations, we sweep away the original units and are left with unit-less standardized scores. So correlation coefficients are unaffected by the underlying units in which values are measured.

Correlation coefficients are also always between minus one and one. A one standard deviation change in one variable can never be associated with more than a one standard deviation change in the other variable. That's a mathematical result that we don't have time to prove here.
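Here is one last Python sketch (invented data again, the same made-up heights and weights as before) of that unit-invariance point: rescaling inches to centimeters and pounds to kilograms changes the regression slope but leaves the correlation untouched, and the correlation lands between minus one and one.

    import numpy as np

    height_in = np.array([63, 65, 66, 68, 70, 71, 73, 75])
    weight_lb = np.array([120, 140, 140, 155, 165, 170, 185, 200])

    # Convert to metric units: a linear rescaling of both variables.
    height_cm = height_in * 2.54
    weight_kg = weight_lb * 0.4536

    print(np.polyfit(height_in, weight_lb, 1)[0])   # slope in pounds per inch
    print(np.polyfit(height_cm, weight_kg, 1)[0])   # slope in kg per cm: different
    print(np.corrcoef(height_in, weight_lb)[0, 1])  # correlation...
    print(np.corrcoef(height_cm, weight_kg)[0, 1])  # ...identical after rescaling

Because standardizing divides out whatever units the data were recorded in, any (positive) linear rescaling of the variables leaves r exactly where it was.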