0:07

Imagine that we have two variables, X and Y. Perhaps they are education and health, and we're interested in whether or not the relationship between them is causal. The traditional approach to dealing with the omitted variables that might create a spurious relationship between them is to measure them and then control for them in the analysis.

0:44

Now, it's important to note that if an omitted variable only affects the outcome, or only affects the right-hand-side variables, then it won't lead to a spurious relationship if we don't control for it. There may be other reasons to control for such omitted variables, but eliminating the possibility of a spurious relationship is not one of them.
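A small simulation can make this point concrete. The sketch below is only an illustration: the effect size of 2.0, the coefficient of 3.0 on w, the sample size, and all variable names are assumptions, not anything taken from the lecture. It compares a w that affects only the outcome with a w that affects both x and y.

```python
import random

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 20000
TRUE_EFFECT = 2.0        # assumed causal effect of x on y

# Case 1: w affects only the outcome y, not x.
x1 = [rng.gauss(0, 1) for _ in range(n)]
w1 = [rng.gauss(0, 1) for _ in range(n)]
y1 = [TRUE_EFFECT * x + 3.0 * w + rng.gauss(0, 1) for x, w in zip(x1, w1)]
slope_outcome_only = ols_slope(x1, y1)

# Case 2: w affects both x and y, so leaving it out distorts the slope.
w2 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [w + rng.gauss(0, 1) for w in w2]
y2 = [TRUE_EFFECT * x + 3.0 * w + rng.gauss(0, 1) for x, w in zip(x2, w2)]
slope_confounded = ols_slope(x2, y2)

print("w affects only y:", round(slope_outcome_only, 2))
print("w affects x and y:", round(slope_confounded, 2))
```

In the first case the slope of y on x should land close to the assumed effect even though w is left out. In the second case it should come out well above it, which is the spurious component that controlling for w is meant to remove.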

1:11

Imagine that we have some variable X, education, and some outcome variable Y, health. We're interested in the relationship between them, but we're worried that there's some W out there, perhaps parental characteristics, that affects both of them and might lead to a spurious relationship. To control for W, we compare values of the outcome Y among subjects with different values of X to see if there's a systematic relationship between X and Y, but we do so within subgroups that have identical values of W. So basically we can imagine dividing our sample into subgroups within which the values of W, the parental characteristics, are identical, and then within each of these we look at the relationship between X and Y.

2:08

We call this holding a variable, in this case W, constant, because we're looking at the relationship between X and Y in a setting where the value of W is not changing. We then examine whether the relationship between X and Y holds across different values of W.
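The idea of holding W constant can be sketched in a few lines of Python. Everything here is hypothetical: the records, the two W categories ("low" and "high"), and the helper name `mean_difference` are made up for illustration. X is coded 0 or 1, and Y is some outcome score.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records of (w, x, y); not real data.
records = [
    ("low", 0, 60), ("low", 0, 62), ("low", 0, 58), ("low", 1, 60),
    ("high", 1, 80), ("high", 1, 82), ("high", 1, 78), ("high", 0, 80),
]

def mean_difference(rows):
    """Mean of y where x == 1 minus mean of y where x == 0."""
    y1 = [y for _, x, y in rows if x == 1]
    y0 = [y for _, x, y in rows if x == 0]
    return mean(y1) - mean(y0)

# Ignoring W: compare Y across X in the whole sample.
pooled_gap = mean_difference(records)

# Holding W constant: make the same comparison inside each W subgroup.
by_w = defaultdict(list)
for row in records:
    by_w[row[0]].append(row)
within_gaps = {w: mean_difference(rows) for w, rows in by_w.items()}

print("pooled gap:", pooled_gap)      # 10
print("within-W gaps:", within_gaps)  # {'low': 0, 'high': 0}
```

In this made-up sample the pooled comparison suggests that X goes with higher Y, but within each value of W the gap disappears, which is what a spurious relationship looks like once W is held constant.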

2:40

If we only have a limited number of right-hand-side variables, that is, we have an X variable and then perhaps at most one or two omitted variables W that we've measured and want to control for, then we can do our control by tabulation. Let's look at a simple example, where our Y variable is the crude death rate and our X variable is race, taking on only two values in this example, black and white. We're looking at the United States in 2014.

3:19

The overall crude death rate of blacks was actually lower than that of whites by a substantial margin. This might give us the incorrect and misleading impression that somehow, with respect to mortality, blacks are better off than whites in the United States. This is contrary to common sense given what we know about the socioeconomic differences between blacks and whites in the United States. It turns out that what we really need to do is control for age. There are big differences between the average ages of blacks and whites: whites in the United States tend to be older than blacks.

3:59

And so, when we control for age and make comparisons between the death rates of blacks and whites within age groups, that is, among people of roughly similar age, at every age up to age 85 blacks actually have higher death rates. The only exception is age 85 and above, but for reasons we can't get into here, that actually reflects problems with the recording of death rates above age 85. So if we average the difference in the death rates between blacks and whites across the age groups, it turns out that there's an overall white advantage: again, controlling for age, white death rates are lower than black death rates. Now this is a straightforward example because, as we said, we had a very limited number of right-hand-side variables: just one X variable that we were especially interested in, race, which took on only two values, and age, which was easy to subdivide meaningfully into a small number of categories.
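The arithmetic behind this kind of reversal can be sketched with made-up numbers. The counts below are hypothetical and are NOT the 2014 US figures; the two groups, the two age bands, and the function names are assumptions for illustration. The last step uses direct standardization on the combined population, which is one common way to average across age groups; the lecture does not say which averaging method it used.

```python
# Hypothetical (population, deaths) by age band. NOT real 2014 US data.
data = {
    "group_a": {"under_65": (80_000, 160), "65_plus": (20_000, 600)},    # younger population
    "group_b": {"under_65": (50_000, 75), "65_plus": (50_000, 1_250)},   # older population
}

def per_100k(deaths, population):
    return 100_000 * deaths / population

def crude_rate(strata):
    """Deaths per 100,000 ignoring age."""
    population = sum(p for p, _ in strata.values())
    deaths = sum(d for _, d in strata.values())
    return per_100k(deaths, population)

def age_specific_rates(strata):
    """Deaths per 100,000 within each age band."""
    return {age: per_100k(d, p) for age, (p, d) in strata.items()}

def standardized_rate(strata, standard_pop):
    """Age-specific rates averaged with a shared set of age weights."""
    total = sum(standard_pop.values())
    return sum(per_100k(d, p) * standard_pop[age] / total
               for age, (p, d) in strata.items())

# Shared age weights: here, the two groups' populations combined.
standard_pop = {age: sum(data[g][age][0] for g in data)
                for age in ("under_65", "65_plus")}

crude = {g: crude_rate(s) for g, s in data.items()}
specific = {g: age_specific_rates(s) for g, s in data.items()}
standardized = {g: standardized_rate(s, standard_pop) for g, s in data.items()}

print(crude)         # {'group_a': 760.0, 'group_b': 1325.0}
print(specific)      # group_a is higher in both age bands
print(standardized)  # {'group_a': 1180.0, 'group_b': 972.5}
```

Group A has the lower crude rate only because it is younger. Within each age band, and after weighting both groups by the same age distribution, it has the higher death rate.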

5:08

Now let's just recap what we just did, using a finer gradation for ages. In the United States in 2014, there were 893 deaths per 100,000. Black death rates were 697 per 100,000. Controlling for age, however, black death rates were higher. We can actually visualize this if we look at much more narrowly defined age groups and compare black and white death rates within each age group. The red bars are persistently higher than the blue bars within each age group. So blacks had higher death rates than whites once we control for age.

5:53

So we just talked about tabulation as an approach for situations where there are only a few variables, and the right-hand-side variables, the X and the W variables that we want to control for, take on only a limited number of values. When we have a more complex situation where there are more variables and they tend to take on a wider range of values, then we normally have to engage in some form of regression analysis. Regression analysis essentially looks at the average change in some outcome variable as a function of changes in a right-hand-side variable. Here's a simple example, where we just have life expectancy as a function of per capita GDP.

6:42

Across different countries, we can fit a curve which indicates that there's a systematic relationship between per capita GDP and life expectancy, where every $10,000 increase in per capita GDP is associated with an increase in life expectancy of about three years. Now here we haven't controlled for anything other than per capita GDP. Regression analysis essentially controls for the additional variables that we might be worried about by adding them to the right-hand side of the analysis.
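In equation form, adding a control to the right-hand side might be written as below. This is only a sketch under assumptions: the symbols are introduced here for illustration (Y is the outcome, X the variable of interest, W an additional control, the betas are coefficients, and epsilon is an error term), and a linear form is assumed even though the fitted curve in the lecture need not be a straight line.

```latex
\begin{align}
Y_i &= \beta_0 + \beta_1 X_i + \varepsilon_i
  && \text{(no controls)} \\
Y_i &= \beta_0 + \beta_1 X_i + \beta_2 W_i + \varepsilon_i
  && \text{(controlling for } W\text{)}
\end{align}
```

In the second equation, the coefficient on X describes the average change in Y associated with a one-unit change in X among cases with the same value of W.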

So let's think about some examples of control variables that might appear in a regression analysis. If we have some outcome, say income, measured in dollars, and we're really interested in the effects of education, perhaps measured in years, we might want to introduce control variables like age and work experience into the regression. Both of those will affect education, and they will affect income. The effect of age and work experience on income is probably fairly straightforward to understand. We have to think about the effect of age on education in societies where education has been changing rapidly in recent decades. You can have examples of societies where, because education has been expanding, younger people are actually better educated than older people. In those situations, failing to control for age may lead to unusual results, for example education being inversely associated with income. This is because the best-educated people are actually younger people who, because they are relatively young, are not earning as much as older people who have more seniority. When we introduce a control for age, those effects go away.
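The education, age, and income example can be imitated with simulated data. This is a sketch under invented assumptions: the numbers (a payoff of 1,000 per year of education, 1,500 per year of age, education falling by 0.15 years per year of age) and all names are hypothetical. The control for age is done by removing the linear association with age from both education and income and then regressing what is left, which gives the same coefficient on education as a regression of income on education and age together.

```python
import random
from statistics import mean

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(w, v):
    """What is left of v after removing its linear association with w."""
    slope = ols_slope(w, v)
    intercept = mean(v) - slope * mean(w)
    return [vi - (intercept + slope * wi) for wi, vi in zip(w, v)]

def slope_controlling_for(x, y, w):
    """Coefficient on x in a regression of y on both x and w."""
    return ols_slope(residuals(w, x), residuals(w, y))

rng = random.Random(1)   # fixed seed so the run is repeatable
n = 5000

# Assumed society where schooling expanded: younger people have more education.
age = [rng.uniform(25, 60) for _ in range(n)]
education = [20 - 0.15 * a + rng.gauss(0, 2) for a in age]
# Assumed income process: both education and age (seniority) raise income.
income = [1000 * e + 1500 * a + rng.gauss(0, 5000)
          for e, a in zip(education, age)]

naive = ols_slope(education, income)
controlled = slope_controlling_for(education, income, age)

print("education slope, no controls:", round(naive))
print("education slope, controlling for age:", round(controlled))
```

Without the age control, the education slope should come out negative, because the best-educated people in this simulated society are young and not yet earning much. With the control, it should come out close to the assumed payoff of 1,000 per year of education.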

Now work experience is normally something we want to control for. There is a straightforward effect of work experience on income, and it may be related to education: where people stayed in school longer, they may have less work experience than other people of the same age who left school earlier.

If we are looking at an international comparison like the one that we just showed in the previous slide, we could be thinking about life expectancy as a function of per capita GDP. We might then want to control for things like health expenditures and education to assess whether the effect of per capita GDP on life expectancy is direct, or is actually working through health expenditures and education.

9:56

Finally, if we're looking at, say, marriage, whether or not people get married, as a function of whether or not their parents were ever divorced, then we'd also want to control for education and income. On the one hand, education and income may be associated with whether or not parents divorced. On the other hand, they may influence marriage chances, that is, the likelihood of marriage. Now, there are some limitations that we have to think about if we're conducting a regression analysis or introducing control variables. Basically, if we want to introduce control variables, we have to be able to measure the trait or the characteristic that we're concerned about.

10:35

If we're using secondary data from other sources, say a dataset that we've downloaded from the Internet, then the relevant variables, the ones that we're worried about, may not be available to us. For example, if we've downloaded information on, say, health and education, but we are worried that parental characteristics may affect both of these, it may be that they're just not in the dataset that we downloaded and we can't go back and measure them.

There are other situations where, even if we're collecting data ourselves, there may be some important traits or characteristics that are simply difficult or downright impossible to measure. There may be intangible characteristics of an individual, a neighborhood, or their surroundings that are extremely difficult to measure in a meaningful way in an analysis. For example, we may be worried that the neighborhood that somebody grows up in may influence both their education and their health, and yet we're not really sure which features of the neighborhood really matter, in which case we can't go out and measure them. We just think that neighborhood, in general, matters. So we run into those sorts of problems where we have intangible or difficult-to-measure characteristics.

So overall, because of the possibility that there may be variables out there that we can't even measure, even if we can imagine them, it's still difficult for a straightforward regression analysis that just adds control variables to completely rule out the possibility of a role for an omitted variable.
