0:07

Imagine that we have two variables, X and Y.

Perhaps they are education and health, and

we're interested in whether or not the relationship between them is causal.

The traditional approach to dealing with the omitted variables that might create

a spurious relationship between them is to measure them and then control for

them in the analysis.

0:44

Now, it's important to note that if an omitted variable only affects

the outcome, or only affects the right-hand side variables,

then it won't lead to a spurious relationship if we don't control for it.

There may be other reasons to control for such omitted variables, but

eliminating the possibility of a spurious relationship is not one of them.
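
To make this point concrete, here is a minimal sketch with invented numbers, not data from the lecture. It assumes a hypothetical variable Z that affects only the outcome Y and is unrelated to X. The names `rows` and `gap_by_x` and the coefficients 2 and 5 are illustrative assumptions.

```python
from statistics import mean

# Hypothetical balanced sample: X and Z are crossed evenly, so Z is
# unrelated to X. The outcome is built as Y = 2*X + 5*Z (assumed numbers).
rows = [(x, z, 2 * x + 5 * z) for x in (0, 1) for z in (0, 1) for _ in range(25)]

def gap_by_x(sample):
    """Mean Y among X=1 minus mean Y among X=0."""
    y1 = [y for x, _, y in sample if x == 1]
    y0 = [y for x, _, y in sample if x == 0]
    return mean(y1) - mean(y0)

# Comparison that ignores Z entirely.
gap_ignoring_z = gap_by_x(rows)

# The same comparison made separately within each value of Z.
gap_within_z = {z: gap_by_x([r for r in rows if r[1] == z]) for z in (0, 1)}

print(gap_ignoring_z, gap_within_z)
```

In this constructed sample the gap in Y between the two values of X is the same whether or not Z is held constant, because Z does not vary with X.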

1:11

Imagine that we have some variable X, education, and some outcome variable Y,

health, and we're interested in the relationship between them, but

we're worried that there's some W out there, perhaps parental

characteristics, that affects both of them and might lead to a spurious relationship.

To control for W, we compare values of the outcome Y among subjects with different

values of X to see if there's a systematic relationship between X and Y.

But we do so for subgroups that have identical values of W.

So basically we can imagine dividing our sample into

subgroups within which the values of W are identical.

The parental characteristics are identical, and then within each of these we

look at the relationship between X and Y.

2:08

So we call this holding a variable, in this case W, constant,

because we're looking at the relationship between X and

Y within groups in which the value of W is not changing.

And so we examine whether the relationship between X and

Y holds across different values of W.
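
The idea of holding W constant can be sketched in a few lines. This is only an illustration with invented records: W stands in for parental characteristics, X for education, and Y for health, and Y is constructed to depend on W alone. The names `records` and `mean_y_gap` are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (w, x, y). W shifts both the chance of X=1 and the
# level of Y, while X itself has no effect on Y in this constructed sample.
records = (
    [(0, 0, 50)] * 80 + [(0, 1, 50)] * 20 +
    [(1, 0, 70)] * 20 + [(1, 1, 70)] * 80
)

def mean_y_gap(rows):
    """Mean of Y among X=1 minus mean of Y among X=0."""
    y1 = [y for _, x, y in rows if x == 1]
    y0 = [y for _, x, y in rows if x == 0]
    return mean(y1) - mean(y0)

# Pooled comparison, ignoring W.
pooled_gap = mean_y_gap(records)

# Divide the sample into subgroups with identical W and repeat the comparison.
by_w = defaultdict(list)
for row in records:
    by_w[row[0]].append(row)
within_gaps = {w: mean_y_gap(rows) for w, rows in sorted(by_w.items())}

print(pooled_gap, within_gaps)
```

In the pooled sample X appears related to Y, but within each subgroup of W the gap disappears, which is the pattern a spurious relationship would produce.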

2:40

If we only have a limited number of right hand side variables,

that is, we have an X variable, and then perhaps at most one or

two omitted variables W that we've measured and

want to control for, then we can do our control by tabulation.

Let's look at an example, a simple one, where our Y variable is the crude death

rate and our X variable is race, taking on only two values in this example,

black and white, and we're looking at the United States.

Now if we look at the United States, this was 2014.

3:19

The overall crude death rate of blacks was actually lower than that of whites by

a substantial margin.

This might give us the incorrect or misleading impression that somehow, with respect

to mortality, blacks are better off than whites in the United States.

This is contrary to common sense given what we know about

the socioeconomic differences between blacks and whites in the United States.

And it turns out that what we really need to do is control for age.

There are big differences between the average ages of blacks and whites.

Whites in the United States tend to be older than blacks.

3:59

And so, when we control for

age and make comparisons between the death rates of blacks and

whites within age groups, that is, among people of roughly similar age, at

every age up to age 85, blacks actually have higher death rates.

The only exception is age 85 and above.

But for reasons we can't get into here, that actually reflects

problems with the recording of death rates above age 85.

So if we average the difference in the death rates between blacks and

whites across the age groups, it turns out that there's an overall white advantage.

And again, controlling for age, it seems that white death rates are lower

than black death rates. Now, this is a straightforward example because,

as we said, we had a very limited number of right-hand side variables:

just one X variable that we were especially interested in,

race, which only took on two values,

and age, which was easy to subdivide meaningfully into a small

number of categories.

5:08

Now let's just recap what we just did using a finer gradation for ages.

So in the United States in 2014, white death rates were 893 per 100,000.

Black death rates were 697 per 100,000.

Controlling for age, however, black death rates were higher.

Now, we can visualize this, actually,

if we look at much more narrowly defined age groups.

And we compare black and white death rates within each age group.

So the red bars are persistently higher than the blue bars within each age group.

So blacks had higher death rates than whites once we control for age.
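
The arithmetic behind this reversal can be sketched with a small tabulation. The populations and death counts below are invented for illustration and are not the 2014 figures. Group A is assumed to have a younger age structure than group B, and the age bands, the names `groups` and `rate_per_100k`, and all numbers are hypothetical.

```python
# Hypothetical (population, deaths) by age band for two groups.
groups = {
    "A": {"0-44": (700_000, 1_050), "45-74": (250_000, 2_500), "75+": (50_000, 4_000)},
    "B": {"0-44": (500_000, 500), "45-74": (350_000, 2_800), "75+": (150_000, 10_500)},
}

def rate_per_100k(population, deaths):
    """Deaths per 100,000 people."""
    return deaths * 100_000 / population

# Crude rate: all deaths over the whole population, ignoring age.
crude = {
    name: rate_per_100k(sum(p for p, _ in bands.values()),
                        sum(d for _, d in bands.values()))
    for name, bands in groups.items()
}

# Age-specific rates: the same comparison within each age band.
age_specific = {
    name: {band: rate_per_100k(p, d) for band, (p, d) in bands.items()}
    for name, bands in groups.items()
}

# Average the A-minus-B difference across age bands, as in the lecture.
bands = list(groups["A"])
avg_gap = sum(age_specific["A"][b] - age_specific["B"][b] for b in bands) / len(bands)

print(crude)
print(age_specific)
print(avg_gap)
```

In this made-up table group A has the lower crude rate but the higher rate in every age band, because a larger share of group B is in the oldest band where death rates are highest.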

5:53

So we just talked about tabulation as an approach where there are only

a few variables, and the right-hand side variables, the X and the W variables

that we want to control for, take on only a limited number of values.

If we have a more complex situation where there are more

variables and they tend to take on a wider range of values,

then we normally will have to engage in some form of regression analysis.

So regression analysis essentially looks at the average change in some

outcome variable, as a function of changes in a right-hand side variable.

So here's a simple example,

where we just have life expectancy as a function of per capita GDP

6:42

in different countries, and we can fit a curve which

indicates that there's a systematic relationship between

per capita GDP and life expectancy, where every $10,000 increase in per capita GDP

is associated with an increase in life expectancy of about three years.

Now here we haven't controlled for anything other than the per capita GDP.

So regression analysis essentially controls for

these additional variables that we might be worried about

by adding them to the right hand side of the analysis.
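
As a rough sketch of what adding a variable to the right-hand side does, the following simulation compares the slope on X with and without a control W. Everything here is assumed for illustration: the data are simulated, the true coefficients (1 for X, 2 for W) are chosen arbitrarily, and the helper names are hypothetical. The two-regressor slope is computed from the ordinary least squares normal equations written out by hand.

```python
import random

def centered(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def slope_simple(x, y):
    """OLS slope of Y on X alone (with an intercept)."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)

def slope_controlling(x, w, y):
    """OLS coefficient on X when W is also on the right-hand side."""
    xc, wc, yc = centered(x), centered(w), centered(y)
    sxx, sww, sxw = dot(xc, xc), dot(wc, wc), dot(xc, wc)
    sxy, swy = dot(xc, yc), dot(wc, yc)
    return (sxy * sww - swy * sxw) / (sxx * sww - sxw ** 2)

# Simulated data (assumption): W affects both X and Y.
rng = random.Random(0)
n = 5000
w = [rng.gauss(0, 1) for _ in range(n)]
x = [wi + rng.gauss(0, 1) for wi in w]
y = [1.0 * xi + 2.0 * wi + rng.gauss(0, 1) for xi, wi in zip(x, w)]

naive = slope_simple(x, y)             # absorbs part of W's effect
adjusted = slope_controlling(x, w, y)  # close to the assumed value of 1

print(round(naive, 2), round(adjusted, 2))
```

Under these assumptions the slope without the control is roughly twice the true coefficient, and including W brings the estimate back toward 1.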

So let's think about some examples

of control variables that might appear in a regression analysis.

So if we have some outcome say income, and

we're really interested in the effects of education, perhaps it's measured in years,

and income is measured in dollars, we might want to introduce in a regression,

control variables like age and work experience.

Both of those will affect education, and

they will affect income. The effect of age and

work experience on income is probably fairly straightforward to understand.

We have to think about the effect of age on education in societies where

education has been changing rapidly in recent decades.

So you can have examples of societies where, because education has been expanding,

younger people are actually better educated than older people.

In those situations, failing to control for age may lead to unusual results, for

example education being inversely associated with income.

This is because the best educated people are actually younger people who, because they

are relatively young, are not earning as much as older people who have more seniority.

So when we introduce a control for age, those effects go away.
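
A toy example shows how this reversal can arise. The six records below are invented: an older cohort with less schooling and higher income, and a younger cohort with more schooling and lower income. The cohort labels, units, and the name `slope` are assumptions for illustration only.

```python
def slope(pairs):
    """Least-squares slope of income on years of education."""
    n = len(pairs)
    mean_e = sum(e for e, _ in pairs) / n
    mean_i = sum(i for _, i in pairs) / n
    cov = sum((e - mean_e) * (i - mean_i) for e, i in pairs)
    var = sum((e - mean_e) ** 2 for e, _ in pairs)
    return cov / var

# Hypothetical (years of education, income in thousands) by age cohort.
cohorts = {
    "older": [(8, 50), (10, 54), (12, 58)],
    "younger": [(12, 30), (14, 34), (16, 38)],
}

# Ignoring age: pool everyone together.
pooled = slope([p for pairs in cohorts.values() for p in pairs])

# Controlling for age: fit the same slope within each cohort.
within = {name: slope(pairs) for name, pairs in cohorts.items()}

print(pooled, within)
```

Pooled across cohorts, education looks inversely associated with income, while within each cohort an extra year of education goes with higher income.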

Now, work experience is normally something we want to control for.

There is a straightforward effect of work experience on income.

And it may be related to education.

If people stayed in school longer, then they may have less work

experience than other people of the same age who left school earlier.

If we're looking at an international comparison like the one that we just showed in

the previous slide, we could be thinking about life expectancy as a function of

per capita GDP.

And then we might want to control for things like health expenditures and

education to assess whether the effect of per capita GDP on life expectancy

is direct, or is actually working through health expenditures and education.

9:56

Finally, if we're looking at, say, marriage, whether or not people get married,

as a function of whether or not their parents were ever divorced,

then we'd also want to control for education and income.

Because education and income, on the one hand, may be associated with whether or

not parents divorced.

On the other hand,

they may influence marriage chances, that is, the likelihood of marriage.

Now, there are some limitations that we have to think about

if we're conducting a regression analysis

or introducing control variables.

So basically, if we want to introduce control variables, we do have to be able

to measure the trait or the characteristic that we're concerned about.

10:35

If we're using secondary data from other sources,

say a dataset that we've downloaded from the Internet, then

the relevant variables, the ones that we're worried about, may not be available to us.

So for example, if we've downloaded information on, say, health and education,

but we are worried that parental characteristics may affect both of these,

it may be that that's just not in the data set that we downloaded and

we can't go back and measure it.

Now there are other situations where if we're collecting data ourselves,

then there may be some important traits or characteristics that may be simply

difficult or downright impossible to measure.

So there may be intangible characteristics of an individual,

a neighborhood, or their surroundings that are extremely difficult to measure in

a meaningful way in an analysis.

For example, you may be worried that somehow the neighborhood that somebody

grows up in may influence both their education and their health.

And yet we're not really specifically sure about what are the features of

the neighborhood that really matter in which case we can't go out and

measure them.

We just think that neighborhood in general, matters.

So we run into those sorts of problems where we have intangible or

difficult to measure characteristics.

So overall, because of the possibility that there may be variables out there

that we can't even measure, even if we can imagine them,

with a straightforward regression analysis, just adding control

variables,

it's still difficult to completely rule out the possibility of a role for

an omitted variable.