0:00

Next we discuss model selection, which is the science

and art of picking variables for a multiple regression model.

We're going to talk about stepwise model selection methods,

based on criteria of p-values, or adjusted R squared.

And we're also going to mention briefly at the end,

that sometimes we might just pick variables based on expert opinion.

0:22

One stepwise model selection method is backwards elimination.

Here, we start with a full model, that is, a model with all possible covariates or

predictors included, and then we drop variables one

at a time until a parsimonious model is reached.

On the other hand, we could also

do forward selection, where we start basically with an

empty model, and then we add variables one

at a time until a parsimonious model is reached.

There are many criteria for model selection.

We're going to be focusing on p-values and adjusted R squared.

However, other model selection criteria that you might

hear of are AIC, the Akaike Information Criterion;

BIC, the Bayesian Information Criterion;

DIC, the Deviance Information Criterion;

Bayes factors; and Mallows' Cp.

There are many others that you can stumble upon as

well, but these tend to be the most commonly used ones.

The latter ones that we listed are beyond the scope of this course, though.

1:20

Let's start with backwards elimination using the adjusted R squared method.

Here, we start with the full model, the model with all possible predictors.

We drop one variable at a time, and record adjusted R squared of each smaller model.

Then, we pick the model with the highest increase in adjusted R squared.

We repeat until none of the models yield an increase in adjusted R squared.
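
The steps just described can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: `backward_eliminate`, `toy_score`, and the `TOY_GAIN` numbers are made up for illustration and stand in for "fit the model and read off its adjusted R squared". They are not the lecture's data.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Score = Callable[[FrozenSet[str]], float]


def backward_eliminate(predictors: Iterable[str], score: Score) -> Tuple[List[str], float]:
    """Drop one predictor at a time for as long as the score keeps increasing."""
    current = frozenset(predictors)
    best = score(current)  # score of the full model
    while current:
        # Score every model that has exactly one predictor removed.
        trials = [(score(current - {p}), p) for p in sorted(current)]
        top_score, dropped = max(trials)
        if top_score <= best:
            break  # no smaller model improves the score, so we stop
        current, best = current - {dropped}, top_score
    return sorted(current), best


# Hypothetical stand-in for adjusted R squared: each predictor adds a made-up
# gain, and every predictor in the model costs a fixed penalty.
TOY_GAIN = {"x1": 0.200, "x2": 0.030, "x3": 0.012, "x4": 0.002}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(TOY_GAIN[p] for p in subset) - 0.01 * len(subset)


selected, final_score = backward_eliminate(TOY_GAIN, toy_score)
print(selected, round(final_score, 3))
```

With real data, `score` would refit the regression on the given subset and return its adjusted R squared; the loop itself would not change.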

1:44

Let's work through an example using the dataset from earlier,

predicting kids' cognitive scores

from mom's high school status, mom's IQ score, whether

or not the mom worked during the first

three years of the kid's life, and mom's age.

The adjusted R squared for the full model is 20.98%.

Next, we try

removing each one of the variables one at a time.

So for example, here we've removed the high

school status, and the adjusted R squared is 20.27%.

This is not an increase over what we had started with.

So, we know that it's not a good idea to move to this model.

We can also try removing mom's IQ and we get

a really low adjusted R squared if we do that.

It must be that the IQ variable is very

important for the prediction of the kid's cognitive score.

2:38

We can also try removing mom's work status and

that gives us an adjusted R squared of 20.95%.

Still not an increase from the original full model.

And lastly, let's try removing mom's age at birth of the child,

and we can see that the adjusted R squared has actually increased.

We had started with 20.98% and now we are at 21.09%.

A tiny increase, but still an increase, so we know that in the first step we need

to pick the model where we're predicting kid's score from high school status, IQ,

and work status of the mother.
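
Why can dropping a variable raise adjusted R squared at all? A small sketch of the usual formula, 1 - (1 - R squared)(n - 1)/(n - k - 1), with k predictors and n observations, shows the trade-off. The values of `n`, `k`, and R squared below are hypothetical and are not taken from the kids' scores data.

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R squared for a model with k predictors fit to n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical numbers: removing one predictor costs only a sliver of R squared,
# but the model is no longer penalized for carrying that predictor.
full = adjusted_r_squared(0.250, n=100, k=4)
smaller = adjusted_r_squared(0.249, n=100, k=3)
print(round(full, 4), round(smaller, 4))
print("smaller model preferred:", smaller > full)
```

When the dropped variable explains very little, the lighter penalty outweighs the lost R squared, which is the kind of tiny increase seen in the example.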

Next, we move on to the second step, where

we once again try removing each one of the variables

one at a time, and we can see that none

of these options actually yield an increased adjusted R squared.

Therefore, our final result is going to be the model that predicts kids'

cognitive test score from mom's high school status,

mom's IQ, and the work status of the mother.

3:41

In backwards elimination using the p-value method, we once again start with the full

model, then we drop the variable with the highest p-value and refit a smaller model.

We repeat this until all variables left in the model are significant.
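
Here is a minimal sketch of that loop in Python. The names `backward_eliminate_p` and `TOY_P` are hypothetical, the p-values in the lookup table are made up, and a 5% significance level is assumed. In practice the `p_values` callable would refit the regression and return the p-value of each remaining predictor.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List

PValues = Callable[[FrozenSet[str]], Dict[str, float]]


def backward_eliminate_p(predictors: Iterable[str], p_values: PValues,
                         alpha: float = 0.05) -> List[str]:
    """Drop the predictor with the highest p-value until all that remain are significant."""
    current = frozenset(predictors)
    while current:
        fitted = p_values(current)  # one refit per round
        worst = max(sorted(current), key=lambda p: fitted[p])
        if fitted[worst] < alpha:
            break  # every remaining predictor is significant
        current = current - {worst}
    return sorted(current)


# Made-up p-values for each model that the loop will visit.
TOY_P = {
    frozenset({"x1", "x2", "x3", "x4"}): {"x1": 0.001, "x2": 0.020, "x3": 0.300, "x4": 0.600},
    frozenset({"x1", "x2", "x3"}): {"x1": 0.001, "x2": 0.015, "x3": 0.250},
    frozenset({"x1", "x2"}): {"x1": 0.001, "x2": 0.010},
}

kept = backward_eliminate_p(["x1", "x2", "x3", "x4"], lambda model: TOY_P[model])
print(kept)
```

Note that only one model is fit per round, because the summary of the current model already tells us which variable to drop.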

To give an example, here is our full model, where we were predicting the kids'

cognitive score from the four predictors, and the

variable with the highest p-value is mom's age.

So in the first step, we remove mom's age from the model and refit the model

using only high school status, IQ, and work.

Now we can see that mom's work status has a non-significant p-value,

and therefore we would remove that from the model as well

and refit the model one more time with simply high school status

and the IQ score. We can see that now both

of these predictors have significant p-values, so we would stop here.

As you can see, we ended up with a slightly different

model using the p-value approach versus the adjusted R squared approach.

And this is not unexpected.

We would expect to get very similar models but not

necessarily exactly the same model, because our decision criteria are different.

5:00

Let's take a look at another example for practice.

The following model uses data from the American Community

Survey to predict income from hours worked per week,

race, and gender.

Which variable, if any, should be dropped from the

model first when doing backwards elimination using the p-value approach?

5:21

Hours worked has a tiny p-value so we would

certainly not drop that from our model, and similarly for gender.

And the other variable that's in the model is our race variable.

And we need to consider this variable all at once

because we can't simply drop one level of an existing variable.

And because at least one of the levels of this variable has a

significant p-value (you can see that for Asian we're seeing a tiny p-value),

we would actually keep this variable in the model as well.

Therefore, we don't drop any variables here.

5:55

This is an important point, so let's repeat that.

If you have a categorical variable with multiple levels, you cannot

drop some of the levels of that variable and keep others.

You need to decide either to keep the entire variable

as a whole or to drop it as a whole. In

this case, because there is at least one level that has

a small p-value, meaning that there is some significance there,

we would actually keep the entire variable.

If all of the levels of the variable

had high p-values, such that there

wouldn't be any levels that are significant, then

we would drop the entire variable as a whole.
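
The keep-or-drop rule for a variable with several levels can be sketched like this. Everything in the example table is hypothetical: the term names, the p-values, and the extra `region` variable are made up to show both outcomes, and a 5% significance level is assumed.

```python
from typing import Dict, List


def variables_to_keep(term_p_values: Dict[str, float],
                      term_to_variable: Dict[str, str],
                      alpha: float = 0.05) -> List[str]:
    """Keep a variable if at least one of its coefficient terms is significant."""
    lowest_p: Dict[str, float] = {}
    for term, p in term_p_values.items():
        variable = term_to_variable[term]
        lowest_p[variable] = min(p, lowest_p.get(variable, 1.0))
    return sorted(v for v, p in lowest_p.items() if p < alpha)


# Made-up p-values, one per row of a regression summary table.
TERM_P = {
    "hrs_work": 0.0001,
    "race:black": 0.4700, "race:asian": 0.0004, "race:other": 0.2000,
    "gender:female": 0.0001,
    "region:north": 0.6000, "region:south": 0.3500,
}
# Map each row back to the variable it belongs to.
TERM_VARIABLE = {term: term.split(":")[0] for term in TERM_P}

kept = variables_to_keep(TERM_P, TERM_VARIABLE)
dropped = sorted(set(TERM_VARIABLE.values()) - set(kept))
print("keep:", kept)
print("drop:", dropped)
```

The decision is made per variable, never per level: `race` stays as a whole because one of its levels is significant, while the hypothetical `region` goes as a whole because none of its levels is.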

6:39

So we talked about two approaches, adjusted R squared versus p-value.

We even mentioned that sometimes they yield slightly different results.

Then how do we know which one to use?

We use the p-value approach if what we're interested

in is finding out which predictors are statistically significant.

On the other hand, if we're interested in more reliable predictions

from our model, we want to use the adjusted R squared method.

7:06

The p-value method depends on the somewhat arbitrary 5%, or

whatever other percent you use for your significance level cutoff.

And if you use a different significance level,

you're going to end up with a different model.

It's used more commonly though, since it requires fitting fewer models.

Remember, at each stage of the adjusted R squared

method, we dropped one variable at a time

and refit a bunch of models to determine which one to go with, whereas

in the p-value approach, you simply drop

the variable with the highest p-value and proceed.

And it's the more commonly used approach because it's easier to implement.
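
To get a feel for the difference in effort, we can count model fits. This is a rough sketch: the two helper functions are hypothetical, and the counts are derived from the backwards elimination walk-throughs earlier (four predictors, one variable dropped under adjusted R squared, two dropped under p-values) rather than stated in the lecture.

```python
def fits_adjusted_r_squared(k: int, drops: int) -> int:
    """Models fit by backward elimination on adjusted R squared.

    k is the number of predictors in the full model and drops is how many
    variables end up removed. The round after the last drop must also be fit
    in full before we can see that nothing improves.
    """
    full_model = 1
    candidate_models = sum(k - i for i in range(drops + 1))
    return full_model + candidate_models


def fits_p_value(drops: int) -> int:
    """Models fit by backward elimination on p-values: the full model plus one refit per drop."""
    return 1 + drops


adj_r2_fits = fits_adjusted_r_squared(k=4, drops=1)  # full model, then 4 fits, then 3 fits
p_value_fits = fits_p_value(drops=2)                 # full model, then 2 refits
print(adj_r2_fits, p_value_fits)
```

The gap widens quickly as the number of candidate predictors grows, which is one reason the p-value approach is the easier one to carry out by hand.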

However, because it relies on this

arbitrary significance level cut-off, it might

be more favorable to use the adjusted R squared method for model selection.

7:56

Let's now talk about forward selection.

We start with single predictor regressions

of response versus each explanatory variable.

We then pick the model with the highest

adjusted R squared, add the remaining variables one

at a time to the existing model and

pick the model with the highest adjusted R squared.

We repeat until the addition of any of the other

remaining variables does not result in a higher adjusted R squared.
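
Forward selection is the same idea run in the other direction. The sketch below is only illustrative: `forward_select` and `toy_score` are hypothetical names, and the made-up `TOY_GAIN` numbers stand in for fitting each candidate model and reading off its adjusted R squared.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Score = Callable[[FrozenSet[str]], float]


def forward_select(candidates: Iterable[str], score: Score) -> Tuple[List[str], float]:
    """Add one predictor at a time for as long as the score keeps increasing."""
    remaining = set(candidates)
    chosen: List[str] = []
    best = float("-inf")  # the first round always picks the best single predictor
    while remaining:
        # Score the current model extended by each remaining candidate.
        trials = [(score(frozenset(chosen) | {p}), p) for p in sorted(remaining)]
        top_score, added = max(trials)
        if top_score <= best:
            break  # no addition improves the score, so we stop
        chosen.append(added)
        remaining.remove(added)
        best = top_score
    return chosen, best


# Hypothetical stand-in for adjusted R squared: each predictor adds a made-up
# gain, and every predictor in the model costs a fixed penalty.
TOY_GAIN = {"x1": 0.200, "x2": 0.030, "x3": 0.012, "x4": 0.002}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(TOY_GAIN[p] for p in subset) - 0.01 * len(subset)


order_added, final_score = forward_select(TOY_GAIN, toy_score)
print(order_added, round(final_score, 3))
```

The returned list records the order in which variables entered the model, strongest single predictor first.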

Let's illustrate this with an example using the cognitive test scores data.

We start with four simple linear regressions,

one for each of the candidate predictors in

our data set, and then we pick the model with the highest adjusted R squared.

So the first variable that we're going to be

adding to our model is going to be mom's IQ.

In the next step we then try the remaining

3 variables and once again pick the model with the

highest adjusted R squared and that's going to be mom's

IQ with the addition now of mom's high school status.

And in the next step we once again try the 2 remaining

variables, and if there's an increase in the adjusted R squared, which there is,

then we move on to the more complicated model.

Lastly, we try the full model, but the adjusted R squared does not

go up, therefore we're going to stick with the model in step three.

And note that we arrived at the same model, whether

we went backwards or forwards, using the adjusted R squared criterion.

9:28

To do forward selection using p-values, we start with

single predictor regressions of response versus each explanatory variable.

We then pick the variable with the lowest significant p-value.

And we add the remaining variables one at a time to the

existing model, and pick the variable with the lowest significant p-value again.

We repeat until none of the remaining

variables has a significant p-value.
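
A minimal sketch of forward selection with p-values follows. The function names are hypothetical, a 5% significance level is assumed, and `toy_p_value` is a made-up rule (not a real regression) that simply makes each candidate look a little weaker as the model grows.

```python
from typing import Callable, Iterable, List, Tuple

PValueWhenAdded = Callable[[Tuple[str, ...], str], float]


def forward_select_p(candidates: Iterable[str], p_value_when_added: PValueWhenAdded,
                     alpha: float = 0.05) -> List[str]:
    """Add the candidate with the lowest p-value until none of them is significant."""
    remaining = set(candidates)
    chosen: List[str] = []
    while remaining:
        # p-value each candidate would have if added to the current model.
        trials = [(p_value_when_added(tuple(chosen), p), p) for p in sorted(remaining)]
        lowest_p, added = min(trials)
        if lowest_p >= alpha:
            break  # no remaining candidate is significant
        chosen.append(added)
        remaining.remove(added)
    return chosen


# Made-up baseline p-values for each candidate in a single-predictor model.
TOY_BASE_P = {"x1": 0.001, "x2": 0.010, "x3": 0.040, "x4": 0.500}


def toy_p_value(model: Tuple[str, ...], candidate: str) -> float:
    # Hypothetical rule: the p-value grows with the size of the existing model.
    return min(1.0, TOY_BASE_P[candidate] * (1 + len(model)))


order_added = forward_select_p(TOY_BASE_P, toy_p_value)
print(order_added)
```

Here `x3` looks significant on its own but is no longer significant once `x1` and `x2` are in the model, so it never gets added.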

9:54

We talked about algorithmic ways of

doing model selection. However, sometimes variables can

be included in, or eliminated from the model based on expert opinion as well.

For example, if you're studying a certain variable

you might choose to leave that variable in the

model regardless of whether it's significant or whether it

would yield a higher adjusted R squared or not.

So to wrap things up, let's finally fit our final model.

Remember, we had selected the variables mom's high school status,

mom's IQ, and mom's work. If we take a look at the summary output, we can

see that the variables mom's high school status, and

mom's IQ are statistically significant at the 5% level.

And mom's work status is not, but remember, we selected

this model using the adjusted R squared method, which tells

us that including that variable actually gives the model higher

predictive power even though the variable may not be statistically significant.