On the other hand, if you can give a person one kind of aspirin and
then later on give them a different kind of aspirin when they have another
headache, that would compare each person to themselves, right?
You'd be blocking on the person, so to speak.
So that's a design strategy.
Now there's some nuance with this design strategy as well.
What happens if there's some residual effect of the first aspirin when you
give the second one?
So maybe you could handle that with some sort of wash-out period, perhaps a long one.
But anyway, the point of that design is that you're comparing people
with themselves, so everything that's intrinsic to the person is
controlled for across the time periods, since each person receives both treatments.
Maybe you would randomize the order in which they received them; that's
called a crossover design.
At any rate, the broader point that I'm trying to make is, it's often the case
that good thoughtful experimental design can really eliminate the need for
some of the main considerations that you would have to go through in model
building if you were to just collect data in an observational fashion.
The last thing I would say is there's one automated model search
technique that I like quite a bit and find very useful, and
it's the idea of looking at nested models.
So, I'm often interested in a particular variable and I'm very
interested in how the other variables that I've collected will impact it.
So, I'm interested in a treatment or something like that.
Some important variable, but I'm worried that my treatment groups are
imbalanced with respect to some of these other variables.
So what I'd like to look at is the model that just includes the treatment by
itself; then the model that includes the treatment and, let's say, age,
if the ages weren't really balanced between the two treatment groups;
and then one that adds gender, if maybe the genders between the two
groups weren't really balanced; and so on.
And this idea of creating nested models, where every successive model
contains all the terms of the previous one,
leads to a very easy way of testing each successive model.
And these nested model examples are very easy to do, so I'm just
going to show you some code right here on how you do nested model testing in R.
So I fit three linear models to the swiss dataset;
the first one just includes Agriculture.
Let's pretend that that's the variable we're interested in, and
then the next one includes Agriculture plus Examination and Education.
I put both of those in,
because I'm thinking they're kind of measuring the same thing.
But now, after this lecture, I'm concerned over the possibility that they're
measuring too much of the same thing, but let's put that aside for the time being.
And then the third model includes all the terms: Agriculture + Examination +
Education + Catholic + Infant.Mortality.
So now, I have three nested models and I'm interested in seeing what happens to
my effect as I go through those three models.
The point being, in this case, you can test whether or not the inclusion of
the additional set of extra terms is necessary with the ANOVA function.
So I do anova(fit1, fit3, fit5).
That's what I named them: one, three, five.
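The code being narrated might look like the following sketch; the model formulas are reconstructed from the descriptions above, and Fertility as the outcome is an assumption (it's the conventional outcome for the swiss dataset, which ships with base R):

```r
# The swiss dataset is included with base R
data(swiss)

# Three nested models: each successive model contains
# all the terms of the previous one
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- lm(Fertility ~ Agriculture + Examination + Education, data = swiss)
fit5 <- lm(Fertility ~ Agriculture + Examination + Education +
             Catholic + Infant.Mortality, data = swiss)

# Sequential tests: model 1 vs model 2, then model 2 vs model 3
anova(fit1, fit3, fit5)
```

The anova output lists the three models with their residual degrees of freedom, residual sums of squares, and a test for each successive comparison.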
And then you see down here, what you get is a listing of the models.
Model 1, model 2, model 3 and then it gives you the degrees of freedom.
That's the number of data points minus the number of parameters that it had to fit.
The residual sums of squares, and then Df, the excess degrees of freedom
going from model 1 to model 2, and then from model 2 to model 3.
So we added two parameters going from model 1 to model 2, that's why that Df is
2 and then we added two additional parameters going from model 2 to model 3.
So the two parameters we added going from model 1 to model 2 were
Examination and Education; they're two regression coefficients.
Going from model 2 to model 3, we added Catholic and
Infant.Mortality, another two regression coefficients.
With these residual sums of squares and the degrees of freedom,
you can calculate the so-called F statistic, and thus get a P value.
The output gives you the F statistic and the P value associated with each
comparison, and here it shows that yes, the inclusion of Examination and
Education appears to be necessary when we're just looking at Agriculture by itself.
Then I look at the next one and it says, yes, the inclusion of Catholic and
Infant.Mortality appears to be necessary beyond just
including Examination, Education and Agriculture.
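As a sketch of what anova is doing under the hood, the F statistic for one of these comparisons can be computed by hand from the residual sums of squares and degrees of freedom. The model formulas here are reconstructed from the lecture, and Fertility as the outcome is an assumption:

```r
data(swiss)
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- lm(Fertility ~ Agriculture + Examination + Education, data = swiss)

rss1 <- deviance(fit1)     # residual sum of squares, smaller model
rss3 <- deviance(fit3)     # residual sum of squares, larger model
df1  <- df.residual(fit1)  # data points minus fitted parameters
df3  <- df.residual(fit3)

# F = [(drop in RSS) / (excess df)] / (RSS of larger model / its residual df)
fstat <- ((rss1 - rss3) / (df1 - df3)) / (rss3 / df3)
pval  <- pf(fstat, df1 - df3, df3, lower.tail = FALSE)
```

These should match the F value and Pr(>F) that anova(fit1, fit3) reports for the same pair of models.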
So if the way in which you're interested in looking at your data naturally
falls into a nested model search, as it often does
when you're interested in one specific variable,
as in this case, then I think some kind of nested model search
is a reasonable thing to do and a pretty natural way of
thinking about the series of analyses.
It doesn't work if the models that you're looking at aren't nested.
For example, if model 2 had Examination but not Education, and
model 3 had Education but not Examination,
this wouldn't apply; you'd have to do something else.
And there, I think you get into the harder world of automated model selection with
things like information criteria.
So I would put all that stuff off to our prediction class and
just leave you with this one technique that's useful in the one specific instance
where you've decided to look along a series of models,
each increasingly more complicated, but each including the previous one.
So, I hope in this lecture that you've gotten a couple of model selection
techniques that you can use.
I hope you've also learned that there are some basic consequences that occur, if you
include variables that you shouldn't have or exclude variables that you should have.
This has consequences for the coefficients that you're interested in,
and consequences for your residual variance estimate.
We didn't even touch on some other aspects of model misspecification that could
occur, such as absence of linearity, non-normality and so on.
So again, it's generally necessary to take your model with a grain of salt,
because more than likely some aspect of your model is wrong.
And I'll leave you then with this famous quote by George Box, who very
famously said, "All models are wrong, but some are useful."
And I think that's a very good credo to go along with: yes, for
sure your model is wrong, but it might be useful in the sense of being
a lens to teach you something useful and true about your data set.