Hello and welcome to Lesson 1.
This lesson is going to introduce a very important concept called overfitting.
And it's going to discuss why overfitting is bad and the issues that might arise if you don't account for it.
Overfitting is effectively the bane of machine learning and data analytics.
With overfitting, your model learns the training data too well and
fails to generalize when it is applied to new, unseen data.
This may give you overconfidence in the accuracy of your predictions.
It may lead you to make predictions which are unfounded and wrong.
And, when you think about the effect algorithms are starting to have on our daily life,
the consequences can be real and painful.
So, it's extremely important that you learn about
overfitting, and that you always keep in
the back of your mind that a model which looks
really good may actually be overfit.
There are several readings that I want you to go through for this particular lesson.
They all center on the idea of overfitting:
what it is, and what problems can arise from it.
This particular article talks about different issues with data, and the problems that can
arise when you are analyzing data without
accounting for overfitting, and the impact that can have.
I encourage you to read through it and think carefully about it,
because it's really going to help drive
home the issues that can come about with overfitting.
Second, there's a Wikipedia article and I really like this article.
It's a very easy article to read.
Sometimes that's not the case with Wikipedia,
but this one is genuinely approachable,
and it really demonstrates the nature of overfitting.
So, for instance, here we have
a black line which does a pretty reasonable job of separating the data.
Not perfect, but reasonable.
And then we have the overfit line, in green,
which carves out a complex shape and manages to keep a 100 percent perfect prediction on the training data.
Any time you start to see models get really complex,
that's when you start to think overfitting may be a problem.
Third, maybe you've heard of XKCD;
if not, you probably should check it out.
It's a wonderful comic strip that demonstrates
important technical or scientific issues with a touch of humor.
So, this one actually shows overfitting, using
the precedents of previous presidential elections to build a model.
So, for instance, in 1788, no one
has ever been elected president before, and then there's a president.
So, now we modify the model:
no incumbent has ever been reelected, until Washington is.
And you just keep going through and it keeps changing,
the model keeps getting more and more complex,
until you basically end up with something which is extremely ridiculous.
And there you have it. Nice little demonstration of overfitting.
The last article is actually going to talk about ways
to identify overfitting and ways to reduce overfitting.
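One standard way to identify overfitting, in the spirit of that article, is to compare accuracy on the training data against accuracy on held-out data: a large gap is the warning sign. Here is a small pure-Python sketch of that idea (my own toy example, with hypothetical data): averaging over more neighbours (k = 9 instead of k = 1) smooths the model and shrinks the gap.

```python
import random

random.seed(1)

# Hypothetical noisy data: true rule is "label is 1 when x > 0.5",
# with 20 percent of labels flipped at random.
def make_data(n):
    data = []
    for _ in range(n):
        x = random.random()
        y = 1 if x > 0.5 else 0
        if random.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data

train = make_data(60)
holdout = make_data(300)

# k-nearest-neighbour majority vote: k=1 memorizes the training set,
# while a larger k averages out the label noise.
def knn_predict(x, data, k):
    nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    return 1 if sum(y for _, y in nearest) * 2 > k else 0

def accuracy(data, k):
    return sum(knn_predict(x, train, k) == y for x, y in data) / len(data)

for k in (1, 9):
    gap = accuracy(train, k) - accuracy(holdout, k)
    print(f"k={k}: train/holdout accuracy gap = {gap:.2f}")
```

The k = 1 model shows a large train/holdout gap (the signature of overfitting), while the smoother k = 9 model's gap is much smaller: same data, same diagnostic, less overfitting.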
Now, you've already seen things like this.
This is a comparison between butter production in Bangladesh,
United States cheese production, and sheep production in Bangladesh and the United States.
And you see that there is this amazing correlation, with a correlation coefficient of
0.99. It looks like it's nearly perfect.
We saw this in the module that talked about
statistics and the idea that correlation is not causation.
It's an example of overfitting.
You need to be wary, when a model looks really good,
of interpreting things from it that really shouldn't be interpreted.
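To see how easily two unrelated series can produce a near-perfect correlation coefficient, here is a short pure-Python sketch of Pearson's r. The numbers below are made up for illustration, not the actual production figures from the article; any two series that both trend steadily upward behave this way.

```python
# Hypothetical figures: two unrelated quantities that both happen to
# trend upward over the same six years.
butter = [310, 330, 355, 370, 400, 420]  # made-up "butter production"
cheese = [5.1, 5.4, 5.8, 6.0, 6.5, 6.8]  # made-up "cheese production"

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson_r(butter, cheese)
print(f"correlation coefficient: {r:.2f}")
```

The coefficient comes out close to 1 even though there is no causal link whatsoever; a shared upward trend is all it takes.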
So, this article talks a lot about different types of overfitting,
things you need to be aware of,
things you need to be careful about, and ways to avoid making
mistakes that can affect not just your own predictions or your job performance,
but the application of your models and how they may be causing problems for others.
So, hopefully you can go through these articles,
learn more about overfitting,
and become more aware of it and of ways to avoid its problems.
If you have any questions let us know. Good luck.