0:01

The concept of an outlier should not be foreign to you at this point.

We've talked about outliers numerous times throughout the course.

However, in this video, we're going to focus

on outliers within the context of linear regression.

And we're going to talk about how to identify various types

of outliers, as well as touch on how to handle them.

In this plot, we can see a cloud of points that are clustered together,

as well as one single point that is far away from the rest of them.

The question is how does this outlier influence the least squares line?

To answer this question, we want to think about where

the line would go if this particular outlier was not there.

And in that case, there would be

absolutely no relationship between the two variables,

because the points are completely randomly scattered, so

the line would look like a horizontal line.

0:53

Therefore without the outlier, there is no relationship between x and

y, and this one single outlier makes it appear as though there is.
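
To make this concrete, here is a minimal sketch in Python that fits the least squares line with and without a single far-away point. The numbers are made up for illustration (they are not the data behind the plot), and `least_squares` is a helper written just for this example.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up cloud with no linear trend, plus one made-up far-away point.
cloud_x = [1, 2, 3, 4, 5]
cloud_y = [2, 4, 3, 4, 2]
outlier = (20, 20)

slope_without, _ = least_squares(cloud_x, cloud_y)
slope_with, _ = least_squares(cloud_x + [outlier[0]], cloud_y + [outlier[1]])

print(f"slope without the outlier: {slope_without:.2f}")
print(f"slope with the outlier:    {slope_with:.2f}")
```

In this toy example the cloud on its own gives a flat line, and adding the one point is enough to produce a steep positive slope.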

There are various types of outliers, and how we handle

them depends on the type.

In general, outliers are points that fall away from the cloud of points.

1:16

Outliers that fall horizontally away from the center of the cloud but

don't influence the slope of the

regression line are called leverage points.

And outliers that actually influence the slope

of the regression line are called influential points.
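
One common way to put a number on how far a point falls horizontally from the center of the cloud is its leverage value. This is not covered in the video, so treat it as a supplementary sketch: for a regression with one explanatory variable, the leverage of point i is 1/n + (x_i - x_bar)^2 / sum((x_j - x_bar)^2). The x values below are made up for illustration.

```python
def leverages(xs):
    """Leverage (hat value) of each point in a one-variable regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Made-up x values: five points close together and one far to the right.
xs = [1, 2, 3, 4, 5, 20]

for x, h in zip(xs, leverages(xs)):
    print(f"x = {x:>2}  leverage = {h:.3f}")
```

Leverage depends only on the x values, which matches the idea that a leverage point is defined by its horizontal position. Whether it is also influential depends on where its y value falls.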

1:34

Usually, these points are high leverage points.

And to determine if a point is influential, we want to

visualize the regression line with and without the point and ask:

does the slope of the line change considerably?

So what type of an outlier is this?

To answer this question, we want to first ask, does this point

fall away from the rest of the data in the horizontal direction?

And the answer is yes, it does.

This makes it a potential leverage point.

But, another question we want to ask is, is it also influential?

Let's try to think about where the line would

go with and without the point.

It appears that the line would stay in exactly the same place.

So, the outlier point is actually on the trajectory of the regression line.

Therefore it does not influence it.

This makes this point a leverage point.
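
Here is a small sketch of that situation, using made-up numbers rather than the data in the plot. We fit the line to a cloud, place a far-away point exactly on that line, and refit. The helper `least_squares` is written just for this example.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up cloud with an upward trend.
cloud_x = [1, 2, 3, 4, 5]
cloud_y = [1, 3, 2, 5, 4]
slope, intercept = least_squares(cloud_x, cloud_y)

# A point far away horizontally, placed exactly on the fitted line.
far_x = 20
far_y = slope * far_x + intercept

new_slope, new_intercept = least_squares(cloud_x + [far_x], cloud_y + [far_y])

print(f"slope without the far point: {slope:.3f}")
print(f"slope with the far point:    {new_slope:.3f}")
```

Because the far point sits on the trajectory of the original line, the refitted slope and intercept come out the same, up to rounding.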

2:31

And what about this one?

Just like with the previous point, this outlying point also falls

away from the rest of the data in the horizontal direction.

So it could simply be a leverage point.

However, it also appears to be influencing the slope of the line.

If we were to remove this point, the line would look considerably different.

In fact, it would look horizontal since otherwise,

there's absolutely no relationship between x and y.

And therefore, we would identify this as an influential point.

When we are trying to decide whether to

leave this data point in the analysis or take

it out, if it's an influential point, we

want to be very careful about leaving it in there,

because it's definitely going to affect our estimates and all of the decisions

that we're going to be making based on the results of the analysis.

Here's another example of influential points.

Here we have light intensity and surface temperature, both

on a log scale, for 47 stars in a star cluster.

We can see that there are two different types of stars, ones

that have a lower temperature and ones that have a higher temperature.

The solid blue line shows us how the regression

model would look if we were to ignore the outliers.

And the red dashed line tells us how the regression

model would look if we were to include the outliers.

Those are the four stars with the lower temperatures.

Obviously, the red dashed line is not a good fit for these data.

So in this case, what we might want to do is actually split our data into two, those

stars that have lower temperature and those stars that

have higher temperature, and model the two groups separately.
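
As a rough sketch of what splitting the data might look like, the example below fits one line to all stars and then separate lines to the cooler and hotter groups. The values and the cutoff of 4.0 are invented for illustration. They are not the actual 47-star data set, and the helper `least_squares` is written just for this example.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Invented (log temperature, log light intensity) pairs, NOT the real stars.
stars = [
    (3.48, 5.9), (3.49, 6.0), (3.50, 6.1), (3.52, 6.2),               # cooler
    (4.2, 4.6), (4.3, 4.9), (4.4, 5.1), (4.5, 5.4), (4.6, 5.6),       # hotter
]

CUTOFF = 4.0  # assumed threshold separating the two groups
cool = [(t, l) for t, l in stars if t < CUTOFF]
hot = [(t, l) for t, l in stars if t >= CUTOFF]

slope_all, _ = least_squares(*zip(*stars))
slope_cool, _ = least_squares(*zip(*cool))
slope_hot, _ = least_squares(*zip(*hot))

print(f"all stars together: slope = {slope_all:.2f}")
print(f"cooler stars only:  slope = {slope_cool:.2f}")
print(f"hotter stars only:  slope = {slope_hot:.2f}")
```

In this invented example, each group has a positive slope when modeled on its own, while the pooled fit has a negative slope that describes neither group.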

4:20

Remember, we don't want to just blindly get rid of outlying

points, because those actually might be the most interesting cases.

Perhaps these stars that are much colder than the

other ones are indeed more interesting to look at.

But we don't want to lump them in with the

stars that have a higher temperature and try to model all of them together.

One last remark on influential points.

Let's take a look at this statement and evaluate whether it's true or false.

Influential points always reduce R squared.

It is true that influential points tend to make life more difficult.

But is it true that they always reduce R squared?

Let's take a look at these two graphs, one in

which we have an influential point and one where we don't.

The first plot does not have an influential point.

And we can see that the regression line looks fairly horizontal,

indicating that there's little to no relationship between x and y.

In the second plot, we have an influential point that is far away from the trajectory

of the original regression line, and hence pulls the regression line toward itself.

In the first plot, the correlation coefficient is very low, just

0.08, and hence R squared is pretty low as well, at 0.0064.

In the second plot, however, all of a sudden, we're seeing an increase in

our correlation coefficient, as well as an

associated increase in our R squared.

So, even though we would never want to fit a linear model in the second plot,

we are actually seeing a much higher correlation and a much higher R squared.
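
The same effect can be reproduced in a few lines of Python. This is only a sketch with made-up numbers, not the data in the two plots, and `correlation` is a helper written just for this example. R squared is the square of the correlation coefficient, which is why a correlation of 0.08 gives an R squared of 0.0064.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Made-up cloud with no linear trend, plus one made-up influential point.
cloud_x = [1, 2, 3, 4, 5]
cloud_y = [2, 4, 3, 4, 2]
influential = (20, 20)

r_without = correlation(cloud_x, cloud_y)
r_with = correlation(cloud_x + [influential[0]], cloud_y + [influential[1]])

r2_without = r_without ** 2
r2_with = r_with ** 2

print(f"without the point: R squared = {r2_without:.4f}")
print(f"with the point:    R squared = {r2_with:.4f}")
```

In this toy example, one added point takes R squared from essentially zero to above 0.9, even though the cloud itself shows no relationship.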

This is a good reminder to always view a scatter plot before fitting a model.

If we were simply deciding on whether or not the model

is a good fit by looking at the correlation coefficient and R

squared, we would never catch the anomaly in the data:

that there is only one influential point driving the entire relationship.