Hello. This lesson is going to introduce
scatter plots as a technique for visually exploring two-dimensional data.
So far, we've been focusing on analyzing
a one-dimensional dataset which is effectively one column in a data frame.
And now, we're going to start looking at comparing two dimensions, or two columns.
We do this to see whether they share a positive,
negative, or even no correlation,
and to identify outliers in any sort of relationship.
All of these concepts are important because when we go to learn from data,
we need to know if there are relationships inherent in the dataset,
as well as how to find outliers, that is, data
points that do not follow the trend that the rest of the data follow.
This lesson will be following the Introduction to Scatter Plots notebook.
Effectively, this notebook will build on
the previous visualization notebooks that we've used in this particular course.
We, of course, start by setting up our notebook to
have all of the visualizations displayed in line,
as well as doing our standard imports and
setting the warning filter to ignore specific warnings.
Now, scatter plots are quite simple.
We're going to display
the one-dimensional vector x against another one-dimensional vector y.
Just as we used the plot method before,
we're going to call a method here, but in this case it's
the scatter method, because that will actually generate a scatter plot.
Plot will connect the points with lines, and
we generally don't want that because it will obscure the underlying data.
So how do we do it? Well, in this case,
we're going to generate some data:
x is linearly spaced between 0 and 100,
and y will be x plus some random noise.
We'll then pass these points to scatter to make a scatter plot, and label our plot.
And what we end up with is this.
And you can see there is a positive correlation: as x increases, y also increases.
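Here is a minimal sketch of the kind of cell just described. It is not the notebook's actual code: the number of points, the noise scale, the random seed, and the label text are all assumptions chosen for illustration, and the axis styling is left out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; in a notebook, use %matplotlib inline instead
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(23)  # arbitrary seed so the example is repeatable

# x: 50 evenly spaced values between 0 and 100 (the count is an assumption)
x = np.linspace(0, 100, 50)
# y: x plus some random noise (the noise scale of 10 is an assumption)
y = x + rng.normal(0, 10, size=x.size)

fig, ax = plt.subplots()
ax.scatter(x, y)  # scatter draws unconnected markers; plot would join them with lines
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Positive correlation")

# Pearson correlation coefficient as a numeric check on what the plot shows
r = np.corrcoef(x, y)[0, 1]
print(f"correlation coefficient: {r:.2f}")
```

A coefficient close to +1 backs up what the picture suggests: as x increases, y increases.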
Now, one thing that you may not have seen before in
the plot is this method here, despine.
We set trim equal to True,
which means to trim away the excess parts of the plot so that the axes here don't meet.
And we also offset them so that we can more easily see the relationship.
You should try playing around with these values to see how that affects the plot.
We've also arranged our x and y tick marks.
Remember, this will start at zero,
it will end before 120,
and we're going to do it in strides of 20.
So we will have 0, 20, 40, 60,
80, and 100 on both the x and y axes.
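The styling steps above might look something like the sketch below. It assumes the ticks are built with NumPy's arange and that despine is the Seaborn function of that name; the offset value of 5 is a guess, so try other values to see how the plot changes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(23)
x = np.linspace(0, 100, 50)
y = x + rng.normal(0, 10, size=x.size)  # noise scale is an assumption

fig, ax = plt.subplots()
ax.scatter(x, y)

# Start at 0, stop before 120, step by 20 -> 0, 20, 40, 60, 80, 100
ticks = np.arange(0, 120, 20)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

# Remove the top and right spines, push the remaining axes outward by
# 5 points (assumed value), and trim them to the first and last tick
sns.despine(ax=ax, offset=5, trim=True)
```

Because arange excludes its stop value, stopping at 120 with a stride of 20 makes 100 the last tick.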
Now, we could also make a dataset that is negatively correlated and display that.
In a negatively correlated dataset,
as one variable increases,
the second one decreases.
The next correlation is a null correlation,
where there is no correlation and this would be a good example of that.
As x increases, y doesn't show any distinct trend.
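To make the negative and null cases concrete, here is a small sketch that generates one dataset of each kind and plots them side by side. The formulas used to build the data (100 minus x plus noise, and uniform random values) are illustrative assumptions, not the notebook's own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(23)
x = np.linspace(0, 100, 100)

# Negative correlation: y falls as x rises (assumed form: 100 - x + noise)
y_neg = 100 - x + rng.normal(0, 10, size=x.size)
# Null correlation: y is drawn independently of x
y_null = rng.uniform(0, 100, size=x.size)

fig, (ax_neg, ax_null) = plt.subplots(1, 2, figsize=(10, 4))
ax_neg.scatter(x, y_neg)
ax_neg.set_title("Negative correlation")
ax_null.scatter(x, y_null)
ax_null.set_title("No correlation")
for ax in (ax_neg, ax_null):
    ax.set_xlabel("x")
    ax.set_ylabel("y")

# Correlation coefficients: near -1 for the first, near 0 for the second
r_neg = np.corrcoef(x, y_neg)[0, 1]
r_null = np.corrcoef(x, y_null)[0, 1]
print(f"negative: {r_neg:.2f}, null: {r_null:.2f}")
```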
Now, one other important point about
a scatter plot is that we can do more than just find correlations;
we can also see data points that lie away from the main trend.
So this particular code cell does that.
It makes a positively correlated dataset,
as x increases, y increases.
But we also have these two data points over
here that are clearly outliers from the trend.
If we were to do some sort of analysis,
we might determine a model that captures this relationship between x and
y while also trying to understand why these outliers are present.
Is it because a machine was incorrectly reporting values?
Is it because a person entered the wrong data accidentally?
Or, is it a potential case of fraud because somebody intentionally
massaged the data to better reflect on themselves?
So looking at data visually
makes it easy to see a trend or to spot outliers, and that's one of
the clear benefits of actually using scatter plots.
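As a rough illustration of the outlier example, the sketch below builds a positively correlated dataset, appends two made-up outlying points, and highlights them. The straight-line fit and the three-standard-deviation cutoff are not from the lesson; they are just one simple, assumed way to flag points that sit far from the trend.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(23)
x = np.linspace(0, 100, 50)
y = x + rng.normal(0, 5, size=x.size)  # noise scale is an assumption

# Append two hypothetical outliers: large x, unexpectedly small y
x_all = np.append(x, [80, 90])
y_all = np.append(y, [10, 5])

# Fit a straight line and measure how far each point sits from it
slope, intercept = np.polyfit(x_all, y_all, 1)
residuals = y_all - (slope * x_all + intercept)

# Illustrative rule of thumb: flag residuals beyond 3 standard deviations
flagged = np.abs(residuals) > 3 * residuals.std()

fig, ax = plt.subplots()
ax.scatter(x_all[~flagged], y_all[~flagged], label="Trend")
ax.scatter(x_all[flagged], y_all[flagged], color="red", label="Possible outliers")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
print(f"flagged {flagged.sum()} points")
```

Flagging a point only tells you where to look; deciding whether it is a faulty sensor, a typo, or something deliberate still takes investigation.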
Now, in the previous examples,
we looked at just one relationship.
In this case, between x and y.
But we can also look at multiple relationships.
So first, we're going to load in the Iris dataset and
compare the sepal length versus the petal length.
We could also look at comparing multiple datasets.
In this case, we're going to look at the sepal versus petal comparison.
And what we've done here,
if we come up and look at our scatter plot,
we are comparing in red,
the sepal length versus the petal length,
and in blue, the sepal width versus the petal width.
So these are two different relationships shown on
the same plot, and we've distinguished them by color coding.
We could actually add in here a legend to indicate those differences.
And we could do that easily if we simply added
a label argument to this particular scatter call saying,
'length comparison' and another label to this one saying,
'width comparison' and then we called 'legend'.
We'll see examples of this in later notebooks.
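A sketch of the labeled version might look like this. It loads the Iris data from scikit-learn, which is an assumption made here because that copy ships with the library and needs no download; the notebook may load it differently, and the column names below are scikit-learn's, not necessarily the notebook's.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Iris measurements as a pandas DataFrame (scikit-learn's bundled copy)
iris = load_iris(as_frame=True).data

fig, ax = plt.subplots()
# Red: sepal length versus petal length
ax.scatter(iris["sepal length (cm)"], iris["petal length (cm)"],
           color="red", label="Length comparison")
# Blue: sepal width versus petal width
ax.scatter(iris["sepal width (cm)"], iris["petal width (cm)"],
           color="blue", label="Width comparison")
ax.set_xlabel("Sepal (cm)")
ax.set_ylabel("Petal (cm)")

# legend() picks up the label given to each scatter call
legend = ax.legend()
```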
We can also compare datasets to trends, and we can also compare multiple scatter plots.
Here is a similar example to the rug plot that we saw.
But in this case, it's actually a scatter plot where we're
trying to see the correlations that might exist.
But perhaps most importantly, when we try to do this,
there is a built in function in Seaborn that creates what's called a pair plot.
I like to think of it as a spreadsheet plot,
in that we are plotting
different columns or features against other features.
So you can see the first column is sepal length,
the second column is sepal width et cetera,
and the first row is sepal length.
Now, the diagonal elements of this array of plots compare a feature with itself, such as sepal length against sepal length.
So the way we represent this
is by actually drawing a histogram instead of a scatter plot.
The off-diagonal elements, then,
are actually the scatter plots.
And so, you can see that it's symmetric.
Any plot that's down here,
is reflected on the other side of the diagonal.
And here we are color coding each plot by
the three different Iris species that are present.
So this shows you it's a really simple way. If we actually go up here and look,
it was one line of code, once we've read in the Iris data frame,
to make this plot.
And it quickly shows the clustering that's present in the data.
The setosa is off by itself,
and the versicolor and the virginica are somewhat separated here.
We also have nice trends,
these nice positive correlations between these variables.
We also have a nice positive correlation
between these two species in this particular plot.
And this one's different. We can also see some outliers in specific examples.
So you can see very quickly this pair plot makes
a very powerful visualization when you're just starting to explore
your dataset in terms of giving you clues
to relationships that you might want to explore in more detail.
So a good example: petal length, petal width.
We can clearly see that sort of linear positive relationship.
I hope this has given you
a nice introduction to the power of scatter plots and the ability to
use these two-dimensional visualizations
to better understand what's going on in your data.
If you have any questions,
be sure to let us know in the class forum. Good luck.