0:00

In this lesson we're going to talk about Pearson correlation,

which is oftentimes referred to as Pearson's r,

Pearson product-moment correlation coefficient, or the bivariate correlation.

And it's a way to determine the correlation between bivariate data,

which means data that has two variables.

But what is correlation?

Well, correlation is a linear relationship, or lack thereof, between two variables.

And Pearson's r is a measure of the strength of that linear correlation.

So we have a nice little graph here to show you some different values for

Pearson's correlation.

Pearson's r can be between -1 and 1, inclusive.

So a negative value implies that there's a negative correlation between the two

variables.

If there is a positive value for Pearson's r,

then there's a positive correlation.

0:56

Then a Pearson's r value of 0 means that there

is no linear correlation between the two variables.

So here we have a perfect correlation of -1.

Here we have a perfect positive correlation of positive 1.

And here we have a negative correlation that's not perfect.

Here we have the positive correlation that's not perfect.

And then here, it's very clear that this set of data points

has no correlation whatsoever.
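
To make those cases concrete, here is a small sketch in plain Python that computes Pearson's r straight from its textbook definition: the summed products of the deviations from the mean, divided by the square root of the product of the summed squared deviations. The function name pearson_r and the tiny data sets are made up for illustration and are not part of the lesson's notebook.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the definition, using deviations from the mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations of x from its mean
    dy = [y - mean_y for y in ys]  # deviations of y from its mean
    co_deviation = sum(a * b for a, b in zip(dx, dy))
    return co_deviation / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

xs = [1, 2, 3, 4]
perfect_positive = pearson_r(xs, [2, 4, 6, 8])  # y rises in lockstep with x
perfect_negative = pearson_r(xs, [8, 6, 4, 2])  # y falls in lockstep with x
no_linear_trend = pearson_r(xs, [1, 3, 3, 1])   # symmetric hump, no linear trend
```

The first two data sets sit exactly on a straight line, so they land on the two ends of the range, while the last one has no linear trend at all and lands on 0.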

1:20

Okay, now that we have an understanding of what correlation is,

let's use some real data.

We're going to scroll down here and we're going to resolve some dependencies.

But then we're going to go ahead and connect to our MongoDB Atlas cluster.

And we're going to go ahead and use the movies data set.

And specifically, we're going to be building a pipeline here that's going to

be looking for movie ratings and movie votes for those ratings.

And we're going to try to determine if there is a correlation between

the number of votes and the actual rating that a movie has.

So in this pipeline we're going to use a match stage to make sure that we're

getting documents that have non-zero values for both ratings and votes.

And then we're going to go ahead and use the project stage to remove _id and

keep the two values that we care about.

And we are going to go ahead and rename them to rating and votes.
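
As a rough sketch, a pipeline like the one just described could be written as a list of Python dictionaries. The field paths imdb.rating and imdb.votes are assumptions about how the movies documents are shaped, and using $gt with 0 is just one way to express "non-zero"; neither is confirmed by this transcript.

```python
# Field paths are assumptions about the movies documents; adjust to your schema.
RATING_FIELD = "imdb.rating"
VOTES_FIELD = "imdb.votes"

pipeline = [
    {
        # Keep only documents where both values are greater than zero.
        "$match": {
            RATING_FIELD: {"$gt": 0},
            VOTES_FIELD: {"$gt": 0},
        }
    },
    {
        # Drop _id and expose the two values under shorter names.
        "$project": {
            "_id": 0,
            "rating": "$" + RATING_FIELD,
            "votes": "$" + VOTES_FIELD,
        }
    },
]
```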

Once we have our pipeline, we can go ahead and pass it to the aggregate command and

then turn it into a list.

And then from that list we can go ahead and turn it into a DataFrame,

using the from_dict function.

And now that we are in our Pandas DataFrame, we can go ahead and

take a peek at our data.

And, as you can see, we now have our data in our DataFrame.

And from here we can go ahead and

use Seaborn's jointplot method to visualize the entirety of our results.

It's also going to go ahead and fit our regression line on our results as well.

And there we go. And it looks like we do have some

correlation.

You can see we have a Pearson's r value of 0.15.

And moreover,

just by looking at the data, even without the line of best fit,

we can see that as a movie's rating increases, so does the number of votes.

So there seems to be a positive correlation, even though it's a weak

positive correlation, between the rating of a movie and

the number of votes that it received.

3:09

But let's go ahead and calculate Pearson's r by hand.

And this is the formula for

doing a single-pass calculation of Pearson correlation by hand.

There's also a multi-pass form.

But we're not going to cover that in this lesson,

because the single-pass can actually be done in aggregation.
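
The formula on screen isn't reproduced in this transcript. For reference, the standard single-pass form of Pearson's r, which lines up with the numerator and denominator steps described in the rest of this lesson, can be written as follows, where n is the number of (x, y) pairs and every sum runs over all pairs:

```latex
r = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\;
          \sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}
```

It's called single-pass because all five sums can be accumulated in one trip over the data.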

We're first going to go ahead and do this calculation in Python.

And then, after we have seen how it's done in Python,

we're then going to go ahead and see how it can be done in aggregation.

So there's a bit of groundwork that needs to be done before we can go ahead and

calculate Pearson's r.

Basically, we're going to go ahead and go through here and find each of these terms.

For every value of x we're going to go ahead and subtract the mean from it.

We're going to do the same thing for y.

And then, for these pairs of values, we're going to go ahead and

multiply them together.

And then we're going to go ahead and use those differences again from above,

and we'll calculate their squares.

And then, once we have all these different values,

we can use them together to kind of create this formula.

The first thing I'm going to do is go ahead and

make a copy of our original data frame.

I'm going to call it exm.

4:14

So the first thing we're going to do is calculate the mean of x and the mean of y.

So it's as simple as taking the sum and dividing by the total number.

We're going to store this in m_x and m_y.

And there you can see our mean for x is 6.3, so our average rating for

a movie is 6.3.

And then we have the mean for y, which would be our average number of votes,

which is about 11,700.

We can now go ahead and calculate little x and little y,

as well as xy, and x squared, and y squared.

So here we're going to go ahead and map over all the values of x,

subtracting the mean.

We'll do the same thing for y.

We're then going to zip up our ratings and votes together and map over them.

And then multiply every pair together.

We're going to call that xy.

5:04

We're then going to square every value for x and

y by mapping over all of those values.

And then we have x, y, xy, x squared, and y squared.

And then we're going to go ahead and

assign all these values into our data frame.
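
Here is a minimal sketch of those mapping steps in plain Python, using three made-up ratings and vote counts instead of the real columns. It assumes that xy is the product of the mean-subtracted pairs; the variable names mirror the ones mentioned in the lesson, but the data and the columns dictionary are purely illustrative.

```python
# Hypothetical ratings and vote counts standing in for the real columns.
ratings = [6.0, 7.0, 8.0]
votes = [100, 400, 700]

n = len(ratings)
m_x = sum(ratings) / n  # mean rating
m_y = sum(votes) / n    # mean number of votes

x = list(map(lambda v: v - m_x, ratings))         # little x: rating minus mean
y = list(map(lambda v: v - m_y, votes))           # little y: votes minus mean
xy = list(map(lambda p: p[0] * p[1], zip(x, y)))  # product of each (x, y) pair
x2 = list(map(lambda v: v ** 2, x))               # x squared
y2 = list(map(lambda v: v ** 2, y))               # y squared

# One entry per column, ready to be assigned onto a data frame.
columns = {"rating": ratings, "votes": votes,
           "x": x, "y": y, "xy": xy, "x2": x2, "y2": y2}
```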

Now let's go ahead and take a look and see what that looks like.

5:23

And as you can see, we now have a nice little data frame where we have

our original ratings, our original votes.

And then, for every one, we have an x value, a y value, an xy, an x squared,

and a y squared.

5:36

Now that we have our data frame, we can go ahead and dive into the equation itself.

First we're going to just focus on the numerator.

We're going to call this top.

We're going to begin by multiplying the number of elements,

which we've got up here, by the sum of all of those x, y multiples right here.

So we're just going to multiply those two together.

And now we have the product of those two stored in this variable.

Next we're going to go ahead and sum up all the x values and the y values, so

all of the ratings and all the votes.

Multiply those two together and store the result in that variable.

And then finally we're just going to take the difference between those two and

we're going to call that top.

And that's a very large number.

6:14

Now, let's go ahead and focus on the bottom part of our equation.

And for the moment we're going to ignore the square roots and

we're also going to divide it into a left part and a right part.

So here, we're focusing on the left part.

And first we're going to multiply the number of elements by the sum of

the squares of x, and we're going to call that product_sum_x2_elements.

And then we'll go ahead and subtract the square of the sum of x, or ratings,

from that.

And that will be our bottom left.

We can now go ahead and focus on the right-hand part of our denominator.

And this is very similar to the left-hand side,

but now we're concerned with y instead of x.

So we're going to do the same thing,

we're going to multiply the number of elements by the sum of the squares of y.

And we're then going to take that and

subtract the square of the sum of y from that.

And we're going to take a shortcut here and just take the square root

of the bottom left times the bottom right.

And that'll be our denominator.

And then, finding Pearson's r is as simple as dividing the top by the bottom.
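
Put together, the whole by-hand calculation can be sketched in a few lines of plain Python. The four data points are made up, and the sketch works on the raw values rather than the mean-subtracted ones; that is fine because subtracting a constant from every value doesn't change Pearson's r. The names top, bottom_left, bottom_right, and bottom follow the lesson, the rest are illustrative.

```python
from math import sqrt

# Hypothetical stand-ins for the rating and votes columns.
xs = [6.0, 7.0, 8.0, 9.0]
ys = [100, 400, 300, 800]

# The five running sums plus the element count.
n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(a * b for a, b in zip(xs, ys))
sum_x2 = sum(a * a for a in xs)
sum_y2 = sum(b * b for b in ys)

# Numerator: n * sum(xy) minus sum(x) * sum(y).
top = n * sum_xy - sum_x * sum_y

# Denominator: each side is n * (sum of squares) minus (square of the sum).
bottom_left = n * sum_x2 - sum_x ** 2
bottom_right = n * sum_y2 - sum_y ** 2
bottom = sqrt(bottom_left * bottom_right)

r = top / bottom
```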

7:16

And we get 0.1464.

Let's go ahead and compare this with the pearsonr function from SciPy.

And as you can see, we get the same number, which moreover

is 0.146, the same as the 0.15

that we got with Seaborn once you account for rounding.

Both methods work: doing it in Seaborn and

doing it by hand.

But they're both being done in Python, which is slower than it needs to be.

Not only is it slow, but

we're also transmitting a lot of data from MongoDB and sending it here to the client.

All that data could just be processed directly on our MongoDB cluster,

reducing the need for transferring data and doing this analysis in Python.

To remedy this, we're going to use MongoDB's Aggregation Framework.

Let's see how.

First things first, we're going to go ahead and create aliases for our two values, x and

y, just so we're speaking in the same terms as before.

And then, just like before, we're going to go ahead and

figure out the number of elements we have.

We're going to sum up the x's, sum up the y's, sum up the squares of x and y.

And sum up the multiples of x and y.

We're going to go ahead and insert these into a group stage and

then assign it to a variable called all_sums.
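
A group stage along those lines might look like the sketch below. The field paths behind x and y and the output field names (count, sum_x, and so on) are assumptions made for illustration; only the variable name all_sums comes from the lesson.

```python
# Aliases for the two values; the field paths are assumptions about the schema.
x = "$imdb.rating"
y = "$imdb.votes"

all_sums = {
    "$group": {
        "_id": None,                                 # one group for the whole collection
        "count": {"$sum": 1},                        # number of elements
        "sum_x": {"$sum": x},                        # sum of the x's
        "sum_y": {"$sum": y},                        # sum of the y's
        "sum_x2": {"$sum": {"$multiply": [x, x]}},   # sum of the squares of x
        "sum_y2": {"$sum": {"$multiply": [y, y]}},   # sum of the squares of y
        "sum_xy": {"$sum": {"$multiply": [x, y]}},   # sum of the x * y multiples
    }
}
```

Grouping on a constant _id collapses every matched document into a single result document that carries all the sums.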

Next we're going to go ahead and assemble the top part of the equation.

Aside from using aggregation syntax, it's identical to what we did above.

8:39

And similarly, for the denominator, assembling the left and

the right side is exactly the same as what we did above, but now just in aggregation.

And like before, assembling our bottom is as simple as multiplying the left and

right together and taking the square root.

We're then going to go ahead and project out the correlation,

calling it m, just by dividing the top by the bottom.
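
In aggregation syntax, those pieces could be sketched as nested expressions like the ones below. This assumes an earlier group stage has produced fields named count, sum_x, sum_y, sum_x2, sum_y2, and sum_xy; those field names and the variable name correlation are hypothetical, while top, bottom, and the output field m follow the lesson.

```python
# Numerator: count * sum_xy minus sum_x * sum_y.
top = {
    "$subtract": [
        {"$multiply": ["$count", "$sum_xy"]},
        {"$multiply": ["$sum_x", "$sum_y"]},
    ]
}

# Left side of the denominator: count * sum_x2 minus (sum_x squared).
bottom_left = {
    "$subtract": [
        {"$multiply": ["$count", "$sum_x2"]},
        {"$pow": ["$sum_x", 2]},
    ]
}

# Right side of the denominator: the same shape, for y.
bottom_right = {
    "$subtract": [
        {"$multiply": ["$count", "$sum_y2"]},
        {"$pow": ["$sum_y", 2]},
    ]
}

# Denominator: square root of left times right.
bottom = {"$sqrt": {"$multiply": [bottom_left, bottom_right]}}

# Final stage: project the correlation as m.
correlation = {"$project": {"m": {"$divide": [top, bottom]}}}
```

Because these are ordinary Python dictionaries, the smaller expressions can be built and named first and then nested into the final stage.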

We can now go ahead and assemble all of our stages together.

We are going to go ahead and do a match, like before.

We're going to go ahead and get all of our sums and

finally calculate our correlation.

9:12

Now that we've assembled our pipeline,

we can go ahead and execute it by using the aggregate command.

And we're going to go ahead and

compare it against the other values that we've calculated.

And, great, we got the same results for all three variables.

The major difference here is that we didn't need to marshal any data into

a data frame and we were able to have the entire data set be processed server side,

with MongoDB.

And that's how we calculate Pearson correlation in MongoDB.