This lesson is going to explore statistical issues that might

arise when you begin to explore multi-dimensional data sets.

So far you've seen how to visually analyze two-dimensional data sets,

and you've also learned about analyzing them

analytically by using multi-dimensional arrays.

This lesson is going to focus on some issues that

come up when we start looking at multi-dimensional data sets.

In particular we're going to look at paradoxes of

probability and how to avoid them when you're looking at multi-dimensional data,

and we're going to look at statistical misinterpretation and how to

be careful not to let your statistics lead you astray.

And lastly we're going to look at a fun website called spurious correlations which

reinforces the idea that correlation does not imply causation.

So first you're going to read this article in

The Conversation about different paradoxes of probability.

These are concepts such as Simpson's paradox, where you make

a measurement, typically involving an aggregation,

and when you interpret it the result points one way.

And yet when you separate the data out into groups, the result looks different,

and trying to understand why is what leads to Simpson's paradox.

There are other ones, like the base rate fallacy,

the Will Rogers paradox, et cetera.

These provide interesting insights

into how you can be led astray by statistical analysis,

particularly with multi-dimensional data sets.
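
To make Simpson's paradox concrete, here is a minimal sketch in Python. The counts are hypothetical, invented only for illustration (they are not taken from the article), and the names `counts`, `rate`, and `totals` are assumptions of this sketch. It shows one option winning inside every subgroup while losing once the subgroups are aggregated.

```python
# Hypothetical (successes, trials) counts, invented purely for illustration.
counts = {
    "mild cases":   {"A": (90, 100),   "B": (800, 1000)},
    "severe cases": {"A": (300, 1000), "B": (20, 100)},
}


def rate(successes, trials):
    """Fraction of trials that were successes."""
    return successes / trials


# Within each subgroup, option A has the higher success rate.
for group, by_option in counts.items():
    a = rate(*by_option["A"])
    b = rate(*by_option["B"])
    print(f"{group}: A={a:.2f}  B={b:.2f}")

# Aggregating over the subgroups reverses the ordering, because A was
# mostly tried on severe cases and B mostly on mild ones.
totals = {}
for option in ("A", "B"):
    successes = sum(counts[group][option][0] for group in counts)
    trials = sum(counts[group][option][1] for group in counts)
    totals[option] = rate(successes, trials)

print(f"overall: A={totals['A']:.2f}  B={totals['B']:.2f}")
```

The reversal comes from the unequal group sizes, which is why separating the data by the hidden dimension changes the picture.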

The next article is also in The Conversation:

the seven deadly sins of statistical misinterpretation.

It sounds really bad but that's mostly to get your attention.

So one thing to be careful about is looking at data and not

realizing that there are things, like the measurement errors, that may not always be shown.

So in the example they give here, you look at these two bar charts

and this difference looks like it's quite significant.

But if you actually have an understanding of what the error is on each measurement as

shown in the right panel you realize that the differences are within the errors,

and thus it's unlikely that there's an important difference between these two data sets.
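
As a rough sketch of what "within the errors" means, the snippet below compares the difference between two measurements to their combined uncertainty. The means and errors are hypothetical values, not read off the article's figure, and adding the errors in quadrature assumes the two measurements are independent.

```python
import math

# Hypothetical measurements as (mean, standard error); illustrative only.
mean_a, err_a = 10.0, 1.5
mean_b, err_b = 12.0, 1.5

difference = abs(mean_a - mean_b)

# For independent measurements, the errors add in quadrature.
combined_error = math.sqrt(err_a**2 + err_b**2)

# A ratio below roughly 1 means the difference is smaller than the
# combined uncertainty, so the two bars are consistent with each other.
ratio = difference / combined_error

print(f"difference = {difference:.2f}")
print(f"combined error = {combined_error:.2f}")
print(f"difference / combined error = {ratio:.2f}")
```

Without the error values, the two bars alone give no way to tell whether the gap between them matters.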

Another thing is that sometimes people assume statistical significance

implies something is important, but when you

really look at it in the real world that's not true.

And that's often an issue with sample size:

if you have a small sample, variations can

be large, and thus depending on which sample you get,

you might have a different result.
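
The effect of sample size can be sketched with a small simulation. This is an illustrative example only: the function name `spread_of_sample_means`, the sample sizes, and the normal distribution are assumptions of this sketch, not something from the article. It draws many samples from the same population and measures how much the sample mean moves from one sample to the next.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def spread_of_sample_means(sample_size, repeats=200):
    """Standard deviation of the sample mean across repeated samples."""
    means = []
    for _ in range(repeats):
        # Every sample comes from the same population: mean 0, std dev 1.
        sample = [random.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)


small_spread = spread_of_sample_means(sample_size=5)
large_spread = spread_of_sample_means(sample_size=500)

print(f"spread of means, n=5:   {small_spread:.3f}")
print(f"spread of means, n=500: {large_spread:.3f}")
```

The small samples give means that vary far more, so a result seen in one small sample may not show up in the next.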

The rest of this article goes through similar examples and these are important to

see the things that you need to be careful about as you look at data sets.

The last one is a very fun website that I like.

It talks about what are known as

spurious correlations and the idea here is that often we look

at a data set such as these shown here and

you think wow these two data sets clearly are correlated,

there must be some relationship between them.

But what this is showing us is the spending by the U.S. government on Science, Space,

and Technology correlated with suicide by hanging, strangulation, and suffocation.

There should be no correlation between these two data sets.

And so this is what's known as a spurious correlation.

One clue to this would be the two different axes,

with different labels and scales on each side.

But there's many others that you can look at here.

The number of people who drowned by falling into

a pool correlates with the number of films Nicolas Cage appeared in, et cetera.

You can go through these and see different ones.

Here's a very high correlation.

There's our R value that we learned in a previous notebook,

that's quite high, quite close to one.

So clearly there's a really strong correlation between

per capita cheese consumption and the number of people who

died by becoming tangled in their bedsheets.

Anyway, as I said this is a fun website to look at and to see that,

you know what, correlation does not imply causation.
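
To see how easily a high R value can arise, here is a minimal sketch that computes the Pearson correlation coefficient for two unrelated series. The numbers are hypothetical, made up for illustration rather than taken from the website, and `pearson_r`, `series_a`, and `series_b` are names assumed for this sketch. Both series simply drift upward over time, and that shared trend is enough to push R close to one.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical yearly values for two unrelated quantities (illustrative only).
series_a = [18.1, 18.6, 19.2, 20.0, 20.4, 21.1, 21.9, 22.3, 23.0, 23.8]
series_b = [530, 548, 561, 590, 601, 622, 640, 661, 675, 702]

r = pearson_r(series_a, series_b)
print(f"r = {r:.3f}")
```

A high value here says only that the two series move together, not that one drives the other.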

You can see that there's a relationship between

two data sets, but then you have to really think analytically.

Is there really some reason these should be correlated?

Is there some cause for this correlation?

In some cases there is, and that provides

you very important and unique insight into the data.

Often that's what we're after, and that's why we're showing

the visualization and the correlation coefficient: to help

convince people that there is a correlation and

that the correlation has an important cause within the data set.

That cause is the model that's generating the data.

So hopefully you've learned to be careful about interpreting data.

Particularly when we start going to multi-dimensional data sets,

there are a lot of new things that come into play that we have to be careful about.

You should feel free to discuss these on the course forum.

If you find anything else on the spurious correlations website,

maybe you can share that.

And of course if you have any questions,

let us know. Good luck.