0:05

Welcome back.

The question of why things happen is at the core of just about everything we do

in analytics.

Whether we're looking at what happened in the past, what will happen in the future,

or what we should do about it,

it's in our nature to seek to understand what's really going on.

We want to know why.

We want to know what caused something to happen or

what will cause something to happen.

Not only does that knowledge help us today, but it helps grow our understanding

of how things work and informs how we think about our business.

However, the real world is complicated and really

understanding the causes of the events we see is much easier said than done.

In this video, we're going to talk about the ideas of correlation and causation and

address one of the most common errors that occurs in analytical work,

specifically mistaking one for the other.

So what is correlation?

In the simplest terms, correlation is a mutual connection or

relationship between two or more things.

When we think about correlation in analytics, we're usually referring to how

we see characteristics vary in relation to each other.

When one measure is higher or lower, does another measure vary in a consistent or

predictable way?

If so, we'd say the measures are correlated.

The simplest way we normally see correlation is when we plot the values of

one measure or variable versus another.

Here's an example.

Here we have a plot of average height versus average weight for

women in the United States.

As we might expect, as height increases, so does weight.

In fact, the relationship between the two almost looks like a straight line or

what we call a linear relationship.

However, there are a number of regular and

irregular patterns we might see when looking for relationships.

Here are a few.

1:51

Some are negative correlations, like the negative linear and

negative exponential pattern.

In these patterns when one value goes up the other value goes down.

And the others are more complicated.

The quadratic, threshold step, and

cyclical patterns all suggest a relationship between the two values.

When we're dealing with data in the real world, the relationships we see

aren't this clean.

There's usually some noise or variation in our measures, so

our visualizations are a bit fuzzier.

As we can see in this example diagram, the strength of the correlation can be

stronger or weaker depending on how tightly related the values are.

These are examples of variations on a linear relationship.

But the same would be true around any of the other types of patterns we'd observe.

There are more specific measures of correlation that apply in statistics and

mathematics.

The most common is the Pearson Correlation Coefficient, which measures the degree

to which there is a linear relationship between two variables.

We usually see this value represented by the letter r.

2:48

Now, our objective in this video is not to get too deep into the math itself, so

we won't be presenting the equation that we use to calculate r.

But to make a long story short, if we have two sets of measures that are perfectly

positively correlated, we have r=1, like the first diagram.

Conversely if our measures are perfectly negatively correlated,

we have r=-1 like the last diagram.

And if there is no correlation at all we have r=0 like the middle diagram.

Of course it's generally the case that we see something that's in between like

the remaining cases.
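To make r concrete, here's a minimal sketch in Python using NumPy; the data is simulated purely for illustration, not taken from the course diagrams:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# Perfectly positively correlated with x: r = 1
y_pos = 2 * x + 3
# Perfectly negatively correlated with x: r = -1
y_neg = -2 * x + 3
# Independent of x: r near 0
y_none = rng.normal(size=1000)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
r_none = np.corrcoef(x, y_none)[0, 1]

print(round(r_pos, 2), round(r_neg, 2), round(r_none, 2))
```

Note that a linear transformation of x (any positive slope) gives exactly r = 1, while an independent variable only hovers near zero because of sampling noise.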

You might recall a similar measure that shows up in the result of a linear

regression.

Usually there's something called the coefficient of determination, or

r squared, which is just the square of r.

And like r, it's a measure of the strength of an observed relationship between two

sets of values.

We bring it up here just so you make the connection to that type of analysis.
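As a sketch of that connection, using simulated data and an ordinary least-squares line fit, we can check that the coefficient of determination from a simple linear regression equals the square of Pearson's r:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 1.5 * x + rng.normal(scale=0.5, size=500)  # noisy linear relationship

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# Coefficient of determination from a one-variable least-squares fit:
# R^2 = 1 - (residual variation) / (total variation)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()

# For simple linear regression, R^2 is exactly r squared
print(abs(r_squared - r**2) < 1e-6)
```

This identity holds only for simple (one-predictor) linear regression; with multiple predictors, R squared no longer corresponds to a single pairwise r.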

It's not uncommon for someone to show up with a regression analysis that has a high

r squared value and who wants to jump straight to causality.

Since causality is what we really wanted to get after anyway,

let's shift gears a bit and talk a little bit about that.

Causation means that one event or

state is the result of the occurrence of another event or state.

In other words, there is a cause and effect relationship between two or more ideas.

When we see data that implies a relationship, a causal relationship is

one option for what's really going on.

But it's not the only one.

Let's assume we have two ideas, A and B, and we've observed a correlation between them.

How might A and B actually be related?

Well, we might suggest that A causes B.

Conversely, we could also suggest that B causes A.

However, it could also be the case that there is a third factor,

let's call it C, that actually causes both A and

B, such that there really is no causal relationship between A and B.

It's also possible that A does cause B or vice versa but

the causation actually happens through C as an intermediate factor.

Finally it can also be the case that there's actually no relationship at all

and what we are seeing in the data is pure coincidence.
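One way to see the third-factor case is to simulate it. In this hypothetical sketch, a hidden factor C drives both A and B, and A and B come out strongly correlated even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Hidden common cause C (think of a third factor neither A nor B "sees")
c = rng.normal(size=n)

# A and B are each driven by C plus their own independent noise;
# there is no direct causal link between A and B.
a = c + rng.normal(scale=0.3, size=n)
b = c + rng.normal(scale=0.3, size=n)

r_ab = np.corrcoef(a, b)[0, 1]
print(round(r_ab, 2))  # strong correlation despite no direct causation
```

If we could condition on C (for example, compute the correlation of A and B within narrow bands of C), the apparent relationship would largely disappear.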

This is where we sometimes get ourselves into trouble.

As data analysts,

we're kind of hardwired to believe that there's an answer in the data somewhere.

And it makes it really hard to accept that there might not actually be a relationship

in something that might look clearly related to our eye.

But it turns out that it's not too hard to find examples of two things that seem

to correlate almost perfectly with each other, but

which in fact are completely unrelated.

If you haven't already visited Tyler Vigen's Spurious Correlations website,

I suggest you pause this video and take a few minutes to scroll through some of

the more entertaining examples of this you'll find.

For those of you who can't get there now, here are a few examples.

5:37

Each one of these shows a high degree of correlation between two completely

unrelated sets of values.

So, the point here is we need to resist the urge to assume relationships exist,

when it's possible they don't.

The other mistake we tend to make is that we get too focused on the ideas we have in

front of us and forget to consider the influence of other factors.

The specific error I see more than any other is assuming a causal relationship

between two characteristics or events when there's really a third factor

causing them both, the third relationship we saw earlier.

Let's illustrate this using an example.

Let's say that we're a wireless carrier, and

we're trying to assess the impact of people accessing their account online.

Among other things, we look at the simple relationship between historical account

access by an individual and

the likelihood that the individual will cancel their service.

What we basically find is that people who have accessed their account

are half as likely to cancel.

The business manager in charge of customer retention says, this is great!

All we need to do is

incent people to use the web and we could reduce our cancel rate by half.

6:35

What's wrong with this interpretation?

Well, it depends on whether or not you believe that the act of accessing the web

is influencing someone's likelihood to cancel.

It turns out that it's far more likely that a third factor like age,

comfort with technology, or just willingness to engage with the company,

is driving both the likelihood to use the web and the likelihood to cancel.

Simply incenting people to use the web isn't likely to change

any of these underlying causal factors and

therefore isn't likely to have an impact on cancellation.

This may seem like a really obvious example but it actually happened.

In fact, I've seen this exact scenario and rationale around web access and

cancellation come up not just once but at least

three times in three different places during my time in the wireless industry.

So it definitely happens.

Even very smart people can make silly mistakes sometimes.

So how do we avoid mistaking correlation for causation?

Is there a way we can prove that causation does in fact exist?

More often than not, the answer is no.

Proving causation is pretty hard.

But what we can do is eliminate alternate explanations, either through context or

by showing empirically that other relationships can't exist.

In our web cancellation example, we used reasoning based on our knowledge of

the industry and likely customer behavior to question the assumption of causality.

Again, context turns out to be critical in interpreting relationships in data.

It should be the first line of defense in avoiding mistakes.

Can we think of any other plausible explanations for what we see?

Is there any other data that would contradict the assumed relationship?

We can also apply simple ideas like temporal precedence.

For a relationship to be causal,

the causing factor needs to be present before we see the effect.

If we see something that we think is an effect happen before its cause,

we know that we have the wrong relationship.

It turns out that one of the best ways to isolate causation is to run a controlled

experiment.

In a simple controlled experiment,

I normally isolate two randomly selected groups of subjects and

apply a treatment to one group while not applying that treatment to the other.

I call these the treatment and control groups, respectively.

I then observe differences between the two groups.

If I observe a difference between the treatment and the control and I've ensured

that the only thing that differed between the groups is the treatment that was

applied, then I have strong evidence that the treatment caused the difference.
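As a rough sketch of that logic, here's a simulated experiment (all the numbers are made up for illustration) in which random assignment ensures the only systematic difference between groups is the treatment, so the difference in group means recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Each subject has a baseline outcome before any treatment
baseline = rng.normal(loc=10.0, scale=2.0, size=n)

# Randomly assign roughly half the subjects to the treatment group
assignment = rng.random(n) < 0.5

# Assume the treatment truly adds +1.0 to the outcome
true_effect = 1.0
outcome = baseline + true_effect * assignment

treatment_mean = outcome[assignment].mean()
control_mean = outcome[~assignment].mean()
estimated_effect = treatment_mean - control_mean
print(round(estimated_effect, 1))  # close to the true effect of 1.0
```

Because assignment was random, baseline differences average out between the groups; without randomization, a confounder could make this same difference-in-means estimate badly biased.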

If we have the luxury of an experiment that's great, but if we don't,

we fall back on our context, logic and alternate explanation based approaches for

assessing causality.

So let's circle back to where we started.

As analysts, we're constantly asking the question why, and

finding causal relationships in data is a big part of answering that question.

But it's important to recognize the pitfall we can fall into in

mistaking correlation for causation.

In this module we'll continue to explore ways we can fail to

interpret data correctly, and learn what we can do to avoid those mistakes.
