[MUSIC] Let's look at the next follow-up question that helps elicit the reasons why groups may not be probabilistically equivalent. The question is: is there a confound that could explain the outcome? In other words, what other than group membership could explain the outcome? This question is the data science equivalent of asking, were we unlucky?

Consider this example. Suppose an auto manufacturer that specializes in all-wheel-drive vehicles experiences slow sales during the first three weeks of January 2015 in the New England sales region. As a result, the manufacturer decides to offer a sales incentive of $2,500 to any customer who purchases a car between January 22nd and February 11th, 2015. The results are very encouraging: sales in the New England region increase by 33% during the sales promotion compared to the prior three weeks in January. This type of increase does not usually happen, and the promotion is considered a success.

Now let's ask the question: is there a confound that could explain the outcome? Well, if you lived in Boston during the winter of 2015, this is an easy question to answer. Between January 26th and February 10th of that winter, Boston experienced two of the top ten winter storms on record. Why does this matter? Traditionally, vehicles from that manufacturer sell well during periods of heavy snowfall, because they have a reputation for handling extremely well in the snow. The heavy snowfall is the confound: both snowfall and promotions tend to increase sales, and in this case the two coincided in time. This breaks probabilistic equivalence. Promoted and non-promoted cars differ by more than whether the manufacturer offered a sales incentive; they also differ by whether they were sold during a period of heavy snow.

So "is there a confound that could explain the outcome?" is the third and last follow-up question to ask when we have trouble answering directly whether there are any pre-existing differences between groups.
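The logic of the snowfall example can be made concrete with a minimal simulation. All numbers below are invented for illustration (they are not taken from the case): we assume a baseline level of weekly sales, a true promotion effect, and a larger snow effect that happens to coincide in time with the promotion. A naive before/after comparison then attributes the snow effect to the promotion.

```python
import random

random.seed(42)

# Invented parameters for illustration only.
BASE = 100          # baseline weekly sales
PROMO_LIFT = 10     # true effect of the promotion
SNOW_LIFT = 25      # effect of heavy snow on sales

def weekly_sales(promo, snow):
    """Simulated sales for one week, with some random noise."""
    noise = random.gauss(0, 5)
    return BASE + PROMO_LIFT * promo + SNOW_LIFT * snow + noise

# Weeks 1-3: no promotion, no heavy snow.
before = [weekly_sales(promo=0, snow=0) for _ in range(3)]
# Weeks 4-6: the promotion runs, and heavy snow hits at the same time.
during = [weekly_sales(promo=1, snow=1) for _ in range(3)]

# The naive estimate compares "during promotion" to "before promotion",
# so it picks up the snow effect as well as the promotion effect.
naive_lift = sum(during) / 3 - sum(before) / 3
print(f"Naive estimate of promotion effect: {naive_lift:.1f}")
print(f"True promotion effect: {PROMO_LIFT}")
```

The naive estimate comes out far above the true promotion effect of 10, because the snow effect is bundled into it. Breaking the coincidence, for example by running the promotion in some regions but not others during the same weeks, is what restores probabilistic equivalence.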
This question is particularly useful when we make comparisons over time. Many things tend to change over the course of a year, so when we compare groups over time, it is quite common that the groups are not probabilistically equivalent. Asking whether there may be confounds, that is, anything that changes over time and might affect the outcome, helps catch that.

This completes our checklist. If we can answer all the questions with a clear no, we are golden: what our analytics shows can be interpreted as cause and effect. For example, we can say our ad increased revenues by 28%. We can say this because the group of consumers who saw the ad spent 28% more than the group of consumers who did not see the ad, and there were no pre-existing differences between the two groups. If we are forced to answer any of these questions with a yes, we have a problem: we can no longer interpret our analytics as saying something about cause and effect, because the groups are not probabilistically equivalent.

Now, if you reach the end of the questions and say, "I am not sure how to answer one or more of these questions," there's a good chance you don't know your business as well as you should. Let me give you an example. Here's a causal statement that you will immediately recognize as nonsense, ready? America is in a law-enforcement crisis, and the reason I know this is that wherever the police show up, crime rates increase. Take a second to think about this. I started with data showing that crime and police presence are positively correlated. But instead of concluding that crime causes police to show up, I interpreted the data as saying that police presence causes crime to occur. Why did you recognize this so quickly as nonsense? Because as a citizen, you have deep domain expertise in how law-enforcement resources are allocated. You know from experience that police respond to 911 calls, that they respond to accidents, and so on. Now think back to the Pentathlon case.
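The police example can also be sketched numerically. In the hypothetical simulation below, police presence is, by construction, purely a response to crime; crime does not depend on police at all. The two series still come out strongly positively correlated, which is exactly the pattern the nonsense "police cause crime" reading misinterprets. All numbers are invented for illustration.

```python
import random

random.seed(0)

# Invented data: crime levels across 500 areas, and police presence
# that responds to crime (half an officer-unit per incident, plus noise).
crime = [random.uniform(0, 100) for _ in range(500)]
police = [0.5 * c + random.gauss(0, 10) for c in crime]

def corr(xs, ys):
    """Pearson correlation, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

r = corr(crime, police)
print(f"Correlation(crime, police): {r:.2f}")
```

The correlation is strongly positive even though, in this toy world, police presence has no causal effect on crime whatsoever. The data alone cannot tell these two causal stories apart; the domain expertise does.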
If you had been intimately familiar with email targeting at Pentathlon, how long would it have taken you to realize that the table Francois Capri presented should not be interpreted as showing that higher email frequency causes higher revenues? Not very long at all. You would have known that the decentralized email policy leads Pentathlon to send more emails to better customers. Here's what I want you to take away from this: domain expertise is one of the best weapons against bad analytics. Data scientists benefit tremendously from business people with deep domain expertise. Not only does this help them work on problems that are important to the business, it also improves the quality of the analytics. This is another reason why analytics needs you. [MUSIC]