So let's consider a classic application of Bayes' rule to a big data problem: spam filtering. Here our task is to determine whether an email message is spam, that is, the probability that a message is spam given the words it contains.

With Bayes' rule, you can express that probability as the probability that a message is spam overall, multiplied by the probability of seeing these particular words in the message given that we already know it's spam, all divided by the probability of seeing these words in the message.

Now, the interesting term here is the denominator. The probability of the words appearing doesn't involve the unknown label of whether the message is spam or not, and all we're trying to do is compare the relative likelihood of spam versus not spam. Dividing both by the same constant factor, the probability of seeing these words, doesn't change our decision at all. We don't care about the actual number, we just care about the decision: spam or not spam.

Okay, so fine. Before we get rid of the denominator, what do we mean by "the words"? Well, you can rewrite this as the probability that the message is spam given that the word viagra appears in the message, and that the word rich appears in the message, and that a word that's perhaps more innocuous, like friend, appears in the message, and so on, for all the words in the English language, or at least all the words of interest to us in this task.

Now, the numerator can be rewritten in the following way. Since it's a conditional probability, we can apply the chain rule, a repeated application of the definition of conditional probability: the probability that the message is spam, multiplied by a product of conditional word probabilities.
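The Bayes'-rule step and the chain-rule expansion can be written out as follows (a sketch in standard notation, with $S$ for the event "the message is spam" and $w_1,\dots,w_n$ for the words):

```latex
P(S \mid w_1, \dots, w_n)
  = \frac{P(S)\, P(w_1, \dots, w_n \mid S)}{P(w_1, \dots, w_n)}

% The denominator is the same for both classes, so only the numerator
% matters for the decision. Expanding it with the chain rule:
P(S)\, P(w_1, \dots, w_n \mid S)
  = P(S)\, P(w_1 \mid S)\, P(w_2 \mid S, w_1)\,
    \cdots\, P(w_n \mid S, w_1, \dots, w_{n-1})
```

Each factor conditions on all the words that came before it, which is exactly the long product described next.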
This expression can be expanded as the probability of seeing the word viagra given that the message is spam, multiplied by the probability of seeing the word rich given that it's spam and that it contains viagra, multiplied by the probability of the remaining words given that it's spam, that it contains viagra, that it contains rich, and so on. So this is a long, complicated product of conditional probabilities.

And this is where the Naive Bayes assumption comes in. Under the Naive Bayes assumption, we say that the appearances of the different words in a message are completely independent: you're no more likely to see the word rich in a message that contains the word wealth than in one that doesn't. This isn't true; obviously words go together, there are co-occurrence rates. But you just ignore that and treat everything as completely independent, which allows you to simplify the expression into a plain product of per-word probabilities: the probability of seeing viagra given that the message is spam, times the probability of seeing rich given that it's spam, and so on.

Now, how do you get these probabilities? Well, you have data, right? You have a set of documents that have been pre-labeled as spam, and you can count how many of them contain the word rich and divide by the total number.

So now you can calculate the probability of each of the two classes, spam and not spam, and apply a decision procedure to make the call. A simple one is just to pick whichever class is more likely, whichever has the higher probability. This is called the MAP decision rule, which stands for maximum a posteriori.
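The whole pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: it treats each document as a set of words, estimates the per-word probabilities by counting labeled documents as described, and applies the MAP rule. The function names (`train`, `classify`) are illustrative, and it adds simple add-one smoothing so a word never seen in one class doesn't zero out the whole product (the lecture itself uses plain relative frequencies).

```python
from collections import defaultdict

def train(labeled_docs):
    """Count class frequencies and per-class word frequencies.

    labeled_docs: list of (set_of_words, label) pairs,
    where label is e.g. "spam" or "ham".
    """
    class_counts = defaultdict(int)                      # docs per class
    word_counts = defaultdict(lambda: defaultdict(int))  # word -> class -> doc count
    for words, label in labeled_docs:
        class_counts[label] += 1
        for w in words:
            word_counts[w][label] += 1
    return class_counts, word_counts

def classify(words, class_counts, word_counts):
    """MAP decision rule: return the class with the higher
    (unnormalized) posterior P(class) * prod_w P(w | class)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total                                # prior P(class)
        for w in words:
            # add-one smoothing: unseen words get a small nonzero probability
            score *= (word_counts[w][label] + 1) / (n + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With a handful of pre-labeled toy documents, `classify({"viagra"}, ...)` comes out spam and `classify({"friend"}, ...)` comes out ham, because the smoothed frequency of each word is higher in its own class. Real filters work with log-probabilities instead of this raw product, since multiplying thousands of small factors underflows floating point.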