Welcome back. In this video, I will show you how you can model whole sentences using N-gram probabilities. This will allow us later to generate text. So let's find the probability of a sentence, or an entire sequence of words. How would you calculate the probability of the sentence "the teacher drinks tea"?

To get started, let's refresh your memory of conditional probability and the chain rule. Here, the conditional probability is the probability of word B given that word A appeared just before it. You can express the rule for the conditional probability in terms of the joint probability of A and B, which is P(A, B). So the probability of B given A is equal to the probability of A and B divided by the probability of A. You can rearrange this rule so that the probability of A and B is equal to the probability of A times the probability of B given A. This can be generalized to the chain rule, which describes the joint probability of longer sequences. So the probability of a sequence with word A followed by word B, then word C, then word D is equal to the probability of A, times the probability of B given A, times the probability of C given A and B, times the probability of D given A, B, and C. That was a long one, but it should make intuitive sense as your eyes follow along the equation: the probability of each successive word is conditioned on all the words that came before it in the sequence.

You can now apply the chain rule to answer your question about the probability of the sentence "the teacher drinks tea". The probability of this sentence is equal to the probability of "the", times the probability of "teacher" given "the", times the probability of "drinks" given "the teacher", times the probability of "tea" given "the teacher drinks". This is the ideal scenario, where you actually have enough data in the training corpus to calculate all of these probabilities. However, this direct approach to sequence probability has its limitations. Since natural language is highly variable, your exact query sentence, and even its longer sub-phrases, are very unlikely to appear in the training corpus.

Let's apply the chain rule to the sentence "the teacher drinks tea". The formula would look like this. For the final term, you would look for the counts of both the input sentence, "the teacher drinks tea", and its prefix, "the teacher drinks". Since neither of them is likely to exist in the training corpus, their counts are 0, and the formula for the probability of the entire sentence can't give a probability estimate in this situation. As the sentence gets longer, the likelihood that more and more words will occur next to each other in this exact order becomes smaller and smaller. For example, the likelihood that "the teacher drinks" appears in the corpus is smaller than the probability of the word "drinks" on its own.

What if, instead of looking at all the words before "tea", you just consider the previous word, which is "drinks" in this case? This is what those calculations would look like: the probability of "teacher" given "the", the probability of "drinks" given "teacher", and the probability of "tea" given "drinks". If you multiply these together, you get an approximation of the probability of the whole sentence "the teacher drinks tea". Once the bigram approximation is applied, the formula to estimate the sentence probability is simplified.
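To make this concrete, here is a minimal sketch in Python of the bigram approximation for "the teacher drinks tea". The tiny corpus, the count tables, and the bigram_prob helper are illustrative assumptions rather than anything shown in the video; with real data the same counts would come from a large training corpus.

```python
from collections import Counter

# Toy training corpus (illustrative only); in practice this would be much larger.
corpus = "the teacher drinks tea and the student drinks coffee".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(word, prev_word):
    """P(word | prev_word) = count(prev_word, word) / count(prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

# Bigram approximation of P("the teacher drinks tea"):
# P(the) * P(teacher | the) * P(drinks | teacher) * P(tea | drinks)
sentence = ["the", "teacher", "drinks", "tea"]
prob = unigram_counts[sentence[0]] / len(corpus)  # unigram probability of the first word
for prev_word, word in zip(sentence, sentence[1:]):
    prob *= bigram_prob(word, prev_word)

print(prob)  # 2/9 * 1/2 * 1 * 1/2 = 0.0555...
```

Note that each factor only needs counts of word pairs and single words, which are far more likely to appear in a corpus than the full sentence or its long prefixes.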
So now it's the product of the conditional probabilities of words and their immediate predecessors, in other words, the product of the probabilities of bigrams. You just applied the Markov assumption: instead of conditioning on all the previous words, you looked at only one previous word. You also could have considered the two previous words, or any n number of previous words. The Markov assumption states that the probability of each word depends only on a limited history of length n; it ignores all the words that came before that window. So for bigrams, you approximate the conditional probability of a word Wn given the history W1 to Wn-1 by the conditional probability of the word Wn given just the word Wn-1. That means you only need to look at the previous word. With N-grams, you only look at the n-1 previous words when calculating the conditional probability in the chain rule. The bigram formula to estimate the sentence probability is now the product of the conditional probabilities of the words and their immediate predecessors. As you may recall, this is in contrast with Naive Bayes, where you approximated sentence probability without considering any word history. The first term of the bigram formula relies on the unigram probability of the first word in the sentence. That's coming up next. You now know how to compute the probability of a sentence given N-gram probabilities. Next, I'll show you how to compute these N-gram probabilities from a corpus.
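Here is a similar sketch generalized to N-grams under the Markov assumption, where each word is conditioned only on the previous n-1 words. The ngram_sentence_prob function and the toy corpus are hypothetical illustrations, not part of the course code; for simplicity the sketch treats the first n-1 words of the sentence as given context, which in practice is usually handled with start-of-sentence padding.

```python
from collections import Counter

def ngram_sentence_prob(sentence, corpus_tokens, n=2):
    """Approximate P(sentence) with an n-gram model: under the Markov
    assumption, each word is conditioned only on the previous n-1 words."""
    # Count n-grams and their (n-1)-word contexts in the training corpus.
    ngram_counts = Counter(zip(*(corpus_tokens[i:] for i in range(n))))
    context_counts = Counter(zip(*(corpus_tokens[i:] for i in range(n - 1))))

    prob = 1.0
    for i in range(n - 1, len(sentence)):
        context = tuple(sentence[i - n + 1:i])
        ngram = context + (sentence[i],)
        if context_counts[context] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= ngram_counts[ngram] / context_counts[context]
    return prob

corpus = "the teacher drinks tea and the student drinks coffee".split()
print(ngram_sentence_prob(["the", "teacher", "drinks", "tea"], corpus, n=2))
```

Setting n=2 reproduces the bigram case above; larger n keeps more history but makes unseen contexts, and therefore zero counts, more likely.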