This week you'll create an n-gram language model from a text corpus. Recall that a text corpus is typically a large body of text documents, such as all the pages on Wikipedia, all books from one author, or all tweets from one account. A language model is a tool that calculates the probabilities of sentences. You can think of a sentence as a sequence of words. Language models can also estimate the probability of an upcoming word given a history of previous words. For example, if you begin an email with "Hello, how are," your email application may guess that the next word you want to write is "you," as in "Hello, how are you?" In the assignment, you'll build your own n-gram language model and apply it to autocomplete a given sentence. First, your system will process a text corpus into a language model. Users of your autocomplete system will provide the starting words of a sentence. The model you build will then let you query for the most likely word following the start of that sentence, and the system will output that word as a suggestion to complete the sentence.

While you're going to focus on the autocomplete application this week, language models are widely used in other areas as well. In speech recognition, language models help convert the acoustic model's output into real words. For example, an acoustic model may convert speech into several candidate transcriptions that sound alike; the speech-to-text system can then use a language model to recognize that the sentence was far more likely "I saw a van" than "eyes awe of an." Language models are also used in spelling correction to identify words that should be replaced based on the sentence they're in. Probabilities are also important for augmentative communication systems, which take a series of hand gestures from a user to help them form words and sentences. People like Stephen Hawking, who are unable to physically talk or sign, can instead use simple movements to select words from a menu, and the system can speak for them. Word prediction using language models can be used to suggest likely words for the menu.

Here's what you'll be doing this week. First, you'll transform your raw text corpus into a language model that returns the probability of the next word given the previous words of a sentence. Next, you'll adapt your language model to deal with words the model hasn't seen during training; these words are called out-of-vocabulary words. Smoothing is another technique for dealing with previously unseen input: it essentially pretends that each word and phrase appears in the training corpus at least once, so the probability of even unseen words and sentences can still be estimated. Finally, I'll show you how to choose the best language model with the perplexity metric, a new tool for your toolkit. When you combine these skills, you'll be able to successfully implement a sentence autocompletion model in this week's assignment. Minimal code sketches of these ideas follow below.
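To make the core idea concrete, here is a minimal sketch of autocomplete built from bigram counts. It is not the assignment's required interface; the toy corpus and the function name `suggest_next` are illustrative assumptions.

```python
from collections import defaultdict, Counter

# Toy corpus; in the assignment you'll process a much larger one.
corpus = "hello how are you . hello how is it going . how are things".split()

# bigram_counts[w1][w2] = number of times w2 followed w1 in the corpus
bigram_counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigram_counts[w1][w2] += 1

def suggest_next(prev_word):
    """Return the most likely word to follow prev_word, or None if unseen."""
    followers = bigram_counts.get(prev_word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(suggest_next("how"))  # -> 'are' (follows 'how' twice, vs. 'is' once)
```

Conditioning on only the single previous word is the bigram case; an n-gram model generalizes this by conditioning on the previous n-1 words.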
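For out-of-vocabulary words, one common approach is to close the vocabulary by frequency and replace rare words with a special unknown token before training. This is a sketch of that idea; the `min_count` threshold and the `<unk>` token name are conventions assumed here, not requirements.

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    """Keep only words that appear at least min_count times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}

def replace_oov(tokens, vocab, unk_token="<unk>"):
    """Map every word outside the vocabulary to the unknown token."""
    return [word if word in vocab else unk_token for word in tokens]

tokens = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(tokens)  # {'the', 'cat'}
print(replace_oov(tokens, vocab))
# ['the', 'cat', '<unk>', '<unk>', 'the', '<unk>', 'the', 'cat', '<unk>']
```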
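"Pretending each word and phrase appears at least once" is exactly what add-one (Laplace) smoothing does. Here is a sketch of the smoothed bigram probability, reusing the `bigram_counts` structure from the first sketch:

```python
def laplace_bigram_prob(w1, w2, bigram_counts, vocab_size):
    """P(w2 | w1) with add-one smoothing: every bigram count is inflated
    by 1, so unseen pairs get a small nonzero probability instead of 0."""
    count_bigram = bigram_counts[w1][w2]          # 0 if never seen
    count_prev = sum(bigram_counts[w1].values())  # how often w1 appeared
    return (count_bigram + 1) / (count_prev + vocab_size)
```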
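Finally, a sketch of perplexity for a bigram model, computed in log space for numerical stability. Here `prob_fn` is assumed to be any conditional probability function, such as the smoothed one above.

```python
import math

def perplexity(sentence, prob_fn):
    """Perplexity is the inverse probability of the sentence, normalized
    by its length; lower perplexity means a better language model."""
    log_prob = 0.0
    for w1, w2 in zip(sentence, sentence[1:]):
        log_prob += math.log(prob_fn(w1, w2))
    num_bigrams = len(sentence) - 1
    return math.exp(-log_prob / num_bigrams)
```

When comparing candidate models on the same held-out text, you'd pick the one with the lowest perplexity.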