Our next topic within text summarization is a very popular way to summarize texts, or to detect topics in a body of texts, called LDA topic modeling. This definition is from Dr. Blei: topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. Topic modeling was really one of the first accepted, or popularized, ways to analyze topics of conversation, or topics within text, in a computational format.

What I'm going to try to do with this figure is explain at a high level how LDA topic modeling works. I believe the best way to explain it is to go backwards, starting from how a computer envisions documents being written. Let me say that again: we're going to think about how a computer would write these documents, if it were writing them under the LDA topic modeling framework.

What you see here, on both the left-hand side and the right-hand side, are topic boxes. On the right-hand side you can see the words that are included in these topic boxes, and on the left-hand side you can see the percentages that the computer believes each of these topics should make up in the documents themselves. For Topic 1 on the right side you see dog, cat, horse, and on the left-hand side you see Topic 1 at 33.3 percent. Basically, when the computer creates these documents, it's going to make sure that one-third of the time you see words such as dog, cat, and horse from Topic 1. Topic 2 has computer, hard drive, monitor, and on the left-hand side, again 33.3 percent, so another third.

You can see the arrow labeled "creation" on the left-hand side: the computer starts creating these documents, and in this case we have three of them, labeled Document 1, D2, and D3. Now, what the computer writes obviously doesn't read like real text, because these aren't sentences; there are no conjunctions or any of those various things. You just see a combination of words that reflects the percentages for each one of these topics, so in Document 1 you see computer, dog, cat, apple, pumpkin, hard drive, monitor, orange, horse.

Now, if you understand that this is how a computer, within the LDA topic modeling framework, would construct a document, then here is how LDA topic modeling actually works: under the computer's assumption that this is how documents are created, the computer is going to backtrack and try to figure out, given all the words in all the documents you've now handed it, which topics those words belong to and what percentages each topic makes up (33.3 percent here, though it could obviously be different percentages). Now, even as I explain this, I want to mention that after all these years of conducting LDA topic modeling and knowing how it works, it's still confusing, even for me. Part of the reason is that computers make jumps that we as humans don't naturally make, so if this feels a bit complicated, I understand. If we assume that this is how the computer creates these documents, then the LDA topic modeling process is just going backwards, backtracking.
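To make that generative story concrete, here is a minimal Python sketch of a computer "writing" documents the way LDA imagines they are written. The topics, words, and equal one-third weights mirror the figure from the lecture; this is only an illustration of the assumed process, not an implementation from the slides.

```python
import random

# The topic boxes from the figure: each topic is a bag of words, and each
# document draws a third of its words from each topic.
topics = {
    "topic_1": ["dog", "cat", "horse"],
    "topic_2": ["computer", "hard drive", "monitor"],
    "topic_3": ["apple", "pumpkin", "orange"],
}
topic_weights = {"topic_1": 1 / 3, "topic_2": 1 / 3, "topic_3": 1 / 3}

def write_document(n_words=9):
    """'Write' a document the way LDA assumes documents are written."""
    words = []
    for _ in range(n_words):
        # Pick a topic according to the document's topic proportions...
        topic = random.choices(
            list(topic_weights), weights=topic_weights.values()
        )[0]
        # ...then pick a word from that topic's word list.
        words.append(random.choice(topics[topic]))
    return " ".join(words)

for d in range(3):
    print(f"Document {d + 1}:", write_document())
```

The output is exactly the kind of "document" described above: no sentences or conjunctions, just a mix of words whose proportions reflect the topic percentages.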
The first step in LDA topic modeling is to go through each document and randomly assign each word in the document to one of the K topics. Let's just pretend we have three topics, like before. For Document 1, Document 2, and Document 3, you go word by word and randomly assign each word to Topic 1, Topic 2, or Topic 3. Now, notice, as the second point says, that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics. We could just stop here, and we would have a horrible estimate of the three different topics and the words associated with them, because we just assigned everything randomly. That's why the slide says, "albeit not very good ones."

Now, to improve on these estimates, to make sure the words actually end up in the right topic (Topic 1, Topic 2, or Topic 3) for each document, we go through each word again. The key is that every time we hit a word, we assume that all the other words are correctly assigned to their topics, and only the word we're looking at right now is not correctly assigned. I put the mathematical notation on the slide; you don't have to focus too much on it, but it's what is frequently used, so I did want to include it. For each topic, we just need to compute two things. First, the variable x: the proportion of words in document d, the current document we're on, that are currently assigned to topic t. What we're asking is, how many of the words in the current document are already assigned to Topic 1, Topic 2, or Topic 3? Then, the variable y: the proportion of assignments to topic t, over all the documents, that come from this word. That is, for the word we're currently looking at, how often is that word assigned to topic t, whichever of Topic 1, 2, or 3 that is, across all the documents? Then we compute the product x times y: the probability that topic t, whatever it might be, generated the word we're looking at right now. With that probability, we then either assign the word to a new topic or keep it in its current one.

Remember, in this step we're assuming that all topic assignments except for the current word in question are correct, and we're iteratively updating the assignment of the current word using our model of how documents are generated, which is what I described at the beginning. Now, this sounds like magic, and it may not even seem like it should work, but this is actually what's called a Gibbs sampling process. If you want to look more into it, you can do some reading on the mathematical process called Gibbs sampling. After you repeat this process a large number of times, which for a computer is just over and over again, you eventually reach a rough steady state where your assignments are pretty good, so that the words fit together: data, number, computer. I do want to make the point that when you do LDA topic modeling, many times, even though the computer runs numerous iterations, the words don't look like they belong together. Again, this is a probability-based process; sometimes the result doesn't come out as clean as this.
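Here is a rough Python sketch of that resampling step on a toy corpus, following the x and y definitions above. The word lists are placeholders, and real implementations of collapsed Gibbs sampling add Dirichlet smoothing priors (usually called alpha and beta), which I've left out so the x times y structure stays visible.

```python
import random

# Toy corpus: three tiny "documents" like the ones in the figure.
docs = [
    ["dog", "cat", "computer", "apple"],
    ["horse", "monitor", "hard_drive", "dog"],
    ["pumpkin", "orange", "cat", "apple"],
]
K = 3  # the number of topics we believe exist

# Step 1: randomly assign every word in every document to one of K topics.
assignments = [[random.randrange(K) for _ in doc] for doc in docs]

def resample(d, i):
    """Reassign word i of document d, assuming every other assignment is correct."""
    word = docs[d][i]
    scores = []
    for t in range(K):
        # x: proportion of the other words in document d assigned to topic t.
        in_doc = sum(1 for j, tj in enumerate(assignments[d])
                     if j != i and tj == t)
        x = in_doc / max(len(docs[d]) - 1, 1)
        # y: of all assignments to topic t across every document,
        # what proportion are this word?
        t_total = sum(1 for dd in range(len(docs))
                      for j, tj in enumerate(assignments[dd])
                      if tj == t and not (dd == d and j == i))
        t_word = sum(1 for dd in range(len(docs))
                     for j, tj in enumerate(assignments[dd])
                     if tj == t and docs[dd][j] == word
                     and not (dd == d and j == i))
        y = t_word / t_total if t_total else 0.0
        # P(topic t generated this word) is proportional to x * y.
        scores.append(x * y)
    if sum(scores) > 0:
        # Draw a new topic for this word in proportion to the scores;
        # otherwise keep the current assignment (smoothing would avoid this case).
        assignments[d][i] = random.choices(range(K), weights=scores)[0]

# Repeat many passes; the assignments drift toward a rough steady state.
for _ in range(100):
    for d in range(len(docs)):
        for i in range(len(docs[d])):
            resample(d, i)

print(assignments)
```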
But when it does come out clean, here's the thing about LDA topic modeling: most people will take this method and say, "I just ran LDA topic modeling, and now I'm done." That's just not true, because a human still needs to take the step of looking at all the words that are clumped together, clustered together (in this case data, number, and computer), and decide what topic is being discussed. I would say technology. You might say something different; you might say computers. The human makes that call. Or if the words brain, neuron, and nerve come up clustered together, then I might say the topic is the nervous system. That, in general, is how LDA topic modeling works. What we'll do now is go into the Social Media Macroscope again, within the SMILE tool, and actually apply LDA topic modeling to social data.

We will now see a demonstration of LDA topic modeling from within the SMILE tool in the Social Media Macroscope. In the previous video, we pulled a data set, Nike, full of Nike tweets. At this point, we can just go to Analytics Tools at the top and click on "Topic Modeling." From here, just like in the previous steps of text preprocessing and automated phrase mining, we select the Nike data set, see a preview, and then choose a number of topics. With LDA topic modeling, one of the things you have to select at the beginning, a parameter of this method, is how many topics you believe are within the data set. This is a choice you can evaluate via two different scores: one is called the perplexity score, and the other is called the coherence score. We didn't go into those specific scores in the lecture I previously gave, and they're not the focus of this introductory LDA topic modeling exercise. For now, we will just choose five topics. We will choose the Gensim package, which you can read more about right here. Click on "Submit." Then, once again, just like with automated phrase mining, you enter your email address and click on "Submit." Once you do that, you wait, and after a while, once the job is finished, you will get an email saying that your topic modeling has finished. Then you can go to Past Results, and to Topic Modeling here on the left-hand side. I've tagged this run with topic, Nike, five topics. If I click on these results, once again we see some downloadable files with the various output results for this package. We also see a really nice interactive topic modeling visualization, where right here you can see the words that are part of Topic 1, and then Topic 2; as I click through these, you can see the words. Remember, in topic modeling the goal is to take these words that are statistically related to one another and try to discern what the topic is. It's not necessarily very discernible at first glance. For these five topics, you also get the perplexity score and the coherence score. Long story short, you could actually run an experiment with four topics, or six, seven, eight, and you'll get various perplexity and coherence scores. The way you choose the optimal number of topics is to graph the perplexity and coherence scores and figure out at what point those scores seem to level off. That's where you'll find the appropriate number of topics for your data set. But again, we're not going deep into the method of LDA topic modeling; this is just an introduction to the method and how to conduct it via the SMILE tool. Thank you.
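If you want to try this topic-count experiment outside of SMILE, a minimal sketch using the Gensim library might look like the following. Here `tokenized_tweets` is a placeholder for your own preprocessed data, and the score values you get will depend entirely on that data.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Placeholder corpus: in practice, load your preprocessed tweets as lists of tokens.
tokenized_tweets = [
    ["nike", "running", "shoes", "marathon"],
    ["nike", "swoosh", "ad", "basketball"],
    ["shoes", "sale", "running", "deal"],
]

dictionary = Dictionary(tokenized_tweets)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_tweets]

# Sweep several candidate topic counts and record both scores.
for k in range(2, 7):
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=k, passes=10, random_state=1)
    coherence = CoherenceModel(model=lda, texts=tokenized_tweets,
                               dictionary=dictionary,
                               coherence="c_v").get_coherence()
    log_perplexity = lda.log_perplexity(corpus)
    print(f"k={k}: coherence={coherence:.3f}, log perplexity={log_perplexity:.3f}")
```

Graphing these scores against k and looking for the point where they level off is the selection procedure described in the demonstration above.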