In this module, we will see a really useful model
called Latent Dirichlet Allocation.
It is used for topic modeling.
In this video, we will see what topic modeling is.
For example, suppose you want to build a recommender system
that recommends books.
If I have read the Sherlock Holmes books,
the system would recommend me,
for example, other detective books.
Let's see how we can do this.
We would like to extract features from the book.
For example, we would like the features to correspond to topics.
We could have a topic about detectives,
a topic about adventures, and one about horror.
We would try to decompose the book into these topics.
For example, the Sherlock Holmes book would be 60% detective,
30% adventure, and maybe 10% horror for the darker, more mysterious stories.
We can define a document as a distribution over topics.
For each document and each topic,
we assign the probability of meeting that topic in the document.
For example, here we meet the detective topic with
probability 0.6 and the adventure topic with probability 0.3.
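To make this concrete, here is a minimal sketch in Python; the topic names and probabilities are made up for illustration, not learned from data.

```python
# A document represented as a distribution over topics.
# The topic names and probabilities are illustrative, not learned from data.
sherlock_holmes = {
    "detective": 0.6,
    "adventure": 0.3,
    "horror": 0.1,
}

# A distribution must sum to one.
assert abs(sum(sherlock_holmes.values()) - 1.0) < 1e-9
```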
Now, let's see what topics are.
For example, if we have a topic related to sports,
we would expect words like football, hockey,
golf, score, and so on,
to appear most often in this topic.
For a topic about the economy, we expect money,
dollar, euro, bank, and so on,
to be the most popular words, and finally,
for politics, we'll have president, USA, union,
law, you name it.
Let's see how we can use these topics to generate a text.
For example, we want to generate the sentence
"football player from USA has salary in dollars."
The word football is clearly from the sports topic,
the word USA is from the politics topic, and the word dollars is from the economy topic.
We will define a topic
as a distribution over words.
For each word in the vocabulary,
we assign the probability of meeting this exact word.
For example, here the word football in the sports topic
appears in 20 percent of cases, and all other words have lower probability.
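As a small sketch, a topic can be stored the same way, as word probabilities, and generating a word simply means sampling from this distribution; the words and numbers below are illustrative only.

```python
import random

# A toy "sports" topic: probabilities of meeting each word.
# In a real model the distribution covers the whole vocabulary;
# here only a few words are shown, so the weights do not sum to one.
sports_topic = {
    "football": 0.20,
    "hockey": 0.15,
    "golf": 0.10,
    "score": 0.08,
}

# Generating words from the topic means sampling from this distribution
# (random.choices renormalizes the weights internally).
words = list(sports_topic)
weights = list(sports_topic.values())
print(random.choices(words, weights=weights, k=5))
```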
This definition is actually useful for interpreting topics.
The method that we will see does not generate labels for the topics;
it only generates the distributions over words.
If we look at the most frequent words,
we will be able to assign labels to the topics ourselves.
For example, even if I hadn't told you that the first list of words comes from the sports topic,
you could clearly say that since the words football and hockey are very frequent,
the topic is probably about sports.
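A hypothetical helper like the one below makes this concrete: it just sorts a topic's words by probability and prints the top few, which is usually enough for a human to pick a label.

```python
# A hypothetical helper that shows the most probable words of a topic,
# so a human can assign it a label such as "sports".
def top_words(topic, k=3):
    return sorted(topic, key=topic.get, reverse=True)[:k]

sports_topic = {"football": 0.20, "hockey": 0.15, "golf": 0.10, "score": 0.08}
print(top_words(sports_topic))  # ['football', 'hockey', 'golf']
```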
After we find the topics in a document,
we will compute some similarity measure to say whether
a book is similar to the one that you have read, or not.
We will assign a vector to each book.
For example, for the Sherlock Holmes book,
we will have a vector with 0.6, 0.3,
and 0.1, corresponding to the probabilities
of the topics in this document. We will call this vector a.
In a similar way, we can compute the vector for the second book.
Now that we have two vectors,
we can compute the distance or the similarity between them.
We can use, for example, the Euclidean distance or the cosine similarity.
What we do next is rank the books according to their similarity
and recommend the most similar ones.
In our case, for people who have read the Sherlock Holmes book,
we will recommend another detective book, for example.
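Here is a minimal sketch of this step with NumPy; the book titles other than Sherlock Holmes and all topic vectors are made-up placeholders.

```python
import numpy as np

# Topic vectors for each book (detective, adventure, horror); all numbers
# and the titles other than "Sherlock Holmes" are made up for illustration.
books = {
    "Sherlock Holmes": np.array([0.6, 0.3, 0.1]),
    "Book B": np.array([0.7, 0.2, 0.1]),
    "Book C": np.array([0.1, 0.2, 0.7]),
}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank the other books by similarity to the one the user has already read.
query = books["Sherlock Holmes"]
ranking = sorted(
    ((title, cosine_similarity(query, vec))
     for title, vec in books.items() if title != "Sherlock Holmes"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)  # the most similar book comes first
```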
We have two goals.
The first is to construct topics.
So, from the collection of the documents,
we want to find which topics are present in them.
We want to do this automatically,
in a fully unsupervised way.
Just from a collection of texts,
we want to find the topics
and the word probabilities in them.
This is our first goal,
and the second goal is to assign topics to the texts.
We would like to decompose an arbitrary book into a distribution over topics.
For example, here we decompose the Sherlock Holmes book
into three topics with such probabilities.
This is exactly what we will do throughout this module.
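As a preview, both goals can be sketched with scikit-learn's LatentDirichletAllocation on a toy corpus; this is only an illustration of the interface under the assumption that scikit-learn is available, not the method we will derive in this module.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus; in practice this would be a large collection of books.
corpus = [
    "football player from usa has salary in dollars",
    "the bank raised the euro exchange rate",
    "the president signed a new law",
]

counts = CountVectorizer().fit_transform(corpus)            # document-term matrix
lda = LatentDirichletAllocation(n_components=3, random_state=0)

doc_topics = lda.fit_transform(counts)  # goal 2: a topic distribution per document
topic_words = lda.components_           # goal 1: word weights per topic

print(doc_topics.shape)   # (number of documents, number of topics)
print(topic_words.shape)  # (number of topics, vocabulary size)
```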