Hi there. In this lesson,

you will study topic modeling.

What is topic modeling?

It is a form of unsupervised learning,

which means that our documents do not have any labels.

Topic modeling is dedicated to discovering latent topics in a collection of documents.

In contrast to supervised learning,

we are going to infer an internal structure of our documents.

There are several main applications of topic modeling.

It is used for document classification, document clustering,

visualization, and for building text recommender systems.

The input to topic modeling is,

for example, a set of news web pages on the Internet.

And one possible problem which we are going to solve

is the automatic clustering of these web pages.

For example, here are some fictional clusters like entertainment,

tech, sport, and so on.

And I want to emphasize here that topic modeling allows us to

do this clustering without labels, in a completely unsupervised fashion.

Here are some basic assumptions and definitions of topic modeling.

Let w be a word,

d a document, t a topic.

We assume that the number of topics is fixed,

and we will denote it by capital T. And we assume here

that each topic t may generate a word w with some probability.

We also assume that each document d contains a topic t with some other probability.

These are very basic probabilistic assumptions of topic modeling.

And I will explain these assumptions a bit later using some examples.
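To make these two assumptions concrete, here is a minimal sketch in Python. The topic names, words, and probabilities are hypothetical numbers chosen only for illustration. The sketch also assumes the usual mixture form, in which the probability of a word in a document is the sum over topics of p(w | t) times p(t | d).

```python
# Hypothetical toy numbers, chosen only for illustration.
# phi[t][w] = p(w | t): probability that topic t generates word w.
phi = {
    "genetics": {"gene": 0.6, "dna": 0.3, "computer": 0.1},
    "computing": {"gene": 0.1, "dna": 0.1, "computer": 0.8},
}

# theta_d[t] = p(t | d): probability of topic t in one document d.
theta_d = {"genetics": 0.7, "computing": 0.3}


def word_probability(word, phi, theta_d):
    """Assumed mixture form: p(w | d) = sum over t of p(w | t) * p(t | d)."""
    return sum(phi[t].get(word, 0.0) * theta_d[t] for t in theta_d)


vocabulary = ["gene", "dna", "computer"]
p_w_given_d = {w: word_probability(w, phi, theta_d) for w in vocabulary}
print(p_w_given_d)
```

Because each topic is a probability distribution over words and the document is a probability distribution over topics, the resulting word probabilities for the document also sum to one.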

Two of the most common methods for topic modeling are LDA and PLSA.

Of these two, only LDA is implemented in Spark ML.

Here is an example from a research paper by Professor David Blei,

who is one of the inventors of topic modeling.

In this research, David Blei used the document collection which

contained 17,000 articles from the journal Science,

and he applied topic modeling with one hundred topics to these articles.

Let's see what happens here in the journal's article taken as an example.

This document has a list of latent topics, and each word is generated from some topic.

For example, this document is called Seeking Life's Bare (Genetic) Necessities,

and the topics of this article are shown in this slide on the left.

The first topic is about genetics.

The second topic is about life and evolution.

The third topic is about the brain and the nervous system,

and the last topic,

which is shown in blue, is about computers.

I want to emphasize here that this method does not produce interpretable names of topics,

because this method is completely unsupervised.

Okay, one more example from this research.

Here are four topics, and

for each topic we have selected

the five words with the maximum probability of being generated from this topic.

Again, the names of topics are not created by this algorithm.

And you can see that the first topic describes the human genome,

the second topic is about evolution,

the third topic is about diseases, hosts,

and bacteria, and the last topic is about computers.
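Selecting the most probable words for a topic, as described above, can be sketched in a few lines of Python. The word probabilities below are hypothetical and are not taken from the research paper. The function name top_words is also an assumption made for this illustration.

```python
import heapq

# Hypothetical p(w | t) values for one topic, for illustration only.
topic_word_probs = {
    "computer": 0.30,
    "data": 0.25,
    "model": 0.15,
    "number": 0.12,
    "information": 0.10,
    "gene": 0.05,
    "life": 0.03,
}


def top_words(word_probs, k=5):
    """Return the k words with the highest probability under this topic."""
    return heapq.nlargest(k, word_probs, key=word_probs.get)


top5 = top_words(topic_word_probs)
print(top5)
```

The algorithm returns only such ranked word lists. A name like "computers" is something a person assigns after reading the list.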

Okay, what are document topics?

A document's topics are the distribution of topics within that document.

We assume that each document is a mix of some basic ideas.

And here is the topic distribution for this particular research paper.

Finally, I want to say a couple of words about topic model learning.

It is quite complex,

and here is a sketch of the learning procedure.

This procedure involves two matrices, phi and theta.

The phi matrix holds the distribution of words for each topic.

And the theta matrix holds the distribution of topics for each document.

By n_dw we denote the number of times that word w occurs in document d.

And by n_d we denote the length of document d. It means that n_dw

divided by n_d is the frequency of a particular word w in document

d. We approximate this frequency with the product of two matrices, phi and theta.

You may think of topic modeling as a method of matrix decomposition.
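The matrix view can be illustrated with a small Python sketch. The documents, the vocabulary, and the phi and theta values are all hypothetical. In particular, phi and theta are picked by hand here, not learned by PLSA or LDA, so this only shows what the approximation looks like and not how it is found.

```python
from collections import Counter

# Hypothetical toy collection, for illustration only.
vocabulary = ["gene", "dna", "computer"]
documents = [
    ["gene", "dna", "gene", "gene", "dna"],
    ["computer", "computer", "gene", "computer"],
]


def word_frequencies(doc, vocabulary):
    """Return n_dw / n_d for every word w in the vocabulary."""
    n_dw = Counter(doc)  # number of times each word occurs in the document
    n_d = len(doc)       # length of the document
    return [n_dw[w] / n_d for w in vocabulary]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# freq[w][d]: rows are words, columns are documents.
freq = [list(row) for row in zip(*(word_frequencies(d, vocabulary) for d in documents))]

# phi[w][t] = p(w | t): each column (topic) sums to 1. Hand-picked, not learned.
phi = [
    [0.6, 0.1],
    [0.4, 0.0],
    [0.0, 0.9],
]

# theta[t][d] = p(t | d): each column (document) sums to 1. Hand-picked, not learned.
theta = [
    [1.0, 0.2],
    [0.0, 0.8],
]

approx = matmul(phi, theta)

for w, word in enumerate(vocabulary):
    print(word, "observed:", freq[w], "phi * theta:", [round(x, 3) for x in approx[w]])
```

With these numbers the first document is reproduced exactly, while the second is only approximated. A learning procedure would adjust phi and theta to make the product as close as possible to the observed frequencies.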

This is a very basic explanation of how topic modeling is done.

In this lesson, you have learned what topic modeling is.

You can identify the applied problems where topic modeling may be useful.

Topic modeling discovers latent topics in collections of documents.

Document labels are not required.

The two main algorithms of topic modeling are PLSA and LDA.

Each topic has a list of its most typical words,

and each document has a list of its topics.