Welcome to week two of Practical Bayesian Methods. I'm Alexander Novikov, and this week we're going to cover latent variable models. So what are latent variables? Why do we need them? And how do we apply them to real problems? The second topic for this week is the expectation maximization algorithm, which is a key topic of our course: it is a method for training latent variable models. We will see numerous extensions of the expectation maximization algorithm in the following weeks.

So, let's get started with latent variable models. A latent variable is just a random variable which is unobserved, neither during training nor during testing. "Latent" simply means hidden in Latin. As an example, some phenomena like height, length, or maybe speed can be measured directly, and some others cannot, for example intelligence or altruism. You can't just measure altruism on some quantitative scale. Such variables are usually called latent.

To motivate why we need to introduce this concept into probabilistic modeling, let's consider the following example. Say you have an IT company and you want to hire an employee, so you have a bunch of candidates, and for each candidate you have some data. For example, for all of them you have their average high school grades, for some of them you have their university grades, and maybe some of them took IQ tests and so on. You also conducted a phone screening interview: your HR manager called each of them and asked a bunch of simple questions to make sure they understand what your company is about.

Now you want to bring these people onsite for an actual technical interview, but the problem is that you have too many candidates. You can't invite all of them because it's expensive: you have to pay for their flights, their hotels, and so on. So a natural idea arises: let's predict the onsite interview performance for each of them and bring in only those who are predicted to be good enough. So how do we predict who will be a good fit for our company? Well, if you have been in the business for a while, you may have some historical data. For a bunch of other people, you know their features, like their grades and IQ scores, and you know their onsite performance because you have already conducted those interviews. Now you have a standard regression problem: you have a training dataset of historical data, and for new people you want to predict their onsite performance and bring to the onsite interview only those whose predicted performance is good.

However, there are two main problems that prevent us from applying standard regression methods from machine learning here. First of all, we have missing values. For example, we don't know the university grades for everyone, because Jack didn't attend university. That doesn't mean he is not a good fit for your company; maybe he is, he just never bothered to attend one. So we don't want to ignore Jack, we want to predict some meaningful onsite interview performance score for him anyway. The second reason why we don't want to use standard regression methods like linear regression or neural networks is that we may want to quantify the uncertainty in our predictions. Imagine that for some people we predict that their performance is really good, and we certainly want to bring them onsite and maybe even want to hire them right away.
But for others, the predicted performance is not as good. And for someone, the predicted performance could be, for example, 50 out of 100, which may mean that this person is not a good fit for your company. But it may also mean that we're just not sure about him: we don't know anything about him, we asked the algorithm to predict his performance, and it returned some number, but that number doesn't mean much. So in this case, we may want to quantify the uncertainty of the algorithm's predictions. If the algorithm is quite sure that this person will perform at a level of, say, 50 out of 100, then we may not want to bring him onsite. On the other hand, if some other guy's predicted performance is also 50 but we're really uncertain about it, then we may want to bring him in anyway: maybe we just don't know anything about him, and he may be good after all. The reason for this uncertainty may be, for example, that he has lots of missing values, or that his data is a little bit contradictory, or that our algorithm just isn't used to seeing people like him.

These two reasons, having missing values and wanting to quantify uncertainty, bring us to the need for probabilistic modeling of the data. As we discussed in week one, one of the usual ways to build a probabilistic model is to start by drawing some random variables and then understanding what the connections between these random variables are: which random variables correlate with each other in some way. In this particular case, it looks like everything is connected to everything. If a person's university grades are high, that directly influences our beliefs about his high school grades or his IQ score, and this is true for any pair of variables here. The situation where we have all possible edges, where everything is connected to everything, means that we have failed to capture the structure of our probabilistic model. We end up with the most flexible and the least structured model that we can possibly have.

In this situation, to build a probabilistic model of our data, we have to assign a probability to each possible combination of our features. There are exponentially many combinations of different university grades, different IQ scores and so on, and for each of them we have to assign a probability. This table of probabilities has billions of entries, and it is just impractical to treat these probabilities as parameters. So we have to do something else. We can always assume some parametric model, right? We can say that we have these five random variables and that the probability of any combination of them is some simple function, for example the exponent of a linear function divided by a normalization constant. In this case, we reduce the model complexity by a lot: now we have just five parameters which we want to train. But the problem here is the normalization constant. To normalize this thing, so that it is a proper probability and sums up to one, the normalization constant has to be the sum over all possible configurations. And this is a gigantic sum: we have to consider all billions of possible configurations to compute it, and this means that training and inference will be impractical.
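To make this concrete, here is a minimal sketch, not from the lecture, of such a fully connected model: the unnormalized probability exp(w · x) is cheap to evaluate for any single candidate, but the normalization constant Z requires summing it over every configuration of the five features. The weights and the value ranges below are made-up illustrative assumptions.

```python
# A toy version of the "everything connected to everything" model discussed
# above: p(x) is proportional to exp(w . x) for a vector x of five features.
# The weights and the 1..100 value range are illustrative assumptions.
import itertools
import numpy as np

n_values = 100                                   # say each feature takes one of 100 values
w = np.array([0.01, 0.02, 0.03, 0.01, 0.02])     # the five parameters of the model

def unnormalized_p(x):
    """exp(w . x) -- cheap to evaluate for any single configuration x."""
    return np.exp(w @ np.asarray(x))

print(n_values ** 5)   # 10,000,000,000 configurations appear in the sum for Z

# Computing Z exactly means summing unnormalized_p over all of them, which is
# impractical; here it is shown only for a toy range of 3 values per feature.
Z_toy = sum(unnormalized_p(x)
            for x in itertools.product(range(3), repeat=5))
print(Z_toy)
```

Evaluating the model for one candidate is easy; it is the normalization over all configurations that blows up, which is exactly the problem described above.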
So what else can we do here? Well, it turns out that you can introduce a new variable, one that you don't actually observe in your data, called intelligence. You can assume that each person has some internal, hidden property, which we will call intelligence and, for example, measure on a scale from 1 to 100. This intelligence directly causes each of the observed features: the IQ score, the university grades and so on. Of course, this connection is non-deterministic: an intelligent person can have a bad day and do poorly on a test. But the causation is direct, so intelligence directly causes all these observations. And if we assume such a model, then we reduce the model complexity by a lot: we removed lots of edges, and now our model is much simpler to work with.

So now we can write our probabilistic model by using the sum rule of probabilities: the probability of the observed features is the sum, over all possible values of intelligence, of the conditional probability of the features given the intelligence times the prior probability of the intelligence. And this conditional probability factorizes into a product of small probabilities because of the structure of our model. So now, instead of one huge table with all the combinations of the five different features, we have just five small tables, each assigning probabilities to a pair like IQ score given intelligence. This means that we are able to reduce the model complexity without reducing the flexibility of the model (a small numerical sketch of this factorization is given at the end of this section).

To summarize, introducing latent variables may simplify our model: it can reduce the number of edges we have and, as a consequence, reduce the number of parameters. Another positive feature of latent variables is that they are sometimes interpretable. For example, with this intelligence variable, for a new person we can estimate his intelligence on the scale from 1 to 100, and it may come out as, say, 80. What does that mean? Well, it's not obvious, because you don't know what the scale means, and you're not even sure that this variable measures actual intelligence: you never told your model that this variable should be intelligence, you just said that there should be some variable here. But anyway, this variable can be interpretable, and you can compare the intelligence of different people in your dataset according to this scale.

A downside of latent variable models is that they can be harder to work with: to train a latent variable model, you have to rely on quite a lot of math. And this math is what this week is all about. So in the next videos, we will discuss methods for training latent variable models.
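As mentioned above, here is a small numerical sketch of the factorization p(x1, ..., x5) = sum over I of p(I) · p(x1 | I) · ... · p(x5 | I), where I is the latent intelligence variable. The table sizes and the randomly generated probabilities are illustrative assumptions, not data from the lecture.

```python
# A toy version of the latent variable model discussed above:
# p(x1,...,x5) = sum_I p(I) * p(x1|I) * ... * p(x5|I),
# with a latent "intelligence" variable I on a 1..100 scale.
import numpy as np

n_I, n_x = 100, 100                 # sizes of the latent variable and each feature

rng = np.random.default_rng(0)
p_I = rng.dirichlet(np.ones(n_I))   # prior p(I), just 100 numbers

# Five small conditional tables p(x_i | I), each 100 x 100, filled with random
# probabilities here purely for illustration.
cond_tables = [rng.dirichlet(np.ones(n_x), size=n_I) for _ in range(5)]

def joint_prob(x):
    """Marginal probability of the five observed features x = (x1,...,x5)."""
    per_I = p_I.copy()                      # start from the prior over intelligence
    for table, xi in zip(cond_tables, x):
        per_I *= table[:, xi]               # multiply in p(x_i | I) for every value of I
    return per_I.sum()                      # sum the latent intelligence out

print(joint_prob((10, 42, 7, 99, 3)))
```

The parameter count is the point: five 100-by-100 tables plus a prior of 100 numbers is roughly 50,000 values, instead of the ten billion entries a full joint table over the five features would need.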