Welcome back to the course. By this time, you've probably done some great things with value-based methods. You've studied how to apply Q-learning, maybe value iteration before that, and the honor track has some bonus algorithms as usual. The idea behind value-based methods, which is easy to explain intuitively, is that to find an optimal action, you want your algorithm to learn how much discounted reward it thinks it will get if it starts from this state, maybe takes this action, and then follows its policy, or the optimal policy in the case of Q-learning.

Now, this approach is very useful, provided that you know this function. But today, we're going to study another family of methods that try to explicitly avoid learning the Q or V or any other value-related function. To understand why, let's use a simple example. The next slide will contain a simple question. It doesn't have any math to it, it's a question that a toddler could answer, so I want you to answer as quickly as possible. You have this Breakout game here, and the question is: what action do you take, left or right? Quick. Obviously, right in this case, because it comes intuitively. But now, let's answer another question. There will be another question on the next slide, and this one will be slightly more mathematical, but it's what your reinforcement learning algorithm has to learn. Quickly, answer now: what's the Q-function of this state and the action "right"? You have 10 seconds. Well, I'm still waiting. It's a bit harder than the previous one, isn't it?

So the idea here is that it's really easy for you to understand what to do, but Q-learning doesn't explicitly learn what to do. It instead tries to learn what kind of value, what Q-function, you'll get if you do this, and that's hard, especially if you apply it to everyday problems. Let's say you have a very simple problem of whether or not to go for coffee: you can make yourself a coffee in another room. You can either go there, drink a coffee and then proceed, or you can stay here and avoid drinking coffee. What you actually do is, you either feel like doing this or you feel like you don't, and this is very simple. What Q-learning has to do is learn the value of your entire life from this moment to the end: it tries to add up all the rewards you're going to get this day, the next day, the day after that, with some gamma-related coefficients, and this is rather impractical, especially because it requires you to predict what's going to happen in the future.

So when I say difficult, I actually mean that it's not only difficult for you to add up rewards with gammas, it's also difficult for a neural network to approximate, or, for that matter, any other algorithm. You have your DQN, or something related to DQN, trying to learn the game of Breakout, or whether it wants to drink a cup of coffee. You actually have a squared error minimization problem under the hood. What it tries to do is minimize the squared error between the predicted Q-function and the temporal-difference-corrected Q-function, which is on the right here: basically, it's reward plus gamma times whatever comes next. And if you remember last week, we actually considered this last term, reward plus gamma times whatever, to be a constant, not dependent on the parameters of the Q-function.
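To make that objective concrete, here is a minimal NumPy sketch of the squared TD error just described, assuming a one-step Q-learning target that is held fixed during the minimization. The variable names and the toy batch are made up for illustration, not taken from the course code.

```python
# A minimal sketch (not the course's reference code) of the one-step Q-learning
# objective: minimize the squared error between the prediction Q(s, a) and the
# TD target r + gamma * max_a' Q(s', a'), where the target is treated as a
# constant (no gradient flows through it).
import numpy as np

def td_targets(rewards, next_q_values, is_done, gamma=0.99):
    """r + gamma * max_a' Q(s', a'), with the gamma term zeroed on terminal steps."""
    return rewards + gamma * next_q_values.max(axis=1) * (1.0 - is_done)

def td_loss(predicted_q_for_actions, targets):
    """Mean squared TD error; `targets` is held fixed while minimizing."""
    return np.mean((predicted_q_for_actions - targets) ** 2)

# toy batch: 2 transitions, 2 actions
rewards       = np.array([1.0, 0.0])
next_q_values = np.array([[0.5, 2.0],   # Q(s', a') for each next state
                          [1.0, 3.0]])
is_done       = np.array([0.0, 1.0])
predicted     = np.array([2.5, 0.2])    # Q(s, a) for the actions actually taken

print(td_loss(predicted, td_targets(rewards, next_q_values, is_done)))
```

In a deep learning framework, the same thing is done by detaching the target from the computation graph before taking the gradient step.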
When it comes to real-world applications, your neural network's size is usually insufficient, because otherwise it would train for ages. What this actually means is that your neural network will never be able to approximate all Q-values with zero error, so it will have some approximation error, and what it actually does is find the approximation that minimizes the loss function, the mean squared error in this case.

So now, imagine that you have two possible Q-functions, two possible outcomes of your learning, and you are comparing them on two different states, S0 and S1. In those two states, you have two actions, A0 and A1. Let's imagine that you only have two actions, and that on all other states your networks are identical; this is just for simplification. The first column here is the true Q-values, the Q-values that your agent is actually going to get if it takes this particular action and then follows its policy, or the optimal one. The first two rows correspond to S0: in this state, the first action brings you a true Q-value of one, and the second brings you two. Then you have the second state, S1, where the first action brings you three and the second one brings you 100. This is not very realistic, but it serves the point of explaining the idea.

So you have two networks, two possible approximations. The first approximation, network A, is exact on the first state: it captures the Q-values of one and two exactly. On the second state, it gets the first action right, but it fails to grasp the actual Q-value of the second action. Then you have the second possible outcome of learning, network B. In this case, the second state is approximated ideally, but the first state, S0, has its Q-values slightly off, with an error of plus one and minus one there.

The question is, which of those two Q-functions would you prefer? Which of them would get a better average reward per session, or which of them will take the optimal action more frequently? You've probably noticed that option A, the network under the letter A, has this neat property: although it gives you some error, the greedy action, the argmax, is the same for this Q-network and for the optimal Q-function. Despite being off on (S1, A1), it will in fact find the optimal policy here, unless of course there are some other states we have not considered. The second network, network B, doesn't have this property: although its error is very small, in S0 it will take the sub-optimal action. So it's almost sure to be slightly worse. To conclude, network A should be better.

Now let's pose another question. You have this squared error minimization problem. Which of those two options, A or B, will your network prefer when minimizing the squared error? Which of them has the smaller squared error? Yes, right, it's the other way around. Option A has a squared error of 50 squared, which is 2,500, and option B only has a squared error of 2: plus one squared and minus one squared added up. Now, this is the problem.
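Here is a tiny numeric check of this example. The true Q-values and the plus one/minus one errors of network B come from the table above; the concrete value network A predicts for (S1, A1) is my assumption, chosen so that its error is 50 as implied by the "50 squared" figure.

```python
# Toy comparison of the two approximations from the slide.
import numpy as np

q_true = np.array([[1.0,   2.0],    # state S0, actions A0/A1
                   [3.0, 100.0]])   # state S1
q_a    = np.array([[1.0,   2.0],
                   [3.0,  50.0]])   # exact on S0, off by 50 on (S1, A1) -- assumed value
q_b    = np.array([[2.0,   1.0],    # off by +1 / -1 on S0
                   [3.0, 100.0]])   # exact on S1

for name, q in [("A", q_a), ("B", q_b)]:
    squared_error = ((q - q_true) ** 2).sum()
    same_policy = (q.argmax(axis=1) == q_true.argmax(axis=1)).all()
    print(f"network {name}: squared error = {squared_error}, greedy policy matches optimal: {same_policy}")

# network A: squared error = 2500.0, greedy policy matches optimal: True
# network B: squared error = 2.0, greedy policy matches optimal: False
```

The network with the better policy has the worse loss, which is exactly the mismatch the lecture is pointing at.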
The problem is that your DQN, while trying to minimize the squared error, will move away from the optimal policy and converge to something different, which, although more accurate in the sense of the loss function, is not what you actually want: you want to play optimally. And besides, approximating the Q-function is really hard, because, remember the coffee example, you have to predict what happens next. So what if you could avoid learning this Q-function or V-function, or simply ignore it, and try to approximate something else? In this case, why don't we just learn the policy? We want the optimal policy, we want to take optimal actions, so why not directly learn the optimal policy distribution?

During earlier weeks of our course, we did have one method, or a few methods if you took the honor track, that fits this description of not learning a Q or V function. Instead, it explicitly learns the probability of taking an action in a state, either with a table or with some approximation. Now, what kind of algorithm works this way? Yes, this is the cross-entropy method. What it does is simply play some games with its current policy, then minimize the cross-entropy between the policy and the actions taken in the games that scored better than average, or better than some percentile. We'll cover this in more detail in just a few slides.

Now, this algorithm had a lot of drawbacks. It required a lot of samples, it discarded a lot of samples, and it had issues that you would solve with temporal difference methods and with Q-learning, but it also had this neat efficiency to it: it learned rather complicated environments really quickly, provided, of course, you could sample a lot of sessions in them. The reason the cross-entropy method is so efficient, given that it has a lot of samples, is that it solves a simpler problem: it doesn't try to approximate the Q-function, it simply tries to learn what to do.

So imagine you're in front of a tiger and you have to decide whether you want to pet the tiger, run from the tiger, provoke the tiger, or ignore the tiger. What Q-learning does is try to estimate the quality of your entire life given each of those choices, and do some calculations; only then does it pick the action which maximizes the expected quality of life. But in reality, before you finish doing that, the tiger is going to eat you. What the cross-entropy method does is simply try to learn the probabilities of actions, which is the thing you should actually learn if you don't want to get eaten. Luckily, we humans don't need to compute a Q-function every time we want to make a decision.
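To make the recipe concrete, here is a minimal sketch of the cross-entropy method update for a discrete, tabular policy: keep the sessions whose total reward beats a percentile, then refit the policy to the state-action pairs of those elite sessions. The function names and the fallback-to-uniform detail are my own choices under these assumptions, not the course's reference implementation.

```python
# A minimal sketch of the tabular cross-entropy method update described above.
import numpy as np

def select_elites(states_batch, actions_batch, rewards_batch, percentile=50):
    """Keep state-action pairs from sessions whose total reward beats the percentile."""
    threshold = np.percentile(rewards_batch, percentile)
    elite_states, elite_actions = [], []
    for states, actions, total_reward in zip(states_batch, actions_batch, rewards_batch):
        if total_reward >= threshold:
            elite_states.extend(states)
            elite_actions.extend(actions)
    return elite_states, elite_actions

def update_policy(elite_states, elite_actions, n_states, n_actions):
    """New policy = empirical action frequencies on elite sessions
    (this is what minimizing cross-entropy against the elites gives you)."""
    counts = np.zeros((n_states, n_actions))
    for s, a in zip(elite_states, elite_actions):
        counts[s, a] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # states never visited in elite sessions fall back to a uniform policy
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), 1.0 / n_actions)
```

In the full loop you would alternate: sample a batch of sessions with the current policy, select the elites, update the policy, and repeat until the average reward stops improving.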