[MUSIC] In this video, we are going to talk about Monte Carlo Tree Search, a method which is practical and useful for many purposes. Exactly this algorithm was the crucial component of AlphaGo, the first program that beat the best human players in the board game Go. By now, you already know why we need to explore in reinforcement learning. Put simply, exploration is needed to find initially unknown actions and states that lead to very large rewards. And indeed, in a model-free setting, we don't know how the environment will respond to our actions and thus don't know the rewards. However, in a model-based setting, the situation is a little bit different. Let me quickly remind you of the differences between the model-free and model-based settings. The key difference lies in their names. In a model-free setting, we know nothing about the environment dynamics, that is, about the probabilities of the next state and reward given the current state and action. In a model-based setting, however, we are given a model of the world in one of two forms: either a distributional model or a sample model. A distributional model, also called an explicit model, provides access to the whole distribution of the next state of the world given the current state and action. It does so in the form of explicit probabilities. Such a distributional model allows access to the explicit probabilities of all the possible next states s' and rewards r for any state and action an agent might currently be in. The environment can also be given in the form of a sample model. From such a model, we can obtain only samples of the next state and reward given the current state and action, but we do not know the explicit probabilities behind these samples. The next state and reward are sampled according to the same distribution that a distributional model would give, but in the case of a sample model, we are simply not allowed to see what these probabilities are.
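To make the distinction concrete, here is a minimal sketch in Python. The tiny two-state MDP, its transition table `P`, and the two function names are all made up for illustration; the point is only the difference in what each kind of model exposes.

```python
import random

# Hypothetical toy MDP used only to illustrate the two kinds of models.
# P[(s, a)] is a list of (probability, next_state, reward) triples.
P = {
    ("s0", "left"):  [(0.8, "s0", 0.0), (0.2, "s1", 1.0)],
    ("s0", "right"): [(1.0, "s1", 0.0)],
    ("s1", "left"):  [(1.0, "s0", 0.0)],
    ("s1", "right"): [(0.5, "s0", 0.0), (0.5, "s1", 2.0)],
}

def distributional_model(s, a):
    """Explicit model: returns the full distribution over (s', r)."""
    return P[(s, a)]

def sample_model(s, a, rng=random):
    """Generative model: returns one sample of (s', r);
    the probabilities themselves stay hidden from the caller."""
    outcomes = P[(s, a)]
    probs = [p for p, _, _ in outcomes]
    _, s_next, r = rng.choices(outcomes, weights=probs, k=1)[0]
    return s_next, r
```

Note that `sample_model` could be implemented on top of `distributional_model`, but not the other way around: samples alone do not reveal the probabilities.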
A sample model is also called a generative model in the scientific literature. Now think a little bit about exploration in a model-based setting. Do we need any if we already know how the world will respond to any action in any state? In model-based learning, we already know everything, don't we? Well, in fact, we don't. What we know in a model-based setting are the immediate rewards that follow an arbitrary action in an arbitrary state. But our decisions should not be grounded on immediate rewards, but rather on a global estimate of the return, that is, for example, on the value or action-value function. So the problem here is that being locally optimal, that is, optimal with respect to immediate rewards, does not imply being optimal globally. And we need to recover the global estimates of optimality, that is, the value or action-value function. The solution seems simple: just recover a globally optimal behavior by trying each action in each state and recording the return obtained by every policy. Well, in practice, such a brute-force approach is not feasible because of possibly large state and action spaces. Thus, more clever approaches are needed to solve this recovery problem given the model of the world. These approaches to finding an optimal solution to a sequential decision-making problem in a model-based setting as fast as possible are called planning algorithms. The word planning is a very overloaded word, so for the sake of clarity, we define planning as any process by which a model of the world is transformed into a policy. Note that a planning algorithm can work with either form of the environment model. This setup of planning may be very unfamiliar to you, but in fact it is not very different from what you have already learned. For example, any model-free algorithm you know can be used in a model-based setting by treating the model at hand as the true environment and letting an agent interact with it.
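A quick sketch of that last idea: below, tabular Q-learning, a model-free method, is run against a model of the world as if the model were the environment. The chain MDP, the behavior policy, and all constants are made up for illustration, not part of any specific algorithm from the lecture.

```python
import random
from collections import defaultdict

def model_step(s, a):
    """Deterministic toy model: chain 0..3, reward 1 for reaching state 3."""
    s_next = max(0, min(3, s + a))
    return s_next, (1.0 if s_next == 3 else 0.0), s_next == 3

def q_learning_on_model(episodes=2000, alpha=0.2, gamma=0.9, seed=0):
    """Q-learning that interacts with the model instead of a real environment."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        for _ in range(20):                      # step limit per episode
            a = rng.choice([-1, 1])              # random behaviour policy
            s_next, r, done = model_step(s, a)   # the model plays the environment
            best_next = 0.0 if done else max(Q[(s_next, -1)], Q[(s_next, 1)])
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
            if done:
                break
    return Q
```

After training, the greedy action in every non-terminal state is +1 (move toward the rewarding state), exactly as if the agent had learned from real interactions.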
Really, nothing forbids you to solve the problem with already familiar methods. Moreover, this approach to learning is actually used in many algorithms, especially in the ones which learn an approximate model of the world from interactions with the real world. Nevertheless, you should know that there are some special algorithms for planning which may be more efficient. Algorithms differ a lot depending on the type of planning, which may be one of background planning and decision-time planning. Background planning is an approach of sampling from a given environment model, either true or approximate, and learning from such samples with any model-free method. A good example of background planning is dynamic programming. For example, when you are doing value iteration, you cycle through the states to perform a backup in each state, and you are not focusing on any particular state. The second type of planning is decision-time planning. As its name suggests, algorithms from this category make decisions and perform planning simultaneously. To be precise, in this method, planning is started after the agent finds itself in a new state. This planning is performed to select an optimal decision now, in the current state. This focusing only on the current state, and not trying to improve the policy everywhere, distinguishes decision-time planning from background planning. Once the planning is over or the time is up, the agent commits to an action based on the planning result and makes a transition to the next state. Usually, but not always, no information is carried by the agent between successive states. That is, after an action is made, all the planning results may be thrown away completely. In both variants of planning, however, the main idea is the same: to obtain a perfect policy, we need to know the return that will follow after any action made in the state we are currently in. To obtain such knowledge, we, in one form or another, always do the same thing.
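The value iteration example of background planning can be sketched as follows. The sweep visits every state in turn, with no focus on any particular one; the toy MDP and its numbers are illustrative.

```python
# Value iteration as background planning: back up *every* state each sweep.
GAMMA = 0.9
STATES = [0, 1, 2, 3]          # state 3 is terminal
ACTIONS = [-1, +1]

def step(s, a):
    """Known deterministic model: chain 0..3, reward 1 for reaching 3."""
    s_next = max(0, min(3, s + a))
    return s_next, (1.0 if s_next == 3 else 0.0)

def value_iteration(tol=1e-8):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:                 # cycle through all states uniformly
            if s == 3:                   # terminal state keeps value 0
                continue
            backups = []
            for a in ACTIONS:
                s_next, r = step(s, a)
                cont = 0.0 if s_next == 3 else V[s_next]
                backups.append(r + GAMMA * cont)
            v_new = max(backups)         # Bellman optimality backup
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```

For this chain, the sweep converges to V(2) = 1, V(1) = 0.9, V(0) = 0.81, discounted by gamma per extra step from the reward.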
We unroll the world dynamics in time and accumulate the sequence of rewards into a single return. Planning algorithms differ mostly in how and when such unrolls are performed. Let's now discuss one of the simplest approaches to planning. Well, the simplest approach is to perform exhaustive search. That is, start from the current state and unroll the environment model for each possible action in this state, obtaining the set of new possible states and immediate rewards. Then, for each of the next possible states, again do a similar one-step unroll, and so on and so forth. If we were given enough computational resources, we could continue this process until every single branch of the tree reached one of the terminal states. Then we could perform dynamic programming: start from the very bottom of this backup tree and propagate the true value functions to the very root of the tree using the Bellman optimality equations. However, this algorithm is not practical, because it may require too much time to finish planning even for a single state. How can we do better? Well, the simplest way to improve this algorithm is to stop unrolling as soon as we have reached some prespecified depth, and then perform backups from the leaves at that depth up to the root. This method is a lot faster than the previous one; however, it is myopic, because it does not account for rewards that occur deeper than the depth we have stopped at. An obvious improvement for this algorithm is to compute an approximate value function at the leaves of such a tree. You can think of this as applying an approximate value function at the leaves and then propagating these approximate values backward to the top, again using the Bellman optimality equation. The good point is that this algorithm is a lot more practical. But can we improve it even further? Well, yes, in fact, we can. What we can improve is the selection of nodes which we unroll, or expand, on each iteration.
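A minimal sketch of this depth-limited lookahead: unroll the model for a fixed number of steps, plug an approximate value function in at the leaves, and back the values up with the Bellman optimality equation. The chain model and the crude leaf-value approximation are assumptions made up for this example.

```python
GAMMA = 0.9
ACTIONS = [-1, +1]

def model(s, a):
    """Deterministic toy model: chain 0..5, reward 1 on reaching state 5."""
    s_next = max(0, min(5, s + a))
    return s_next, (1.0 if s_next == 5 else 0.0)

def leaf_value(s):
    """Crude approximate value function plugged in at the leaves."""
    return s / 5.0

def lookahead(s, depth):
    """Depth-limited Bellman optimality backup of V(s)."""
    if s == 5:                       # terminal state: no future reward
        return 0.0
    if depth == 0:                   # stop unrolling, trust the approximation
        return leaf_value(s)
    values = []
    for a in ACTIONS:                # uniform expansion: every action, every node
        s_next, r = model(s, a)
        values.append(r + GAMMA * lookahead(s_next, depth - 1))
    return max(values)

def plan_action(s, depth=3):
    """Greedy action on top of the depth-limited values."""
    def q(a):
        s_next, r = model(s, a)
        return r + GAMMA * lookahead(s_next, depth - 1)
    return max(ACTIONS, key=q)
```

Note the cost of the uniform expansion: the tree has on the order of |A|^depth nodes, which is exactly what motivates the smarter node selection discussed next.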
In the previous algorithm, this expansion was uniform; that is, the expansion was made for each and every available action. However, we might not be interested in expanding branches that correspond to bad values, that is, in making actions leading to very bad returns. The more we expand a bad action, the more precise we make the value estimate of this action. But if an action is obviously bad and other actions look a lot better, we do not really need extra precision in the bad action's value estimate. Remember, we are always balancing between the precision of estimates and the computation time. Instead, we require precise estimates only for the best, or close to the best, actions. So to solve a planning task as fast as possible, we should expand first the actions which are good candidates to become the best ones. To bias the expansion in the direction of the most promising actions, we can use any function which correlates with the true returns that our policy is able to achieve. In the artificial intelligence community, such a function is usually called a heuristic function, and any planning method using such a heuristic function is called heuristic search. However, the heuristic function in artificial intelligence is usually hardcoded and never changes. For example, consider the task of planning a trip from one city to another through a lattice of roads. Each road connects two cities and has a cost of traversing it. A plan is needed to obtain the shortest possible path from the start city to the goal city. In such a task, the heuristic function may be, for example, the straight-line distance to the target city. It is surely impossible to drive in a straight-line fashion, but this heuristic somewhat correlates with the total driving distance from the start to the end, and thus may be a good guide for the search process. Heuristic search in the artificial intelligence community is an umbrella term unifying many algorithms built around the idea of lookahead planning guided by a heuristic function.
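The city-routing example corresponds to a classic heuristic search algorithm, A*, sketched below with straight-line distance as the hardcoded heuristic. The city coordinates and road costs are invented for illustration.

```python
import heapq
import math

# Tiny made-up road graph: city -> (x, y) coordinates, and symmetric roads
# whose costs are at least the straight-line distance (an admissible heuristic).
COORDS = {"A": (0, 0), "B": (2, 0), "C": (2, 2), "D": (4, 1)}
ROADS = {
    "A": [("B", 2.2), ("C", 3.1)],
    "B": [("A", 2.2), ("D", 2.5)],
    "C": [("A", 3.1), ("D", 2.4)],
    "D": [("B", 2.5), ("C", 2.4)],
}

def straight_line(u, v):
    """Hardcoded heuristic: straight-line distance between two cities."""
    (x1, y1), (x2, y2) = COORDS[u], COORDS[v]
    return math.hypot(x1 - x2, y1 - y2)

def a_star(start, goal):
    """Return (cost, path); the heuristic steers expansion toward the goal."""
    frontier = [(straight_line(start, goal), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for nxt, cost in ROADS[node]:
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(
                    frontier,
                    (g2 + straight_line(nxt, goal), g2, nxt, [*path, nxt]),
                )
    return float("inf"), []
```

Because the straight-line heuristic never overestimates the remaining road cost, A* is guaranteed to return the cheapest path while expanding fewer nodes than uninformed search.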
This function is used to prune unfruitful branches by estimating the total cost or worth that can be obtained from any single state onward. We are going to rely heavily on this idea of heuristic search, but we'll use it for the more specific purpose of planning in reinforcement learning. In reinforcement learning, the heuristic function corresponds to the value function, and unlike the usual heuristic function, we might want to refine our value function over time. Application of heuristic search to planning in reinforcement learning has many advantages. Basically, resources are focused only on valuable paths and on the nearest states, which contribute the most to the return. However, if we don't know the precise model of the world and approximate it with some sort of function approximation, the model can in fact be worse than the current value estimate. In this case, lookaheads based on the approximate model can spoil the learning and cause the estimates of a reliable value function to become less precise. Remember, it only makes sense to perform planning if the model of the world is more precise than the current value function estimates. So beware. Another disadvantage of using heuristic search is that it obviously depends on the quality of the heuristic. We will talk about it a little bit later. One way to obtain a heuristic is to estimate the returns with Monte Carlo. As I have previously said, if we limit the horizon of a lookahead search, we should estimate the value of possible continuations from the leaves onward. These leaf node values can be computed with the help of function approximation. We can learn the state values with function approximation as we did before in a model-free setting. However, before today, we were not allowed to use the model of the world, and now we can try to make use of such a model instead of relying on complex parametric function approximation.
More precisely, if we already know the true reward that follows an action in each and every state, probably the most natural approach is to use this information by accumulating the rewards along simulated trajectories. That is, we can estimate the value of any state or state-action pair with Monte Carlo returns. In particular, while time permits, we start from state s and send action a to our model of the world, obtain the next state and reward, and then from that state onward we play out the episode using a fixed policy until the end of the episode. Such a policy, used to estimate the value by Monte Carlo trajectory sampling, is called a rollout policy. This name stems from playing out the game from a particular state many, many times, that is, rolling the game out many times. By performing such rollouts many, many times, we can obtain a reliable estimate of the Q-function, simply as the average return obtained by our rollout policy from state s after making action a. The good point of Monte Carlo estimates is that we do not need to use any function approximation; it is sufficient to rely only on the model of the world at hand. Besides that, each rollout is totally independent of the others, so we can also perform many rollouts simultaneously on many CPU cores. However, the precision of the estimates obviously depends on the number of returns averaged. This is again an instance of the time versus accuracy trade-off. Another instance of the same trade-off emerges when we use such Monte Carlo estimates for action selection. So let us see how to select an action given Monte Carlo estimates of the return for each action. Probably the most straightforward idea is to select the action that maximizes the Monte Carlo value estimate, that is, to act greedily with respect to the estimates of the action-value function. Actually, this is not only straightforward but also strikingly similar to a value iteration step applied to our rollout policy and to a single state only.
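The rollout procedure and the greedy action selection on top of it can be sketched as below. The chain model, the uniformly random rollout policy, and all constants are assumptions for illustration only.

```python
import random

ACTIONS = [-1, +1]

def model_step(s, a):
    """Toy deterministic model: chain 0..4, reward 1 on reaching state 4."""
    s_next = max(0, min(4, s + a))
    return s_next, (1.0 if s_next == 4 else 0.0), s_next == 4

def rollout_policy(s, rng):
    """Fixed policy used for playouts — here, uniformly random."""
    return rng.choice(ACTIONS)

def q_estimate(s, a, n_rollouts=300, horizon=50, gamma=0.95, seed=0):
    """Average discounted return over simulated episodes that start by
    taking action a in state s, then follow the rollout policy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s_cur, a_cur, discount, ret = s, a, 1.0, 0.0
        for _ in range(horizon):
            s_cur, r, done = model_step(s_cur, a_cur)
            ret += discount * r
            if done:
                break
            discount *= gamma
            a_cur = rollout_policy(s_cur, rng)
        total += ret
    return total / n_rollouts

def select_action(s, **kwargs):
    """Act greedily w.r.t. the rollout estimates — one step of value
    iteration applied to the rollout policy, at the current state only."""
    return max(ACTIONS, key=lambda a: q_estimate(s, a, **kwargs))
```

Since the rollouts for different actions (and different seeds) are independent, the loop over `n_rollouts` is trivially parallelizable across CPU cores.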
This one-step improvement, coupled with the fact that we don't estimate the optimal action-value function, that is, we don't estimate q*, results in an important conclusion: if we want the resulting policy to be optimal or close to optimal, we should make our rollout policy as good as we can. That is, we want our rollout policy to be only one policy improvement step behind the optimal policy. However, things are not so easy. If a rollout policy is good, that may mean it requires much time to decide what action to make. And here emerges the second instance of the time versus accuracy trade-off. If a policy is a very good one but a very slow one, we will not have enough time to generate sufficiently many rollouts. This in turn means that during one step of value iteration, we will rely on very noisy estimates of action values, and thus the performance may also degrade. So the disadvantage of such an algorithm is that it obviously relies on the quality of the rollout policy. However, and more surprisingly, such an algorithm works really well even if the rollout policy is random. Also, I would like to note that this algorithm does not preserve any estimates between successive states, and this may not be the best thing to do. Additionally, this algorithm does not care about the uncertainty in its action-value estimates. In fact, these last two drawbacks are the ones we are going to fix in a couple of slides. But for now, let us briefly summarize what we have learned so far. We have talked about planning, which is the process of obtaining a policy out of a model of the world. We have noted that in planning, we have a specific constraint, time, and many algorithms are designed to respect this constraint. We have also discussed that we may need some sort of heuristic to make a planning algorithm efficient. This heuristic should measure the approximate cost of achieving the planning goal from any particular state onward.
In the case of reinforcement learning, we have argued that the Monte Carlo estimate of the return may be an intuitive heuristic, and we can also build a policy on top of such a heuristic. For example, we may act greedily with respect to the Monte Carlo estimates, which corresponds to one step of value iteration. What we have not yet covered is how to preserve estimates between successive timesteps, how to deal with uncertainty in the estimates, and how to actually guide the search with a heuristic in a smart way. Now we are going to discuss these points through the example of the Monte Carlo tree search algorithm. [MUSIC]