Applying heuristic search to planning in reinforcement learning has

many advantages.

Basically, resources are focused only on the valuable paths, and

the nearest states contribute the most to the return.

However, if we don't know the precise model of the world and

approximate it with some sort of function approximation,

the model can in fact be less accurate than the current value estimates.

In this case, lookaheads based on the approximate model can spoil the learning

and make the estimates of an otherwise reliable value function less precise.

Remember, it only makes sense to perform planning in the model of the world

if the model is more precise than the current value function estimates.

So beware.

Another disadvantage of using heuristic

search is that it obviously depends on the quality of the heuristic.

We will talk about it a little bit later.

One way to obtain a heuristic is to estimate the returns with Monte Carlo.
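
To make this concrete, here is a minimal sketch of a Monte Carlo heuristic: we simulate several episodes from a state with a model and average the discounted returns. The names `rollout_return`, `mc_heuristic`, `chain_step`, and `always_right` are hypothetical and are not from the lecture, and the chain world is only a toy assumption used to keep the example runnable.

```python
import random


def rollout_return(state, step, policy, gamma=0.99, max_steps=100, rng=None):
    """Discounted return of one simulated episode starting from `state`.

    `step(state, action, rng)` is an assumed model interface that returns
    (next_state, reward, done); `policy(state, rng)` returns an action.
    """
    rng = rng or random.Random()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = step(state, action, rng)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total


def mc_heuristic(state, step, policy, n_rollouts=100, **kwargs):
    """Average several rollout returns to get a heuristic value for `state`."""
    returns = [rollout_return(state, step, policy, **kwargs)
               for _ in range(n_rollouts)]
    return sum(returns) / len(returns)


# Toy model (an assumption for illustration): a chain 0 -> 1 -> 2 -> 3,
# where reaching state 3 gives reward 1 and ends the episode.
def chain_step(state, action, rng):
    next_state = state + action
    done = next_state == 3
    return next_state, (1.0 if done else 0.0), done


def always_right(state, rng):
    return 1


estimate = mc_heuristic(0, chain_step, always_right, n_rollouts=5, gamma=0.5)
```

In this toy world the model and the policy are deterministic, so every rollout gives the same return. With a stochastic model or policy, the average over more rollouts would be a less noisy estimate.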

As I said previously, if we limit the horizon of a lookahead search,

we should estimate the value of possible continuations from the leaves onward.

These leaf node values can be computed with the help of function approximation.
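
As a rough illustration, a depth-limited lookahead with estimated leaf values could look like the sketch below. It assumes a deterministic model and at least one action in every non-terminal state. The names `lookahead_value`, `chain_actions`, `chain_model`, and the `leaf_value` callback are hypothetical; `leaf_value` stands in for whatever value estimate is available at the leaves, for example a learned approximator.

```python
def lookahead_value(state, depth, actions, model, leaf_value, gamma=0.99):
    """Value of `state` from a depth-limited search over a deterministic model.

    `model(state, action)` is an assumed interface returning
    (next_state, reward, done). At depth 0 the search stops and the
    leaf is scored with `leaf_value(state)`.
    """
    if depth == 0:
        return leaf_value(state)
    best = float("-inf")
    for action in actions(state):
        next_state, reward, done = model(state, action)
        # Terminal states have no continuation, so their future value is 0.
        future = 0.0 if done else lookahead_value(
            next_state, depth - 1, actions, model, leaf_value, gamma)
        best = max(best, reward + gamma * future)
    return best


# Toy model (an assumption for illustration): a chain of states 0..4,
# moving left or right, with reward 1 for reaching the terminal state 4.
def chain_actions(state):
    return (-1, 1)


def chain_model(state, action):
    next_state = max(0, state + action)
    done = next_state == 4
    return next_state, (1.0 if done else 0.0), done


# With zero leaf values, a depth-2 search from state 2 just reaches the goal.
deep = lookahead_value(2, 2, chain_actions, chain_model,
                       lambda s: 0.0, gamma=0.5)
# With a shorter horizon, the result depends entirely on the leaf estimate.
shallow = lookahead_value(2, 1, chain_actions, chain_model,
                          lambda s: s / 4, gamma=0.5)
```

The shorter the horizon, the more the result leans on the quality of the leaf estimates, which is why they matter so much here.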

We can learn the state values with function approximation,

as we did before in the model-free setting.

However, before today we were not allowed to use a model of the world, and

now we can try to make use of such a model

instead of relying on complex parametric function approximation.