So, in the last video we talked about the need to use

cross-sectional information on a known Monte Carlo path,

or historical path of the stock simultaneously when calculating an optimal policy.

To remind you why this is the case,

it's simply because the edge policy is a function

that defines the mapping for any inputs to the outputs.

But each observation only gives

some information about such function at one particular point.

And updating the function using only one point can be a very long and noisy procedure.

This means that the classical Q-learning algorithm

even though it's guaranteed to be converge us asymptotically,

need to be replaced by something more practical that converges faster.

Because we're working the setting Batch-Mode Reinforcement Learning,

perhaps we could do better than

the classical Q-Learning if we managed to do updates by looking

at all realizations of portfolio dynamics that

happened in data in order to decide on the optimal policy.

And fortunately, there exist extensions of

Q- Learning for such batch-mode reinforcement learning settings.

We will take most popular extension of Q-Learning to

batch reinforcement learning setting called Fitted Q Iteration or FQI for short.

This method was developed around 2005 and 2006 in

a series of papers by Ernest and co-workers, and by Murphy.

One noticeable point here is that

Ernest and co-workers considered time stationary problems,

where the Q- Function does not depend on time.

And also many published research in reinforcement learning literature deal

with an infinite horizon Q- Learning where a Q- function is independent of time.

Q- Learning for such stationary problems is somewhat different from Q-Learning that we

need in our problem which has a finite time horizon and therefore is time dependent.

But the paper by Murphy presented the version of batch-mode Q-Learning

that works for problems with a finite time horizon such as our problem.

Now, the Fitted Q Iteration method works with continuous valid data.

And therefore, we can now bring the model formulation

back to a general continuous state space case,

as was the case in our Monte Carlo searching for the dynamic programming solution.

But If we want to stick to a discrete space formulation,

Fitted Q iteration can be used in essentially the same way.

The only difference would be in a specification

of basic functions that the method needs to use.

So, to retaliate, the FQI method works by using

all Monte Carlo paths or historical paths for the replication portfolio simultaneously.

This is very similar to how we solve the problem using

Dynamic Programming with Monte Carlo method for the case of known dynamics.

In this method we averaged over all scenarios at time C and t plus one simultaneously,

by taking an empirical mean of different pathways, Monte Carlo scenarios.

Conditioning on that information set f t,

time t was implemented as conditioning on old Monte Carlo paths up to time

C. Now in the setting of batch-mode reinforcement learning,

the only thing we need to change in such Monte Carlo

setting is the structure of input output data.

In the setting of Dynamic Programming when the model is known,

the inputs are simulated or real for the paths of the state variable extreme.

And the outputs are the optimal actions,

an action policy and optimal Q-function.

That is the negative option price.

The optimal action, a star is determined by maximization of the optimal Q-function and

instantaneously words are computed in the course of

backward recursion for the optimal action and optimal Q- function.

This was the case for dynamic programming.

Now, in the setting of batch-mode Reinforcement Learning,

neither transition probabilities nor the policy or the work functions are known.

The required output is the same as before.

That is, we have to find option price and edge.

However, the input is now richer than in the previous case,

as now we have two more observations for each observed state vector Xt.

We know the action At and the reward r

t. Instead of this function PQ star and r t,

we now have only samples obtained from these functions in different states Xt.

And this is the setting of batch Reinforcement Learning where

the state variables at both times t and t plus one and

the rewards received for all Monte Carlo paths are used

simultaneously to find the optimal action and optimal Q-Function of time t,

for all states Xt.

And all actions At of Xt.

And this amounts to learning a policy.

With either simulated Monte Carlo data or real data,

batch-mode Reinforcement Learning works the same way.

Now, lets see how Fitted Q Iteration works in our problem.

The FQI method looks for

the optimal Q-Function using an expansion in the set of basis functions.

We can use the same set of basis functions phi n

of x that we used in the dynamic programming solution.

Now, recall that the reward is a quadratic function of actions.

Therefore, the Q-function will also be a quadratic function of actions,

and we present the Q function here as a product of a vector A,

matrix W and vector phi as shown in this equation.

Here At will be a vector of the size three,

made of units At,

and one half times a.t. squared.

Phi t is a vector of basis functions with size M,

where M is the number of basis functions.

And finally, Wt.

will be a matrix of coefficients.

Which will have the shape of three by M. This form for the Q-function captures

its quadratic dependence on the action while parameterising

the dependence on coordinate Xt with basis functions phi n of x,

and coefficients W. For what follows,

we can equivalently write the Q function in a slightly different form.

If we join the efficient matrix W and the vector phi into a new vector U underscore W,

which is indexed by W to emphasize that this vector depends on coefficient

W. We can also put terminal conditions on the vector U underscore W. For example,

if we said the terminal action At for capital T to 0,

this translates into terminal conditions for components of vector Uw that are shown here.

In this case only the first component is non-zero at maturity

and two other components vanish at time capital T. Now,

we can use this representation of the Q-function in two ways.

Let's first assume that parameters of matrix Wt are known.

Then because Q- function is quadratic in action a.t,

we can find the optimal action and optimal Q-function

in terms of components of vector Uw.

If you do the simple math,

you will find these two equations.

They define optimal action and Q-function in terms of

components of vector U underscore W. But in fact,

our objective here is rather the opposite.

We have to find components of matrix Wt or equivalently

components of vector U underscore W from observable actions and states.

In this case we would rather view these equations from their right to the left,

but then we have two equations for three unknowns.

This is still okay because these unknowns depend on

the same matrix Wt and are dependent in this sense.

But it also means that to solve our problem,

we have to find directly the matrix Wt from data.

And in the next video we will see how this can be done.