So, in the last video we introduced two parameterizations for the Q function.

One in terms of a matrix W_t, and another in terms of

a vector U_W, which was given by a product of the matrix W_t and a vector Phi of basis functions.

We saw that if we somehow observed

optimal actions a_t star and optimal Q-function values Q_t star,

this would give us two equations for the three unknown components of the vector U_W.

Now, this suggests that the right solution,

the one that would be found by the dynamic programming method if the dynamics are known,

should be in the space of solutions parameterized by

a time-dependent matrix W_t, as we did in our specification of the Q function.

So, we have to find a way to learn the matrix of coefficients W_t.

Let's see how this can be done.

What we need to do is to rearrange terms in

our definition of the Q function to convert it into

a product of a parameter vector and

the vector that depends on both the state and the action.

So, how do we do this?

We do it in several simple steps.

First, let's write explicitly the value of the Q function as

a trace of the matrix expression given here.

We write it as the element-wise product of elements of the matrix W_t with

elements of the matrix that is formed by the outer product of the vectors A_t and Phi_t.

This gives us the second line in this expression.

We have two types of multiplication here.

First, this symbol stands for the element-wise product of two matrices;

this product is also known as the Hadamard product of two matrices.

Second, this other expression has

the circled-cross product sign, and this sign stands for

a matrix given by the outer product of the two vectors A_t and Phi_t.

By definition, the element (i, j) of this matrix is given by the product of A_t,i and Phi_t,j.

So, so far so good,

we converted the whole expression into a trace of a product of two matrices.
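As a quick numerical check (a sketch, not part of the lecture; the size M = 4 is an illustrative assumption), the quadratic form, the Hadamard-product sum, and the trace form all give the same Q value:

```python
import numpy as np

M = 4                                  # number of state basis functions (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(3, M))            # coefficient matrix W_t, shape 3 x M
A = rng.normal(size=3)                 # action vector A_t
Phi = rng.normal(size=M)               # state basis vector Phi_t

outer = np.outer(A, Phi)               # outer product: outer[i, j] = A[i] * Phi[j]

Q_direct = A @ W @ Phi                 # original quadratic form
Q_hadamard = np.sum(W * outer)         # sum of the Hadamard (element-wise) product
Q_trace = np.trace(W @ outer.T)        # trace of a product of two matrices

print(Q_direct, Q_hadamard, Q_trace)   # all three values coincide
```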

But now, we can do one more thing and represent

this expression as a scalar product of two vectors.

What we have to do to this end is to convert

both matrices entering this expression to vectors.

We can do this by concatenating columns of

matrix W and the matrix given by the outer product of A_t and Phi_t,

and this gives us the third line in this equation.

Now, we can compactly write this whole expression as

a dot product of vector W and vector psi.

And here, the vector Psi is obtained by concatenating

the columns of the outer product of the vectors A_t and Phi_t.

This can be viewed as a new set of basis functions that depend both on state and action.
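To illustrate this vectorization step (again a sketch with an illustrative M = 4, not the lecture's code): concatenating the columns of a matrix is the same as flattening it in column-major (Fortran) order, and the dot product of the two flattened vectors reproduces the original expression:

```python
import numpy as np

M = 4
rng = np.random.default_rng(1)
W = rng.normal(size=(3, M))                  # coefficient matrix W_t
A = rng.normal(size=3)                       # action vector A_t
Phi = rng.normal(size=M)                     # state basis vector Phi_t

# Concatenating the columns of a matrix = flattening in column-major order
W_vec = W.flatten(order="F")                 # vec(W_t), length 3M
Psi = np.outer(A, Phi).flatten(order="F")    # Psi_t = vec of the outer product, length 3M

Q_dot = W_vec @ Psi                          # Q as a dot product of two vectors
Q_direct = A @ W @ Phi                       # equals the original expression
print(np.isclose(Q_dot, Q_direct))           # True
```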

The dependence on the action A_t is quadratic but the dependence on the state X_t

can be arbitrary and depends on the functional form

of the original basis functions Phi_n for the state space.

If we use, for example, cubic B-splines as

basis functions, as we did in the dynamic programming solution to the model,

then the x-dependence will be locally cubic.
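As an aside, such a local cubic basis can be built, for example, with SciPy's B-spline machinery; the knot grid and the evaluation point below are purely illustrative assumptions:

```python
import numpy as np
from scipy.interpolate import BSpline

# Illustrative uniform knot grid over a hypothetical state range
knots = np.linspace(3.0, 5.0, 12)
k = 3                                         # cubic B-splines

# Each basis function Phi_n is supported on k + 2 = 5 consecutive knots
basis = [BSpline.basis_element(knots[n:n + k + 2], extrapolate=False)
         for n in range(len(knots) - k - 1)]

x = 4.0                                       # a state value inside the grid
Phi_x = np.nan_to_num([b(x) for b in basis])  # Phi(x); nan outside support -> 0
print(Phi_x)                                  # only a few local entries are nonzero
```

Only the four basis functions whose supports cover x are nonzero there, which is what makes the x-dependence locally cubic.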

Now, let's compare what we got with the setting of dynamic programming.

In our current Reinforcement Learning setting,

we reduce the expression for the Q function to a dot product of

vector W and the new vector of state action basis function Psi_t.

Both have a number of components given by 3M,

so that the number of unknown coefficients in this problem is 3M as you could also

easily guess just by looking at the matrix W_t which has dimension three by M. So,

we have 3M unknowns here,

and now what about the number of observed variables?

For each time step in the batch-mode reinforcement learning setting,

we have 3N observables, because we observe the state X_t,

the action A_t, and the reward R_t for all

of the N historical or Monte Carlo paths for the stock.

So, we have 3N observables for 3M unknowns.

And if N is larger than M,

then our problem is well-posed,

we have N divided by M observations per free parameter in this setting.
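Since Q is now linear in the unknown coefficients, they can be fitted by ordinary least squares from the observed data. A minimal synthetic sketch of this idea (the sizes, the noise level, and the random data are all illustrative assumptions, not from the lecture):

```python
import numpy as np

M, N = 4, 200                          # M basis functions, N Monte Carlo paths (illustrative)
rng = np.random.default_rng(2)
W_true = rng.normal(size=3 * M)        # "true" coefficient vector vec(W_t), length 3M

# One observation per path: an action vector A and a state basis vector Phi
A = rng.normal(size=(N, 3))
Phi = rng.normal(size=(N, M))
Psi = (A[:, :, None] * Phi[:, None, :]).reshape(N, 3 * M)   # flattened outer product per path

Q_obs = Psi @ W_true + 0.01 * rng.normal(size=N)            # noisy Q-value targets

# Least-squares fit of the 3M unknowns; well-posed here since N is much larger than M
W_fit, *_ = np.linalg.lstsq(Psi, Q_obs, rcond=None)
print(np.max(np.abs(W_fit - W_true)))                       # small fitting error
```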

Now, we can compare these with

the situation we had with the dynamic programming approach.

In that case, we had to find two quantities,

the optimal policy and the optimal Q function.

In our dynamic programming solution,

we expanded both in M basis functions,

so we had 2M unknown coefficients in total.

This is less than 3M unknown coefficients in the Reinforcement Learning setting.

But on the other hand,

the number of observations per time slice in

the dynamic programming approach was also smaller than in the Reinforcement Learning approach.

And this is because in the dynamic programming setting,

we only observe the states.

So, for N Monte Carlo paths,

we only had N data points at each time step.

The number of observations per parameter was therefore N divided by 2M which is

actually less than the ratio of N divided

by M that we have for the Reinforcement Learning setting.

So, it appears that at least when the data for

Reinforcement Learning is generated from the dynamic programming solution,

both the reinforcement learning and dynamic programming methods

should perform about equally well.

This setting is called on-policy learning, and it means that the actions used

for training the Reinforcement Learning agent were actually optimal actions.

It turns out that this is indeed the case and

both algorithms produce nearly identical results in such a setting.

This actually gives us a simple way to benchmark

our Reinforcement Learning algorithms: because we can solve our model,

when it is fully specified, by means of dynamic programming,

we can compute the optimal policy from this solution.

Then we can just use the observed actions and

the rewards obtained with this policy as data for the

Fitted Q-iteration method and compute

the optimal policy and Q function using this method.

Because this solution is already known from dynamic programming,

we can use it to benchmark

our Reinforcement Learning solution and check the speed of convergence,

the final results, and so on.

This is actually very reassuring, as we now know exactly what we expect to see,

so any problems with Fitted Q-iteration would immediately show up.

If we now want to try some other Reinforcement Learning algorithm

instead of Fitted Q-iteration,

again, we have a good benchmark obtained with the dynamic programming approach.

Therefore, we can test

many different Reinforcement Learning algorithms with

our simple MDP model for option pricing.

Let's take a short pause here and continue in the next video with

the actual solution of the model with the Fitted Q-iteration method.