0:09

Welcome to week two of our course.

In this week, we will talk more about using reinforcement learning for the analysis of stocks.

We have talked about this topic in the last week of our course on reinforcement learning.

In this previous course we talked about how

various classical problems in finance such as optimal stock trading,

optimal portfolio liquidation, optimal portfolio management and

index tracking can all be solved by formulating them as problems of optimal control.

Once this is done, we find ourselves on familiar terrain: we can use either the methods of dynamic programming or reinforcement learning to solve these problems.

For example, if you work in trading and want to improve your strategies

in any of these tasks you can try these methods.

To make them work you will first need data.

The data you need might be of different forms depending on the settings.

One possible scenario arises when you actually work at

a trading desk, so that you have complete data that would include portfolio positions, trades done in the portfolio, and rewards received.

In addition, you would need market data relevant for your portfolio.

I also would like to remind you that rewards in this formulation are never directly observed, unlike portfolio positions or trades made.

They rather have to be computed from these quantities using parameters such as the risk-aversion parameter lambda, the impact parameter mu, and so on.
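To make this concrete, here is a minimal sketch of how such rewards could be computed from positions, trades, and market data. The quadratic forms of the risk and impact penalties below are illustrative assumptions, not the exact formulas of the course:

```python
import numpy as np

def compute_rewards(positions, trades, price_changes, lam=0.001, mu=0.01):
    """Hypothetical one-step rewards for a trading portfolio: mark-to-market
    P&L from the positions held, minus a risk penalty scaled by the
    risk-aversion parameter lam, minus a market-impact cost scaled by mu.
    The specific quadratic penalties here are illustrative assumptions."""
    pnl = positions * price_changes       # mark-to-market P&L per step
    risk_penalty = lam * positions ** 2   # quadratic proxy for portfolio risk
    impact_cost = mu * trades ** 2        # quadratic market-impact cost of trading
    return pnl - risk_penalty - impact_cost
```

The point of the sketch is that the reward is a deterministic function of observed quantities plus the parameters lambda and mu, which is exactly why unknown parameters imply unknown rewards.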

For this scenario, we have a classical reinforcement learning task, where you have to find an optimal action policy from these data.

A method that we discussed in our course on reinforcement learning was based on

an iterative solution of a self-consistent system of equations of G-learning.

Let me remind you that G-learning can be viewed as regularized Q-learning so

that the G function is given by the Q function with regularization by entropy.

Respectively, the self-consistent system of equations for G-learning involves three equations for three unknown quantities: the G function, the F function, and the policy pi.
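As a reminder of how these three quantities fit together, here is a minimal numerical sketch for a single state with a discrete set of actions. The function name, the inputs G and pi0 (a prior, or reference, policy), and the inverse temperature beta controlling the entropy regularization are illustrative assumptions:

```python
import numpy as np

def g_learning_update(G, pi0, beta):
    """For one state with discrete actions: given G-function values G[a],
    a prior policy pi0[a], and inverse temperature beta, return the free
    energy F and the entropy-regularized optimal policy pi.
    A schematic one-state sketch, not the full iterative solver."""
    weights = pi0 * np.exp(beta * G)   # pi0(a) * exp(beta * G(s, a))
    F = np.log(weights.sum()) / beta   # F(s) = (1/beta) * log of the sum
    pi = weights / weights.sum()       # pi(a|s) proportional to the weights
    return F, pi
```

For large beta the policy concentrates on the best action and F approaches the maximum of G, recovering standard Q-learning behavior; for small beta the policy stays close to the prior pi0.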

3:03

Because the dynamics in the presence of market impact are non-linear,

we use an iterative method to solve this system.

The method performs an iterative linearization of the dynamics, and then updates the G function and the F function, written as quadratic expansions around a reference value of the state variable.

But as we said before, this setting with observed states, actions, and rewards is not the only possible setting.

Imagine, for example, that you have historical market data and historical portfolio data made of portfolio positions and trades, that is, actions, but you do not know the rewards received.

Such a problem may arise even if you have portfolio trading data, but this data was obtained using trading strategies that do not use the mean-variance Markowitz-type optimization that we assumed when we set up the problem.

In such a case, a trader may not even know what risk aversion lambda the data correspond to.

But even if a trader does not think in terms of maximization of risk-adjusted returns that involves a risk-aversion parameter lambda, the trader's actions can still be consistent with some value of lambda.

However, if this value or values of lambda are unknown, it also means that we do not know the rewards either, because rewards need to be computed from states and actions by relations that depend on the parameters lambda and mu.

So, what can we do in this case?

Well if rewards are not observed,

then instead of reinforcement learning we can use inverse reinforcement learning.

Inverse reinforcement learning or IRL deals with

problems where we only observe states and actions but not rewards.

The problem of IRL is to

find the actual reward function and the optimal policy from data.

In general, it is a more complex problem than direct reinforcement learning, because now we have to find two functions rather than just one function from data.

However, if we deal with a parametric model, then both the policy function and the reward function are functions of the same set of parameters that includes lambda, the impact parameter mu, and other parameters.

In this sense, finding the reward function and the action policy function becomes the same problem, as both are expressed in terms of the same set of parameters.

In this setting, IRL becomes almost as

easy or as hard as the direct reinforcement learning.

In particular, if we work with stochastic policies as we

did in our course on reinforcement learning,

then the resulting policy would be a probability distribution.

Once we have it as a function of original model parameters we can estimate

these parameters simply by using maximum likelihood for observed trajectories.

Once these parameters are found, we can compute rewards, as they depend only on states, actions, and the model parameters.

Therefore, by doing maximum likelihood on observed trajectories with a parametric model, we also compute the reward function.
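As a schematic example of this maximum-likelihood step, suppose, purely for illustration, that the stochastic policy is Gaussian with a mean that is linear in the state with slope minus lambda. Then lambda can be estimated by minimizing the negative log-likelihood of the observed actions; the linear policy form and the function names below are assumptions of this sketch:

```python
import numpy as np

def neg_log_likelihood(lam, states, actions, sigma=1.0):
    """Negative log-likelihood of observed actions under a hypothetical
    Gaussian policy with mean = -lam * state and standard deviation sigma.
    The linear dependence on the state is an illustrative assumption."""
    mean = -lam * states
    n = len(actions)
    return (0.5 * np.sum((actions - mean) ** 2) / sigma ** 2
            + 0.5 * n * np.log(2.0 * np.pi * sigma ** 2))

def fit_lambda(states, actions, grid):
    """Maximum-likelihood estimate of lambda over a grid of candidates."""
    nll = [neg_log_likelihood(lam, states, actions) for lam in grid]
    return grid[int(np.argmin(nll))]
```

A grid search is used here only to keep the sketch dependency-free; in practice one would minimize the negative log-likelihood with a numerical optimizer.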

Finally, one more interesting problem formulation

is obtained when neither rewards nor actions are observed.

Two main questions here are: first, where can we encounter such settings, and second, how we should proceed.

Let's first discuss why such problem formulation can be interesting.

I can think of at least two problem formulations where it can be of interest.

The first one arises in intraday trading.

If you work for a large dealer whose trades can substantially move the market via the market impact of trading, you may want to know the strategies of your competitors.

You can see market prices, but you cannot directly observe the actions of your competitors.

However, if you have an estimate of the portfolio of your competitor, and an estimate of the planning horizon of the competitor, then you can still do inference of your competitor's action policy if you treat their actions as unobservable, or hidden, variables.

In this case, you can use algorithms that work with hidden variables to make inferences.

One such algorithm is the EM algorithm that

we discussed several times in the specialization.
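For intuition, here is a minimal EM sketch for a toy hidden-variable model: observed price moves x = theta * a + noise, where the hidden actions a are standard normal variables and theta is an unknown impact parameter. EM alternates between inferring the hidden actions (E-step) and re-estimating theta (M-step). This linear-Gaussian model and all names in it are illustrative assumptions, not the full competitor-inference setup:

```python
import numpy as np

def em_fit_theta(x, noise_var, theta0=1.0, n_iter=50):
    """EM for the toy model x = theta * a + eps, with hidden a ~ N(0, 1)
    and eps ~ N(0, noise_var). Returns the estimated parameter theta.
    An illustrative hidden-variable example, not the full IRL model."""
    theta = theta0
    for _ in range(n_iter):
        # E-step: Gaussian posterior of each hidden action a_t given x_t
        post_var = noise_var / (theta ** 2 + noise_var)
        post_mean = theta * x / (theta ** 2 + noise_var)
        # M-step: maximize the expected complete-data log-likelihood in theta
        theta = np.sum(x * post_mean) / np.sum(post_mean ** 2 + post_var)
    return theta

# Simulate data with a known theta, then recover it from the observations alone
rng = np.random.default_rng(0)
a = rng.standard_normal(2000)                   # hidden actions (never observed)
x = 1.5 * a + 0.5 * rng.standard_normal(2000)   # observed market states
theta_hat = em_fit_theta(x, noise_var=0.25)
```

Each EM iteration is guaranteed not to decrease the likelihood of the observed data, which is what makes the algorithm attractive when actions are hidden.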

Another setting where we only observe states but not actions arises when

we consider market dynamics using the approach of inverse reinforcement learning.

This actually will be the topic that we will discuss in our next video.