In this video, we will explore

the flexibility of the MDP formalism with a few examples.

By the end of this video,

you will gain experience formalizing

decision-making problems as MDPs,

and appreciate the flexibility of the MDP formalism.

Consider a recycling robot which collects

empty soda cans in an office environment.

It can detect soda cans,

pick them up using its gripper,

and drop them off in a recycling bin.

The robot runs on a rechargeable battery.

Its objective is to collect as many cans as possible.

Let's formulate this problem as an MDP.

We will start with the states, actions, and rewards.

Let's assume that the sensors can only distinguish

two charge levels, low and high.

These charge levels represent the robot's state.

In each state, the robot has three choices.

It can search for cans for a fixed amount of time,

it can remain stationary

and wait for someone to bring in a can,

or it can go to the charging station

to recharge its battery.

We only allow recharging from the low state

because recharging is pointless

when the energy level is high.
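The state and action sets just described can be sketched as a small lookup. This is a minimal sketch, assuming string labels for the states and actions; the names are illustrative, not part of the lecture's notation:

```python
# Two battery states, as the sensors can only distinguish low and high.
states = ["high", "low"]

# Available actions per state; recharging is only allowed from low,
# since recharging is pointless when the energy level is high.
actions = {
    "high": ["search", "wait"],
    "low":  ["search", "wait", "recharge"],
}
```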

Now, let's consider the transition dynamics.

First, let's draw the states using open circles.

Searching for cans when the energy level is

high might reduce the energy level to low.

That is, the search action in the state

high might leave the state unchanged,

say with probability Alpha,

or the energy level might drop to

low with probability one minus Alpha.

In both cases, the robots search

yields a reward of r_search.

For instance, r_search could be plus 10

indicating that the robot found 10 cans.

The robot can also wait.

Waiting for cans does not drain the battery,

so the state does not change.

In either state, the wait action yields a reward of r_wait.

For example, r_wait could be plus one.

Searching when the energy level is

low might deplete the battery,

then the robot would need to be rescued.

Let's write this probability as one minus Beta.

If the robot is rescued then its battery is restored.

However, needing rescue yields

a negative reward of r_rescued.

For example, r_rescued could be minus

20 because we were annoyed with the robot.

Alternatively, the battery might not run out.

This occurs with probability beta and

the robot receives a reward of r_search.

Taking the recharge action restores the battery

to the high level and yields a reward of zero.

That's it. We have completely

specified the MDP for the recycling robot problem.
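The dynamics we just walked through can be collected into one transition table. Below is a minimal Python sketch; the reward values come from the examples in the lecture (plus 10, plus 1, minus 20), while the particular values of Alpha and Beta are assumptions for illustration:

```python
# Illustrative parameter values; alpha and beta are assumptions.
alpha, beta = 0.9, 0.6
r_search, r_wait, r_rescued = 10, 1, -20

# p[(state, action)] = list of (probability, next_state, reward),
# encoding the recycling-robot dynamics described above.
p = {
    ("high", "search"):  [(alpha, "high", r_search),
                          (1 - alpha, "low", r_search)],
    ("high", "wait"):    [(1.0, "high", r_wait)],
    ("low", "search"):   [(beta, "low", r_search),
                          (1 - beta, "high", r_rescued)],
    ("low", "wait"):     [(1.0, "low", r_wait)],
    ("low", "recharge"): [(1.0, "high", 0)],
}

# Sanity check: outcome probabilities sum to 1 for every state-action pair.
for outcomes in p.values():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9
```

Note that when the battery is depleted during a low-energy search, the next state is high, because the rescue restores the battery, but the reward is r_rescued.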

We have discussed one example where an MDP

is used to precisely specify a problem.

But you might wonder, how general is this framework?

The MDP formalism can be used

in many different applications,

in many different ways.

States can be low-level sensory readings,

for example, the pixel values of a video frame.

They can also be high-level such as object descriptions.

Similarly, actions can be low-level,

such as the wheel speed of this robot.

Actions can also be high-level,

such as go to the charging station.

Time-steps can be very small or very large.

For example, they can be one millisecond or one month.

Let's look at one more application.

Suppose we want to use reinforcement learning to control

a robot arm in a pick-and-place task.

The goal of the robot is to pick up

objects and place them in a particular location.

There are many ways we can formalize this task.

Here's one possibility.

The state could be the readings of

the joint angles and velocities.

The actions could be the voltages applied to each motor.

The reward could be plus 100 for

successfully placing each object.

But we also want the robot to

use as little energy as possible.

So let's include a small negative reward

corresponding to the energy used.
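The reward signal just described can be sketched as a small function. This is one possible formalization, assuming hypothetical names (`placed_object`, `energy_used`, `energy_cost`) and an assumed penalty scale, not the course's definitive implementation:

```python
def reward(placed_object: bool, energy_used: float,
           energy_cost: float = 0.01) -> float:
    """+100 for successfully placing an object, minus a small
    penalty proportional to the energy used on this time-step."""
    return (100.0 if placed_object else 0.0) - energy_cost * energy_used

# e.g. a successful placement that used 50 units of energy:
# reward(True, 50.0) -> 99.5
```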

That was not so hard.

To recap, the MDP framework can be used to formalize

a wide variety of sequential decision-making problems.