[MUSIC] I'm going to illustrate how to use approximate dynamic programming and reinforcement learning to solve high-dimensional problems. This is something that arose in the context of truckload trucking. Think of this as Uber or Lyft for truckload freight, where a truck moves an entire load of freight from A to B, from one city to the next. Here we have to pick the best driver on the left to assign to the best load on the right. We don't have to move all the loads, and I have to think about where each load is going and what might happen next, which is uncertain. Let's start with just one truck driver, the easier problem. Imagine we're a new truck driver starting off in Texas, and somebody has come along and shown us four loads of freight, and I have to pick the best one to move. Each of the loads is going to a different destination I've never been to before, so my estimate of the downstream value of each destination is zero, because I'm brand new. Looking at all of them, I say, gee, the load going to New York pays $450. That looks best, so I'll pick that one. Before I move, I update the value of being in Texas to $450, because that's how much I made: the $450 plus the downstream value of zero. Now let's move to New York and repeat the whole process. To go back to Texas I would add the $125 to the $450 to get $575, but there's a load going to California paying $600, so I'm going to take that one. So now we set the value of New York to $600 and go to California. Now I'm looking at various loads. One pays $400 going to Minnesota, but I've never been to Minnesota, so my best estimate of its value is zero, simply because I've never been there. We're going to come back to that problem. For the moment, the best load looks like the one going to Texas, because I add the $350 to the downstream $450 to get $800. So we go to Texas.
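The single-truck walk just described can be sketched in a few lines: pick the load maximizing pay plus the current downstream value estimate, then record what you earned as the value of the city you left. A minimal sketch; the dollar amounts and the New York destination follow the narration, while the three alternative loads are hypothetical stand-ins.

```python
# Single-truck example from the narration: greedy load choice using the
# current downstream value estimates, then a simple value update.

values = {}  # estimated downstream value of being in each city, default 0

def choose_load(loads):
    """Pick the load maximizing immediate pay + downstream value estimate."""
    return max(loads, key=lambda l: l["pay"] + values.get(l["dest"], 0.0))

# In Texas with four loads; every downstream value starts at zero,
# so the $450 New York load looks best. (Other loads are hypothetical.)
loads = [
    {"dest": "New York", "pay": 450},
    {"dest": "Georgia", "pay": 300},
    {"dest": "Ohio", "pay": 275},
    {"dest": "Florida", "pay": 225},
]
best = choose_load(loads)

# Before moving, record what we made as the value of being in Texas:
# the $450 pay plus the downstream value of zero.
values["Texas"] = best["pay"] + values.get(best["dest"], 0.0)
print(best["dest"], values["Texas"])  # → New York 450.0
```

Repeating this at each city reproduces the New York and California steps in the narration.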
Now I'm in Texas again, looking at various loads, and there's one going to Colorado. I've never been to Colorado, but it pays $800. The last time I was in Texas I made $450, so my old estimate of the value of being in Texas is $450, but now I've got this $800 load. What I'm going to do is smooth these two together using a smoothing factor of 0.1, a learning rate or step size, where I weight the old estimate by one minus the step size. So 0.9 times the old estimate of $450, plus 0.1 times the new estimate of $800, gives me an updated estimate of the value of being in Texas of $485. This is classic approximate dynamic programming and reinforcement learning. The RL community has many variations of what I just showed you, one of which would fix issues like, gee, why didn't I go to Minnesota, because maybe I should have gone to Minnesota. But these are also methods that only work for one truck. What if I have a fleet of trucks and I'm actually a trucking company? Let's imagine I have a five-by-five grid, so there are 25 locations, and my truck can be in any one of those 25 locations. But if I'm a trucking company with two trucks, my state is no longer the location of my one truck; my state is the state of the fleet, and I have all these different combinations of where to put two trucks. Maybe I have six trucks, and now I have all these permutations and combinations. In fact, as I grow the fleet, the number of states the fleet can be in is given by this formula, and as I grow the fleet size from one to five to 50, and grow the number of locations or attributes of the truck, I get a massively large number of states. So classical discrete-state representations, best known in reinforcement learning, simply would not work for a fleet of trucks. So here I'm going to show you how to extend this to a fleet of trucks.
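The smoothing update is one line of arithmetic, and the state-space explosion can be checked numerically. The formula referenced on the slide isn't shown in the transcript; the count below assumes indistinguishable trucks spread over locations, which gives the standard multiset coefficient C(N + L - 1, N). A sketch under that assumption:

```python
from math import comb

# Smoothing update from the narration: blend the old estimate of being
# in Texas ($450) with the new observation ($800) using step size 0.1.
old_estimate, new_estimate, alpha = 450.0, 800.0, 0.1
updated = (1 - alpha) * old_estimate + alpha * new_estimate
print(round(updated))  # → 485

def fleet_states(num_trucks, num_locations):
    """Number of ways to distribute identical trucks over locations:
    the multiset coefficient C(N + L - 1, N). (Assumed form of the
    slide's formula; attributes beyond location would grow L further.)"""
    return comb(num_trucks + num_locations - 1, num_trucks)

# A 5x5 grid gives 25 locations.
print(fleet_states(1, 25))   # 25 states for one truck
print(fleet_states(2, 25))   # 325 states for two trucks
print(fleet_states(6, 25))   # 593,775 states for six trucks
print(fleet_states(50, 25))  # astronomically many for a 50-truck fleet
```

Even this toy count shows why enumerating discrete fleet states is hopeless for realistic fleet sizes.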
Let's imagine I have a set of drivers and a set of loads, and I have to assign drivers to loads. This is an assignment problem. Here are my drivers and here are my loads, and over time drivers will enter and leave the system and new loads will be called in. I have to make these assignments sequentially, and I need to think about the events that will happen in the future to make the best decision now. So take one truck that can move either of two loads, each with a downstream value. I take those downstream values and add them to the contribution I earn from making the initial assignment to a load. Now I just have a simple assignment problem, and if I grow that to an entire fleet, it becomes a network problem, which some call a linear program. There are packages for this, Gurobi and CPLEX, and they're available in Python; if you're at a university, they're free, and you can solve very large problems very, very quickly. Now, how do I get those downstream values? I know how to estimate the value of one truck, but how do I handle a fleet? Here's what we're going to do. We solve our assignment problem, assigning three drivers to three loads; the fourth driver has nothing to do, so he just stays, and each assignment has a downstream value. Now take away the first driver, for example, remove him and re-solve. I get the change in the total contribution for the whole system, and that's the marginal value of that one driver with attribute a1, which we'll call v-hat. So this is the difference between the total contribution of the whole system with and without the driver. Now we do the smoothing just as we did with one truck, where we have the attributes of our one driver.
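The with-and-without calculation above can be sketched directly: solve the assignment problem, then remove one driver, re-solve, and take the difference as that driver's marginal value v-hat. This is a sketch, not the lecture's actual implementation: brute-force enumeration stands in for the LP/network solver (Gurobi or CPLEX would be used at scale), and the contribution matrix is hypothetical.

```python
from itertools import permutations

def solve_assignment(contrib):
    """Max-total assignment of drivers (rows) to loads (columns).
    Drivers may outnumber loads; unassigned drivers contribute 0.
    Brute force stands in for an LP solver on this tiny instance."""
    n_drivers, n_loads = len(contrib), len(contrib[0])
    best = 0
    for drivers in permutations(range(n_drivers), min(n_drivers, n_loads)):
        total = sum(contrib[d][j] for j, d in enumerate(drivers))
        best = max(best, total)
    return best

# Hypothetical contribution of assigning driver i to load j, already
# including the downstream value of the load's destination.
contrib = [
    [500, 420, 380],
    [450, 480, 300],
    [400, 350, 460],
    [300, 310, 290],  # fourth driver: weakest on every load, left idle
]

total_with = solve_assignment(contrib)

# Marginal value (v-hat) of the first driver: remove him and re-solve.
total_without = solve_assignment(contrib[1:])
v_hat_0 = total_with - total_without
print(total_with, total_without, v_hat_0)  # → 1440 1240 200
```

With a real LP solver this difference comes out automatically as the dual variable on that driver's resource constraint, which is the point made next.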
We have the old smoothed estimate v-bar, the v-hat is the updated marginal value of the one driver, and we get an updated smoothed estimate of the marginal value. Now I have to do this for each driver, so I'll have to loop over the drivers. However, if we can solve this as a linear program, LP packages will give you this marginal value, called a dual variable, for free. In other instances, you have to compute these numerical derivatives yourself, but it's all quite scalable. At the end, we have these smoothed estimates, the v-bars. Note that v-bar is not the value of a driver, it's the marginal value of a driver. But this is a linear program, and I can scale it to hundreds, even thousands of drivers. This is the overall algorithm. The green is the linear program, where I have to call a solver; x is a vector here, and the v-hat is a vector as well, the marginal value of each of these drivers. The blue is where I update these marginal values, the v-bars, and the red is where I simulate forward in time. So let's illustrate that simulation. The blue square here is my problem now, with a value in the future. I'm going to simulate my way into the future, and I'm going to do this iteratively. Over the iterations I get better and better estimates of these marginal values, and the overall quality of the solution tends to increase, not uniformly, you'll see some little blips and dips, but generally, with a well-designed algorithm, it keeps improving. This has been a very brief introduction to handling high-dimensional resource allocation problems. If you go to the website jungle.princeton.edu, you'll find a lot more material; just look under educational materials. For hints on how to program this, look for the Python modules on my GitHub directory, and look for the blood management problem, and you'll see a nice illustration.
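The overall green/blue/red loop can be sketched end to end: solve the dispatch problem with current value estimates (green), smooth the observed marginal values into the v-bars (blue), and simulate forward in time (red), repeating over many iterations. This is a single-truck toy so the "solver" is a trivial max; the cities, pay ranges, and iteration counts are all illustrative, not from the lecture.

```python
import random

random.seed(0)

cities = ["TX", "NY", "CA", "MN", "CO"]
v_bar = {c: 0.0 for c in cities}  # smoothed marginal-value estimates
alpha = 0.1                       # smoothing step size from the lecture

def random_loads(origin, k=3):
    """Generate k candidate loads out of `origin` with random pay."""
    dests = random.sample([c for c in cities if c != origin], k)
    return [(d, random.uniform(100, 900)) for d in dests]

def run_iteration(start, horizon=10):
    """One forward pass: dispatch + smoothing update at each time step."""
    loc = start
    for _ in range(horizon):
        loads = random_loads(loc)
        # Green: solve the (here trivial) dispatch problem using v_bar.
        dest, pay = max(loads, key=lambda dl: dl[1] + v_bar[dl[0]])
        v_hat = pay + v_bar[dest]
        # Blue: smooth the new observation into the estimate.
        v_bar[loc] = (1 - alpha) * v_bar[loc] + alpha * v_hat
        # Red: step forward in time.
        loc = dest

# Iterate; estimates tend to improve, with some blips and dips.
for _ in range(50):
    run_iteration("TX")
```

For a fleet, the green step becomes the assignment LP and the blue step smooths one v-bar per driver attribute, but the shape of the loop is the same.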