So what are we really doing here? We're asking questions of the form: what would happen if we ran this experiment a thousand times, ten thousand times? We can't actually run the experiment over and over, especially not with patients, but really with any kind of data, because data acquisition is an expensive process. So we reason about the results theoretically: we make some assumptions about the underlying distribution, use those to derive the sampling distribution of the statistic we're interested in, and then reason about that using the typical values in the textbook tables. In this case the statistic was a difference between means, but it really could be any derived statistic.

So what if, instead of either doing the experiment again, which is not possible, or analyzing it theoretically, which forces us to make assumptions about the underlying distribution, we just ran a simulated experiment? If we can't draw new data from the original population, maybe that's okay: we already have a representative sample of that population, the data we've already collected. What if we just re-draw samples from that sample in order to simulate these additional experiments?

Let me give you an example. We started off with two cohorts of patients here: those who received the test treatment and those who received the standard treatment. If the difference between these two cohorts is significant, then when we scramble the labels, mixing up the data so that we don't know which values are green and which are blue, and keep repeating this until they're all mixed up, we should see a significantly different result than we did on our actual data. Our data should be rare, while repeated scramblings of the data should paint out a different kind of distribution. And we can do that experiment just by writing some code. So, I've scrambled the labels here manually and I recompute the means.
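This label-scrambling step can be sketched in a few lines of Python. The recovery-time numbers below are made up for illustration; the lecture's actual dataset isn't shown here.

```python
import random
from statistics import fmean

# Hypothetical recovery times in days; the lecture's real data aren't shown.
treatment = [19, 22, 25, 26, 30, 32, 36, 38]
standard  = [24, 29, 31, 35, 38, 41, 44, 47]

def shuffled_difference(a, b):
    """Scramble the cohort labels once and return the new difference in means."""
    pooled = a + b
    random.shuffle(pooled)        # now we no longer "know" which values were which
    new_a = pooled[:len(a)]       # the first len(a) values get the treatment label
    new_b = pooled[len(a):]       # the rest get the standard-treatment label
    return fmean(new_a) - fmean(new_b)
```

Each call to `shuffled_difference` simulates one relabeled experiment; calling it over and over is what paints out the distribution.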
And then I recompute the difference in the means, and now I get a number: 10.3 days. And I can keep doing this. I can compute the difference in the means after ten trials, and I get two values that fall in this range and a bunch of other instances that fall in other ranges. Then I can do it with 100 trials, and it paints out a different kind of distribution; you can see the shape starting to emerge. With 1,000 trials you sort of see the bell curve, and with 10,000 trials you see a nice clean bell curve. It looks like it's centered around 0, which is suspicious, and it agrees with the result we found using the classical method. Our value of 12.4 days was right over here, and if you count up the number of times that the difference in means exceeded our value, it's 33% of them. So 33% of the trials produced similar or greater results. That's nowhere near enough for us to conclude that this 12.4 couldn't have happened by chance; it's right there in the width of the distribution.

So we did this shuffle test, where we scrambled the labels, recomputed the means for the two cohorts of patients, took the difference between those means, and plotted it on a histogram, repeating this experiment 10,000 times. The distribution that gets painted out in this histogram approximates the actual sampling distribution that we tried to derive earlier using classical methods. Except there we had to make some assumptions about the underlying distribution and go slowly, deliberately, and carefully to make sure we were interpreting all those terms correctly. Here we really didn't have to do very much work at all. And that makes sense, because what we're actually doing is simulating the experiments that all that classical work is designed to accommodate the absence of.
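The whole 10,000-trial shuffle test fits in one short function. This is a sketch, not the lecture's exact code: following the description above, it counts how often a shuffled difference is at least as large as the observed one (a one-sided count; a two-sided test would compare absolute values instead).

```python
import random
from statistics import fmean

def permutation_test(a, b, trials=10_000):
    """Shuffle the cohort labels `trials` times and return the fraction of
    shuffles whose difference in means is at least the observed difference."""
    observed = fmean(a) - fmean(b)
    pooled = a + b
    count = 0
    for _ in range(trials):
        random.shuffle(pooled)
        diff = fmean(pooled[:len(a)]) - fmean(pooled[len(a):])
        if diff >= observed:      # as large or larger than what we actually saw
            count += 1
    return count / trials         # this fraction is the p-value

# With the lecture's data this fraction came out around 0.33 -- far too
# common for the observed 12.4-day difference to be called significant.
```

When the two cohorts really do differ, almost no relabeling reproduces the observed gap and the returned fraction is tiny; when they don't, shuffling changes little and the fraction is large.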
You can't actually do those experiments in practice, so you have to work very hard to reason about what would happen in general by making some assumptions. Here, we just do the experiments directly, and so we arrive directly at the end result that the classical methods have been working toward. And we can state the significance quite clearly: that 33% is indeed the p-value you may have heard about, the fraction of trials that produced a result at least as extreme as ours purely by chance.