So, backing up a step: we talked about a shuffle test for significance. That was a resampling method too, but a slightly different one, because we shuffled the labels. And then we talked about bootstrap samples to derive a confidence interval, or really any other statistic you may need. One question is, why do we even need the shuffle test at all? Why can't we just use the confidence interval itself? After all, it's possible to reason about significance with a confidence interval: you can check whether zero falls within the interval or not. The reason is subtle, and it has to do with what you're starting from. When you compute a confidence interval, you're already asserting that there is a difference, say in the means, and asking how precisely you've measured it. When you test significance, you start from the null hypothesis and decide whether to reject it, and you can see intuitively when that might make a difference.

Consider an admittedly contrived case where you have only two data values: one patient on the treatment and one on the placebo, and we're trying to measure the difference between them. Let's imagine this is still survival days: with the treatment the patient survived longer, 36 days, and with the placebo the patient survived only 27 days. A 95% bootstrap confidence interval will range from 9 to 9, a degenerate, perfectly tight interval, because every resample of one treatment value and one placebo value gives exactly the same difference. With the shuffle test, you shuffle the labels, run a bunch of experiments, and quickly determine that the p-value is about 0.5, which is of course not significant: half the time the difference is 9, the other half it's negative 9, so the p-value comes out quite high. (The first code sketch below simulates exactly this.) So you always want to start with the null hypothesis that there is no difference, test its significance with the shuffle test, and then measure the confidence interval as a second step.

Okay, so, caveats about the bootstrap. When is it dangerous to use? Really, it's pretty robust; it actually makes fewer assumptions than a lot of the classical methods, as we've described. It may underestimate the width of the confidence interval for very small samples, but in these data science regimes, these big data regimes, very small samples are not usually the concern; we typically have plenty of data.

Bootstrapping can't be used to estimate the minimum or maximum of a population, and you should consider why that might be. When we're taking averages, it works pretty well, but the minimum and maximum are extremely sensitive to individual extreme values: if even one value, as you scan through the data, is higher, then that value is the maximum, and a single point can have an arbitrarily large effect on the statistic. If you take a bootstrap sample with replacement, you might miss the one outlier that dramatically changes the value, so you'll get a distribution that doesn't match the sampling distribution of the maximum. (The second code sketch below shows this.) More generally, outliers can cause trouble with bootstrapping, but they cause trouble with any method, so this isn't a particular weakness of resampling methods; in fact, in a moment we'll see one thing you can do about outliers. But yes, the bootstrap is a little bit sensitive to them.
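Here is a minimal sketch of that shuffle test for the two-value example. The survival times (36 and 27 days) come from the example above; the variable names and the number of shuffles are just illustrative choices.

```python
import random

# Two observations from the contrived example: one treatment survival time
# and one placebo survival time, in days.
values = [36, 27]
observed_diff = values[0] - values[1]   # 9 days

n_shuffles = 10_000
count = 0
for _ in range(n_shuffles):
    shuffled = values[:]
    random.shuffle(shuffled)            # randomly reassign the treatment/placebo labels
    if shuffled[0] - shuffled[1] >= observed_diff:
        count += 1                      # shuffled difference at least as large as observed

p_value = count / n_shuffles
print(f"observed difference: {observed_diff} days, shuffle p-value ~ {p_value:.2f}")
# Only two labelings are possible here, so about half the shuffles reproduce the
# 9-day gap and the p-value lands near 0.5 -- not significant.
```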
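And here is a small sketch of why bootstrapping the maximum fails. The sample values are made up for illustration, with one large outlier driving the maximum.

```python
import random

random.seed(0)

# A made-up sample where a single outlier determines the maximum.
sample = [12, 15, 14, 11, 13, 16, 12, 14, 95]

boot_maxima = []
for _ in range(10_000):
    resample = random.choices(sample, k=len(sample))   # draw with replacement
    boot_maxima.append(max(resample))

# A bootstrap maximum can never exceed the observed maximum, and roughly a third
# of resamples miss the outlier entirely, so the bootstrap distribution of the
# maximum does not match the true sampling distribution.
missed = sum(m < 95 for m in boot_maxima) / len(boot_maxima)
print(f"share of resamples that missed the outlier: {missed:.0%}")
print(f"largest bootstrap maximum seen: {max(boot_maxima)}")
```

Contrast this with the mean, where no single resampled point dominates the statistic and the bootstrap distribution behaves much better.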
Resampling can also be done incorrectly on more complex examples if you fail to preserve the original sampling structure. We saw this a little bit with the confidence interval example just now, where you need to take a bootstrap sample of one cohort and a separate bootstrap sample of the other cohort, as opposed to pooling them together and resampling from the pool. There are ways to do it in a pooled fashion, but in general, the resampling structure needs to match the experiment you're trying to simulate. (The first code sketch at the end of this section shows the per-cohort version.)

Another case: if the data are not independent, you can't apply the bootstrap directly, and it may be tempting to do so, because it may not be obvious that it isn't valid. There are sometimes tricks you can play, though. One trick, when you have a sequence of mutually dependent values so that the independence assumption doesn't hold, is to break the sequence into non-overlapping sub-sequences. Take a time series of stock prices, say, which are of course not independent, since every price depends on the one before it. Break that long history into non-overlapping chunks, treat each chunk as an individual observation, compute some statistic on each chunk, and then do the resampling on those chunk-level statistics. That's been shown to work. (The second code sketch at the end of this section illustrates the idea.)

So there are sometimes ways to apply this general technique creatively; it's not limited to a few examples, certainly not the few examples I'm going to show. Think of it as a very general approach: write a program, run a simulation that mimics the experiment you want. You may have to think carefully about how to apply it in a safe manner, but it is very general. As we mentioned, the maximum doesn't work. And of course, non-representative samples will always cause trouble, and they do with the bootstrap as well; it's not going to magically cure issues you have with the data.

One quote I copied down that I rather like: it's tempting to say, oh, just let the data speak for themselves. Nate Silver's point is that the numbers have no way of speaking for themselves. There's always going to be an interpretation of them, always some knowledge about where the data came from, always biases lurking in there that only the source of the data, and the context in which it was collected, can tell you about. So, putting the statistician's hat on, as opposed to, say, the machine learning or computer scientist's hat: understanding the source of the data and the experimental design that led to it, or, if you weren't privy to the experimental design, understanding the other kinds of biases and contexts in which the data were sampled, is always going to be crucially important, whether you're using resampling methods or classical methods.
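Here is a minimal sketch of the per-cohort resampling described above. The cohort values are invented for illustration, and the percentile interval is one simple way to turn the bootstrap differences into a confidence interval.

```python
import random

random.seed(1)

# Invented survival times (days) for two cohorts -- the structure is the point,
# not the particular numbers.
treatment = [34, 41, 29, 38, 36, 44, 31]
placebo   = [27, 25, 30, 22, 28, 26]

def mean(xs):
    return sum(xs) / len(xs)

n_boot = 10_000
diffs = []
for _ in range(n_boot):
    # Resample WITHIN each cohort, preserving the original sampling structure.
    # Pooling the two groups and resampling from the mixture answers a different
    # question -- pooled shuffling is what the significance test does.
    t_star = random.choices(treatment, k=len(treatment))
    p_star = random.choices(placebo, k=len(placebo))
    diffs.append(mean(t_star) - mean(p_star))

diffs.sort()
ci = (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1])
print(f"95% percentile CI for the difference in means: {ci}")
```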
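And here is a sketch of the chunking trick for dependent data. The random-walk "prices", the chunk length of 20, and the per-chunk statistic (average daily change) are all arbitrary choices for illustration.

```python
import random

random.seed(2)

# A made-up dependent series: a random walk standing in for daily prices.
prices = [100.0]
for _ in range(199):
    prices.append(prices[-1] + random.gauss(0, 1))

# Break the history into non-overlapping chunks and compute one statistic per
# chunk; the chunks are then treated as roughly independent observations.
chunk_len = 20
chunks = [prices[i:i + chunk_len] for i in range(0, len(prices), chunk_len)]
chunk_stats = [(c[-1] - c[0]) / len(c) for c in chunks]   # average daily change

# Ordinary bootstrap over the chunk-level statistics.
n_boot = 10_000
boot_means = []
for _ in range(n_boot):
    resample = random.choices(chunk_stats, k=len(chunk_stats))
    boot_means.append(sum(resample) / len(resample))

boot_means.sort()
print("95% interval for the average daily change:",
      (boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot) - 1]))
```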