[MUSIC] One idea that came up in talking about publication bias was the idea of a meta-analysis. This is looking back over previous studies and combining their results. In 1977, Glass statistically aggregated the findings of 375 psychotherapy studies to disprove a claim that psychotherapy was useless, and this was the coining of the term meta-analysis. It builds on earlier ideas from other statisticians, including Fisher, who has this quote: "When a number of quite independent tests of significance have been made, it sometimes happens that although few or none can be claimed individually as significant, yet the aggregate gives an impression that the probabilities are on the whole lower than would often have been obtained by chance." So the individual tests are a little weak, but we can aggregate them to get a more powerful result. The reason I want to bring up this idea of meta-analysis is that it becomes more important in the context of data science, because you'll often be working with data that you did not yourself collect, so thinking of your work as a meta-analysis is potentially useful. Another point is that big data may have become big because it was combined from multiple different sources, and understanding when that is okay to do and when it isn't is important. So how do we do a meta-analysis? It's pretty simple: you take a weighted average of the independent studies. How do you define the weights? You can define them in different ways, but you want to give more weight to the more precise studies, the ones that have more power. A very simple method is just to weight by sample size, so the weight for study i is the number of samples in study i divided by the sum of all the samples, the total sample size.
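As a small sketch of that sample-size weighting, here is what the calculation looks like with some made-up effect estimates and sample sizes (these numbers are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical effect estimates and sample sizes from three independent studies.
effects = np.array([0.30, 0.55, 0.42])   # per-study estimated effect
n = np.array([25, 100, 400])             # per-study sample size

# Sample-size weights: w_i = n_i / (total sample size)
weights = n / n.sum()

# Weighted average of the study estimates.
pooled = np.sum(weights * effects)
print(weights)   # the biggest study dominates
print(pooled)
```

Note how the study with 400 samples gets most of the weight, so the pooled estimate sits closest to its result.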
A more sophisticated way is to use the inverse-variance weight, which is one over the standard error squared. I'm not going to give you the formula for the standard error; you can look it up, and there are lots of variants. The main idea is to understand why it's called inverse-variance: if the variance of a study is very, very high, you want to give that study lower weight, because it means it wasn't a very precise study. That could be because the sample size was low, or it could be for other reasons. And the standard error is one common way of capturing the variance in this particular case. But again, it's important to understand the intuition rather than just memorizing the formulas. And then, as usual, there's a caveat here: this is all for a fixed-effect model, which assumes that every individual study is measuring the same true effect. There are random-effects models that account for cases where they may not be, but we're not going to talk about random-effects models. Okay, so finally, one more effect that you saw on the plot in the context of publication bias was the funnel plot. If you remember, the funnel plot had high variance on one side and went to lower variance on the other. The general term for this, which I just wanted to introduce you to, is heteroskedasticity. This is when the variance itself is not constant. So here, as an example, the variance is high on one side, low in the middle, and high again on the other side. Now, this is not necessarily a problem. There are ways to correct for it, and it's not necessarily a problem because the estimates that you generate are still unbiased. But it can inflate your overall error estimates, leading to a reduction in statistical power. You end up with really high error numbers because all of these high-variance points count against you.
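The inverse-variance version of the pooling, for a fixed-effect model, can be sketched the same way (again with invented study numbers; the standard errors here are placeholders, not computed from real data):

```python
import numpy as np

# Hypothetical per-study effect estimates and their standard errors.
effects = np.array([0.30, 0.55, 0.42])
se      = np.array([0.25, 0.12, 0.06])   # noisier studies have larger SEs

# Inverse-variance weights: w_i = 1 / se_i^2, then normalize to sum to 1.
w = 1.0 / se**2
w = w / w.sum()

# Fixed-effect pooled estimate and its standard error.
pooled = np.sum(w * effects)
pooled_se = np.sqrt(1.0 / np.sum(1.0 / se**2))
print(pooled)
print(pooled_se)   # smaller than any individual study's SE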
Right? Even though you're actually doing a pretty good job in your predictions. Let me just say how this plot was generated. This was, again, simulated data, where I intentionally varied the variance along the x-axis. So I chose some x values and then sampled y values according to a distribution whose variance varies in this way. And here I took the exact same x values but repeated the sampling of the y values many, many times. That gave me this clearer picture of the spread, and you can see these solid vertical bars because the same x values were used across all the experiments. The point here is that drawing the regression line over and over again, it didn't change all that much. We didn't get anything that looks like this other plot, for example. So again, the problem is that you might increase the error and lose statistical power, causing you to overlook a real effect. [MUSIC]
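A simulation in the spirit of the one described above (the exact x grid, noise model, and true slope here are my assumptions, not the lecture's) shows both claims: the fitted slope stays centered on the truth, but its spread is wider than it would be with constant variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed x values, reused across every experiment.
x = np.linspace(-3, 3, 61)
# Heteroskedastic noise: standard deviation grows away from the middle.
sigma = 0.2 + np.abs(x)

slopes = []
for _ in range(500):
    # Same x values each time; only the y values are re-sampled.
    y = 2.0 * x + rng.normal(0.0, sigma)   # true slope is 2
    slope, intercept = np.polyfit(x, y, 1)
    slopes.append(slope)

print(np.mean(slopes))  # close to 2: the OLS estimate is still unbiased
print(np.std(slopes))   # but the estimates are noisier, costing statistical power
```

The repeated fits cluster around the true slope, which is the "regression line didn't change all that much" observation; the extra spread in the slope estimates is where the lost power comes from.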