Let's now go beyond simple, naive averaging and incorporate the size of the reviewer population. This will lead to what we call Bayesian ranking. But before we can go there, we first have to look at what's called Bayesian estimation, named after the mathematician Bayes, whose work was then further extended by the famous mathematician Laplace. And Laplace raised a very interesting question: what's the chance that the sun will rise tomorrow, if you have observed that it has been rising every morning for the past, say, 100 million days? Now, you may think this is a funny question. But, for now, ignore the fact that you know something about the underlying physics of why the sun rises. Purely view this as a question about observation: something has been coming up every morning for the last 100 million days, and what's the chance you predict it will come up again tomorrow? Can you say it will be exactly 100 percent, mathematically? Again, we're not talking about underlying physics, just about this being such an overwhelming number. So, what should it be? This is an interesting thought experiment that we can simplify a little further. Suppose I give you a sequence of n experiments, and I say that in s out of the n experiments, for s smaller than n, I observed a one. Okay? And I ask you the question: what's the chance the next experiment also returns one? Now, without going through the foundations of probability theory, intuitively, the answer is just s over n. That is, in the past, s out of n runs gave us a result of one as our observation. So, call this question number one. Suppose now I switch to another question and say that I also run n experiments. And the experiment is actually that there's a coin that's loaded, and I flip it. If it's heads, then I return one. If it's tails, then I return zero. And again, in s out of the last n runs, I observe one. I'll ask you the question: what's the chance that next time you'll also see one?
You may say, hold on, isn't this question the same as the last question? Shouldn't the answer also be s over n? Well, not according to the Bayesian view of probability. The Bayesian view is a very powerful and, for some time, controversial view in the probability theory community. What it says is that, now that I've given you some prior information that the experiment consists of flipping a loaded coin, you'll be able to build a model of that coin based on the observations in the last n rounds. And therefore, your answer may be different. What is that answer? Let's derive it in the next five minutes, and I'll come back to answer this question: why is it different from s over n? The essence of the Bayesian view, the philosophical underpinning, is captured in this picture. You've got an underlying model. Later in the course, we'll also see different kinds of latent factor models, hidden models, and reverse engineering of network topology as well as protocols. Philosophically, they follow a similar spirit. In this simple question, the underlying model is captured by a single parameter p: the chance that a single flip of this loaded coin will result in heads, and therefore an observation of one. Now, different p's will clearly lead to different observations. That goes without saying. But the Bayesian view also says that different observations tell me something about p. And the more observations I have, the better I can build a model for p, from which I can then do forward engineering to predict the future outcome. You first reverse engineer p before you make a prediction. So, in our case, we say that if you know the value of p, then we know that the chance of seeing s heads out of n flips is simply a binomial distribution. It is p to the power s, because we observed s such cases.
One minus p to the power n minus s, because we observed n minus s such cases. And there are n choose s possible arrangements of a sequence with s out of n being one. So, this is just a binomial distribution, and we all know that for a fixed p. What is now flipping our heads around and turning the table is this: since that's the case, then the probability distribution of p, let's call it f of p, should be proportional to this observation frequency. If we count the frequency of heads in the observations, then that tells us something about the underlying probability distribution of this value p. Let's just pause for one second, because while it sounds intuitive, it's actually counter-intuitive at first sight. We're saying: let's build a model of p, and that model says p's distribution should be proportional to our observations. This proportionality principle is what Laplace did to extend Bayes' understanding of the relationship between observation and model. And I say it's proportional because this by itself is not a distribution. We have to normalize it by integrating over all the possible p's, and p can range from zero to one. Then this is indeed a probability distribution, okay? And as a function of p, that's the probability distribution f of p. So, all we need to do now is evaluate this integral and then do the division. Skipping the detailed steps, because that's not what we care about in this course, we get the following answer: it's (n + 1) factorial, over s factorial times (n − s) factorial, times p to the s, times (1 − p) to the (n − s). Okay? So, that is the probability distribution of p as a function of n and s: as a function of how many runs you have carried out, and how many ones you have observed.
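As a quick sanity check of that normalization step, we can numerically integrate p^s (1 − p)^(n − s) over [0, 1] and confirm it equals s!(n − s)!/(n + 1)!, which is exactly why the constant in f of p comes out as (n + 1)!/(s!(n − s)!). This is just an illustrative sketch; the values of n and s below are arbitrary.

```python
from math import factorial

def unnormalized(p, n, s):
    # the un-normalized posterior: p^s * (1 - p)^(n - s)
    return p**s * (1 - p)**(n - s)

def numeric_integral(n, s, steps=100000):
    # midpoint-rule integration of the un-normalized posterior over [0, 1]
    h = 1.0 / steps
    return sum(unnormalized((k + 0.5) * h, n, s) for k in range(steps)) * h

n, s = 10, 7  # arbitrary example values
closed_form = factorial(s) * factorial(n - s) / factorial(n + 1)
print(numeric_integral(n, s))  # matches the closed form below
print(closed_form)
```

Dividing by this integral is what turns the proportionality into a genuine probability distribution.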
And now, we're basically done, because we know the conditional probability of observing another one, given a particular value of p, is just p, right? Because we know what p is; conditioned on that, the chance of getting a head on a coin flip is just p. So, we have to uncondition it by integrating this answer p against the probability distribution f of p, for p from zero to one. And, substituting that complicated-looking expression for f of p and doing the integral, again getting the steps from integral tables, the answer is remarkable, and remarkably simple: (s + 1) over (n + 2). Let's recap. We started by saying that you want to make a prediction. But hold on: first, you know something about a model. So, based on your observations, you first build a model around p, and that's how we get the distribution of p. Then you make the prediction, and the answer is (s + 1) over (n + 2), which is clearly not s over n. So, mathematically, you know this is the right answer if you follow the Bayesian analysis. But, intuitively, why not s over n? Well, here's one way to make yourself feel comfortable intuitively. The moment I tell you this prior information about a model, it's as if I've told you that I've run two trials for free: one trial showed heads, one trial showed tails. So, when you tally, you should say it's n plus two trials, and the number of heads observed is s plus one, okay? Because the impact of that prior knowledge about the underlying model is, roughly speaking, intuitively speaking, equivalent to having two free runs, one showing heads and one showing tails. So now, we can take this philosophy to understand how to rank competing products of the same category on Amazon. And we know that seeing heads show up in 100 out of 103 runs is going to tell us something different than observing ten heads out of thirteen runs.
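The final integration step above can also be checked numerically: integrate p times f of p over [0, 1] and compare it with the closed-form answer (s + 1)/(n + 2). The sketch below uses the 100-out-of-103 example from the recap; the numbers are illustrative only.

```python
from math import factorial

def f(p, n, s):
    # posterior density of p after observing s heads in n flips
    c = factorial(n + 1) / (factorial(s) * factorial(n - s))
    return c * p**s * (1 - p)**(n - s)

def predict_next_head(n, s, steps=200000):
    # E[p] = integral of p * f(p) dp over [0, 1], via the midpoint rule
    h = 1.0 / steps
    return sum((k + 0.5) * h * f((k + 0.5) * h, n, s) for k in range(steps)) * h

n, s = 103, 100
print(predict_next_head(n, s))  # close to (s + 1)/(n + 2)
print((s + 1) / (n + 2))        # Laplace's answer, 101/105
print(s / n)                    # the naive answer, 100/103 -- noticeably different
```

Note that with no observations at all (s = n = 0), the formula gives 1/2: the prior alone says heads and tails are equally likely, which is exactly the "two free runs" intuition.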
Similarly, on Amazon, we know that seeing a lot more reviewers for one product compared to its competitor should give us some information. So, suppose you've got a list of products indexed by i. For each product i, there are n sub i reviews, and the simple, naive average rating is r sub i, okay? Then, look across all of these products in the same category. You can add up the ni's; call that total number of reviews for all the products in the category N, okay? And then, you can look at the summation of the ni ri's, divided by N; that is the total naive average across the whole product category; call that R. So, if I know a certain product i got a lot of reviews, ni, relative to N, I should put more trust in the corresponding ri. Whereas, if there are very few reviews ni relative to N, then I would rather put my trust in the category average, okay? It's like saying that if this product got a lot of reviews, I should trust its own average. But if it got too few reviews relative to N, then I should pretend it is an average product across the entire category. And, it turns out that the Bayesian adjusted value, ri tilde, follows the following expression: it is NR, which really is the summation of the ni ri's, plus ni ri, divided by N plus ni. So, we can view this as a sliding ruler. On one end is product i's own average review, ri; on the other end is the total average across all products in the category, R. And we say that ri tilde, the Bayesian adjusted rating average, is somewhere between the naive average of your own product and the total average, depending on how much I can trust this ri. If this ri has a lot of reviews ni backing it up, I'll move closer to ri. If it's got very few of them, then I'll just say, you know what? I think I would rather not trust this ri that much, and move toward R, the average rating of all the products.
So, ri tilde is sliding somewhere in between, and it's sliding according to this formula. Let's take a look at two extreme cases. In one extreme case, the ni for a certain product is so small that we can ignore it compared to N. Say ni is nearly zero: say it's one, and N is ten trillion. Alright. So, we can pretty much ignore the ni ri term and the ni term, and in that case, ri tilde is just R. It moves all the way to that extreme end. Or, let's say ni for a certain product is N over two, so half of all the reviews in this category go to this particular product. Relative to the other competing products, this product i got a lot of reviews, so we should be able to trust its rating more. If we substitute ni = N over two into this expression, what we see is that ri tilde now equals two-thirds R plus one-third ri. So, it's moving much closer to ri now. And it turns out that this formula is used exactly in the Internet Movie Database, IMDb's, ranking of the top 250 movies of all time, based on all the ratings of these movies, okay? It doesn't use a simple naive average, because some movies got a lot more reviews and ratings than other movies. Instead, it uses this Bayesian adjustment, (NR + ni ri) over (N + ni). The relative sample size of the reviews, ni, relative to the total across all the movies, plays a role here. Here's another example: Beer Advocate. This is a website that ranks all the different kinds of beers around the world. And it uses almost the same formula, okay? Each beer i's Bayesian adjusted rating, r tilde i, equals N min, and I'll explain what this is in a minute, times R, plus ni ri, over N min plus ni. Here, N min is the minimum number of ratings that you need in order to even show up in the top beer ranking on this website. So, why use N min instead of N, the total accumulated number of ratings?
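The sliding-ruler formula and its two extreme cases can be sketched in a few lines. This is purely illustrative: the category totals and star ratings below are made up, and the function just implements the expression (NR + ni ri)/(N + ni) from above.

```python
def bayesian_rating(n_i, r_i, N, R):
    # Bayesian adjusted average: slides between the category average R
    # and the product's own average r_i, weighted by review counts.
    return (N * R + n_i * r_i) / (N + n_i)

# Hypothetical category: N = 1000 total reviews averaging R = 3.5 stars.
N, R = 1000, 3.5

# Extreme one: a product with a single review slides almost all the way to R,
# despite its own 5-star average.
print(bayesian_rating(1, 5.0, N, R))

# Extreme two: a product holding half the category's reviews stays much
# closer to its own average: (2/3)*R + (1/3)*r_i = 4.0 here.
print(bayesian_rating(N // 2, 5.0, N, R))
```

The second call reproduces the two-thirds/one-third split derived above for ni = N/2.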
One reason is that, if you use N, then as time goes by, the total N would only increase, and the dynamic range of r tilde i would keep shrinking. So, Beer Advocate decided to use a twist on this formula and say: instead of looking at the total number of ratings, just look at the minimum number of ratings that you need in order to show up on the website. So, in fact, you can pick this term to be N, or N min, or some other number that somehow corresponds to a notion of the total review population. What about Amazon, then? Does Amazon use Bayesian analysis in its ranking? And what else does it use? We'll come to this in the next and final segment of this lecture's video. But first, let me highlight two drawbacks, some limitations, of this Bayesian ranking. Number one, this is useful for ranking, and therefore you must have a backdrop of all the competing products' numbers of ratings. If you just want to adjust the number of stars of a single product, say, 4.5 stars from 121 reviews, can you do that by itself? No: you must also know about the competing products. How many reviews, and what naive number of stars did they receive? Only then can you do the adjustment that leads to a ranking of the products. That's why, on Amazon's home page, you can click through a number of links to look at the ranking, which actually incorporates the Bayesian adjustment. But if you look at just the number of stars, they still show you the simple naive average number, and you have to process in your own head the impact of the number of ratings, somehow according to your own notion of N. The second limitation is that this formula implicitly assumes a Gaussian distribution of the review scores. As we mentioned, they often have a bipolar, bimodal distribution, or in general a multimodal distribution, in which case we would get a slightly different and more complicated formula. But we will not have time to go through that set of material.
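The dynamic-range point is easy to see numerically. The sketch below reuses the same adjustment formula, once with a fixed, hypothetical N min and once with a growing total N; all the numbers are made up for illustration.

```python
def adjusted(n_i, r_i, pool, R):
    # same formula as before; 'pool' is either the total N or a fixed N_min
    return (pool * R + n_i * r_i) / (pool + n_i)

R = 3.5              # category-wide average rating (hypothetical)
n_i, r_i = 300, 4.8  # one beer's own review count and naive average
N_min = 100          # hypothetical listing threshold

# With a fixed N_min, this beer's adjusted rating does not move as the
# site accumulates more ratings overall:
print(adjusted(n_i, r_i, N_min, R))

# With the ever-growing total N, the same beer is squeezed toward R,
# shrinking the dynamic range of the adjusted ratings over time:
for N in (1000, 10000, 100000):
    print(N, adjusted(n_i, r_i, N, R))
```

As N grows, every product's adjusted rating collapses toward R, which is exactly why Beer Advocate pins the weight to N min instead.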
So, what we're going to do now is go through two examples on Amazon, a small one and a larger one, to figure out what Amazon does and how it uses the Bayesian viewpoint to do ranking.