What made Galton's experiment so successful in demonstrating that simple averaging sometimes works? There are three key factors that determine what "sometimes" means.

The first is the definition of the task. In Galton's example there is a correct answer to speak of, an objective ground truth: the actual weight of the ox. In many of the settings we are looking at, however, there is no objective answer. It is very hard to say, "here is the ground truth, the correct average rating should be this, and simple averaging does not achieve it." We have no systematic way to make that argument.

The second is unbiased and independent estimates; both properties are very important. First, unbiased: everybody got a good view of the ox, with no glass door between them and the animal to distort the view. Second, independent: everybody put a number into the bin independently of the others. Each person wrote down a number and dropped it into the bin without looking at anyone else's. This is very important indeed. In about two lectures' time, we will look at a scenario where dependence among actions breaks the wisdom of crowds and turns it into an information cascade.

The third is that enough people participate. In Galton's case, 787 entries was apparently a big enough crowd. If there had been only two people, it probably would not have worked. So somewhere between two and 787, for this particular task with unbiased, independent estimates, simple averaging worked quite well.

Now, back to Amazon's simple averaging numbers: 4.5 stars, 3.8 stars. How can we do a better job of turning that vector of reviews into a scalar? What approaches might beat simple averaging? Quite a few disciplines can help.

One is natural language processing. If a five-star rating comes with review text full of superlatives and very emotional terms, or a one-star rating comes with text that is clearly emotional and charged with bias, maybe we can discount those reviews.

The second is statistics. It is often believed that only those who really care about a product or service bother to enter a review on Amazon. Naturally, there are people so negative about something that they feel they ought to say so on Amazon, and people so happy with it that they feel the same. Sometimes the product itself draws a bimodal distribution of opinions, because some people like it and some do not. But independent of that, there is also a bimodal effect in how the overall population is sampled: most people in between just do not care enough, positively or negatively, to enter reviews. By understanding the underlying statistics, we may be able to process this average better.

The third is signal processing of various kinds. We can look at the timing of these reviews, how recent each one is, and whether there is a time-dependent effect like the one we briefly mentioned in the last lecture on Netflix. Later we will look at a specific method, Bayesian estimation, and perhaps at voting: maybe, instead of just aggregating into a scalar like this, we should ask each person to rank all the competing products, for example all the LCD HDTVs out there, and then aggregate the lists from Alice, Bob, Chris, and so on to generate a ranking order that way.
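To make that rank-aggregation idea concrete, here is a minimal sketch of one simple aggregation rule, a Borda-style positional count; the lecture does not commit to any particular rule here, and the voters and products below are made up for illustration.

```python
# Minimal sketch of rank aggregation with a Borda-style count.
# Each person submits a complete ranking of the competing products,
# most preferred first. Voters and products are made up for illustration.
from collections import defaultdict

rankings = {
    "Alice": ["TV_A", "TV_B", "TV_C"],
    "Bob":   ["TV_B", "TV_A", "TV_C"],
    "Chris": ["TV_B", "TV_C", "TV_A"],
}

def borda_aggregate(rankings):
    """Give each product (n - position) points per list, then sort by total."""
    scores = defaultdict(int)
    for ranked_list in rankings.values():
        n = len(ranked_list)
        for position, product in enumerate(ranked_list):
            scores[product] += n - position  # top choice earns n points, last earns 1
    # Higher total score means higher aggregate rank.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(borda_aggregate(rankings))
# [('TV_B', 8), ('TV_A', 6), ('TV_C', 4)]
```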
Aggregating rank-ordered lists in this way would be a very different approach from condensing each product's vector of reviews into a scalar and then ranking the scalars: each individual gives me a complete ranking of the competing products, and I aggregate those lists. It is impractical for Amazon, because very few people are in a position, or in the mood, to produce a complete ranking of all the competing products. But in other scenarios, such as voting, which we will encounter in the next lecture, it will be useful.

Before proceeding further, let's look at a couple of quick examples. The first is from a particular day in 2011, when we picked two HDTVs, one from Philips and one from Panasonic. The Philips got four stars out of 121 reviews; that is a simple arithmetic mean, simple averaging. The Panasonic got almost 4.5 stars, but out of only 54 reviews. So between four stars from 121 reviews and four and a half stars from 54 reviews, which one would you buy? How much do you trust the four or four and a half stars, depending on the number of reviews, and how do we quantify that? More generally, if we generated a whole histogram over one, two, three, four, and five stars and looked at the bar chart, would the spread of the stars also make a difference? The average might be the same, yet an evenly spread histogram would look very different from one with no two-, three-, or four-star ratings and only one- and five-star ratings. On the one hand, that kind of histogram should alert you, because opinion on the product or service is very polarized. On the other hand, you may say, "you know what, I am among those who hold this belief, and therefore it suits me very well."

The second example, again from a particular month in 2011, looks at reviews of an iPod Touch. One chart records the most helpful reviews: we look at the reviews of the reviews, that is, the number and percentage of people who found each review useful, keep only the reviews believed to be very useful, and record their associated numerical ratings. The other chart records just the most recent reviews. You can see a clear difference between the two. Maybe there was a generational upgrade of the product past a certain point in time, and we would like to detect when that happened, which we cannot do by looking only at, say, the average scores.

More generally, a rating, or a ranking based on ratings, is not a static object; we have to look at the entire time series. (You could take a full course on time series analysis.) Say there are three different behaviors in this cartoon, A, B, and C, where the time axis is on the scale of, say, months; if the time scale were minutes, the fluctuations would not matter much. So suppose this is on the scale of months, and these are the ratings you receive, aggregated over each week. Pattern A is clearly cyclic, and you may say this product is not receiving a stabilized review. Pattern B seems to be stabilizing around a pretty good number, a little above four. Pattern C is hard to tell; it has not stabilized yet. Can I trust the average over a moving window of a certain duration, say the most recent three months, or not? It is actually hard to tell.

So now we have raised many questions, starting from whether we can trust simple averaging, to all kinds of statistical, signal-processing, and natural-language-processing questions about how to enhance our ability to understand the process of turning a vector into a scalar and then ranking based on it.
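Before moving on, here is a minimal sketch of the moving-window average just mentioned; the weekly numbers are made up, and the 13-week window is just one rough way to capture "the most recent three months."

```python
# Minimal sketch of a trailing moving-window average over weekly ratings.
# The weekly averages are made up; a 13-week window is one rough way to
# capture "the most recent three months."

def moving_average(weekly_ratings, window=13):
    """Average of the most recent `window` weekly ratings."""
    recent = weekly_ratings[-window:]
    return sum(recent) / len(recent)

# Pattern B: seems to stabilize a little above four stars.
pattern_b = [3.6, 3.9, 4.2, 4.0, 4.1, 4.2, 4.1, 4.2, 4.1, 4.2, 4.2, 4.1, 4.2]
# Pattern A: cyclic, swinging between roughly 3 and 4.5 without settling.
pattern_a = [3.0, 4.5, 3.1, 4.6, 3.0, 4.4, 3.2, 4.5, 3.1, 4.6, 3.0, 4.5, 3.1]

print(round(moving_average(pattern_b), 2))  # ~4.08, close to where it stabilizes
print(round(moving_average(pattern_a), 2))  # ~3.74, which hides the oscillation
```

As the two printed numbers suggest, the windowed average is informative for a pattern that has stabilized, but it can mask a cyclic pattern entirely, which is exactly why it is hard to tell whether to trust it.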
Unfortunately, we do not have as many answers. Most of the questions we just raised do not have very stable answers in the context of Amazon, which is why it is still an art to decide when you should trust the average rating, or the ranking based on average ratings, that Amazon provides. We will come back to this example very soon. What we are going to do now is look at two cases where we do have a pretty good answer. One is averaging across a crowd, the wisdom of crowds. The other is Bayesian estimation, which adjusts the ranking based in part on the number of ratings.
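To preview the flavor of that second case, here is a minimal sketch of one common Bayesian-style adjustment: blend each product's observed average with a prior mean, so that products with few reviews are pulled more strongly toward the prior. The prior mean and pseudo-count below are assumptions made for this illustration, not the formula the later lecture will derive; the two products are the HDTVs from the example above.

```python
# Minimal sketch of a Bayesian-style adjusted average: blend each product's
# observed mean rating with a prior mean, weighted by the number of reviews.
# The prior mean (3.5) and pseudo-count (25) are assumptions made for this
# illustration, not values taken from the lecture.

def adjusted_rating(mean_rating, num_reviews, prior_mean=3.5, pseudo_count=25):
    """Shrink the observed mean toward the prior; fewer reviews pull harder."""
    return (pseudo_count * prior_mean + num_reviews * mean_rating) / (pseudo_count + num_reviews)

# The two HDTVs from the example above.
philips   = adjusted_rating(4.0, 121)  # 4 stars from 121 reviews
panasonic = adjusted_rating(4.5, 54)   # 4.5 stars from only 54 reviews

print(round(philips, 2), round(panasonic, 2))  # 3.91 4.18
# The Panasonic still comes out ahead, but its lead shrinks because its
# average rests on fewer reviews.
```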