Welcome to lecture five of Networks, Friends, Money and Bytes. The question we are going to formulate and answer today is an important one in our daily lives: how can I trust an average rating on Amazon?

We have already talked about recommendation twice. Google's PageRank recommends web pages by turning a graph of hyperlinked web pages into a rank-ordered list. Netflix's recommendation engine, like the collaborative filtering methods used in the Netflix Prize, turns the bipartite user-movie graph, or the user-movie table, into a set of rank-ordered lists, one list per user. So the Netflix recommendation is individualized, while PageRank's is not. Today's topic is recommendation by Amazon, specifically recommendation in the form of an average rating. In the next lecture we will keep talking about aggregating multiple pieces of information into a single one.

Today's aggregation says that for each product there is a list of scores. We want to turn that vector into a scalar, so that we can compare many products by their scalar ratings, which in turn leads to a ranking based on the scalarized version of the vector. You may say that one easy way to convert a vector into a scalar is simply to take the average, and indeed that is what is shown on Amazon: on a product page there is an average number of stars, say 4.5 stars or 3.8 stars. That is a scalar representation of the entire vector. Few people have the time to go through all the reviews, or all the ratings in each review, so this is a nice, quick summary. Many of us base our purchase decisions, in part, on this average number.

But the question is: when can we trust it? A very simple version of the question: should you buy a product with 4.5 stars and two reviews, or one with 4 stars and 1,000 reviews? Here the contrast is so big that most people would pick the second one: with 1,000 reviews it must be a very robust 4 stars, whereas with only two reviews we don't know what would happen if there were more. But what if the choice is instead between 4.3 stars with 100 reviews and 4.1 stars with 200 reviews? That choice is much harder to make on intuition alone. So how can we quantitatively understand this choice? A quick numerical sketch of this comparison follows in a moment.

Review systems are used in many places, not only on Amazon but, for example, at every step of the academic enterprise: applications to graduate school, hiring of tenure-track assistant professors, paper review, proposal review, tenure and promotion. We all rely on peer review systems. In general, a review system, especially one like Amazon's online store, consists of three parts: a rating, a text review, and reviews of reviews. A rating is simply a number, usually one to five stars, but it could also be one to ten, or minus three to plus three; in any case it is some number, usually an integer. A review is a text, which could be a couple of lines or five pages. Reviews of reviews could be thumbs up or down, or "did you find this helpful?", which is again a binary review; it could even be a text review of the review, if there is a thread allowing rebuttals back and forth. We can also implicitly gauge the quality of a review through, for example, the reputation score of the reviewer. So there are many challenges in understanding how to best use a rating system.
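To make the scalarization concrete, here is a minimal Python sketch, not from the lecture itself, with hypothetical rating vectors chosen to reproduce the numbers above. It collapses each vector of star ratings into a scalar by simple averaging and shows why the average alone cannot distinguish a 4.5-star product with 2 reviews from a 4.0-star product with 1,000 reviews.

```python
# Minimal sketch: scalarizing a vector of star ratings by simple averaging.
# The rating lists below are hypothetical examples, not real Amazon data.

def average_rating(ratings):
    """Simple average: collapses the whole rating vector into one scalar."""
    return sum(ratings) / len(ratings)

product_a = [5, 4]                               # 4.5 stars from only 2 reviews
product_b = [4] * 500 + [5] * 250 + [3] * 250    # 4.0 stars from 1,000 reviews

for name, ratings in [("A", product_a), ("B", product_b)]:
    print(name, round(average_rating(ratings), 2), "stars from", len(ratings), "reviews")

# The simple average reports 4.5 > 4.0, yet most buyers would trust product B:
# the average itself carries no information about the size of the review population.
```

Applied to 4.3 stars from 100 reviews versus 4.1 stars from 200 reviews, the same code produces two numbers that are close together, which is exactly where intuition, and the plain average, stop helping.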
One challenge is the gatekeeper problem. Can anyone enter a review, or only those who can be verified to have actually bought the product or used the service? Is it anonymous, or is it ID-based? If it is anonymous, how can we prevent people from entering random reviews, or competitors from entering bad reviews, or the owner of the product or service from entering really good reviews?

There is also the issue of scale. If we give a choice of one to ten, seven often becomes the mode of the distribution, sometimes with a bimodal shape around it. If instead the scale is one to five stars, the distribution looks different; but is that a large enough dynamic range? If the scale is minus five to plus five, it changes people's psychology, because a minus sounds really bad. If the scale is only one to three, it really narrows people's choices. So gatekeeping and scaling matter before we can even start talking about rating systems.

Then we face the challenge we just mentioned: the number of reviews. This is a particular focus of this lecture. We know that more reviews are better, assuming they are not random or artificial reviews; if they are truly valid reviews, we want more of them. The question is, how big is big enough for the review population, so that the average rating is trustworthy? And in fact, can we modify the average rating, adjusting it by the size of the review population? One common heuristic for such an adjustment is sketched at the end of this segment.

There is also the question of the performance metric, which is very difficult. First of all, it depends on what you are reviewing. Some items are very subjective, like movies on IMDb or Rotten Tomatoes. Some are very objective, for example electronics on Amazon. And some are in between, for example hotels on TripAdvisor or restaurants on OpenTable. It is also driven by whose needs we are thinking about: the reviewers', or those of the owner or seller of the products and services? And how do we quantify the notion of usefulness to these different parties? The lack of a universally agreed performance metric is one reason this problem is so challenging. For example, you may disagree with the root mean squared error, the L2 norm, from last lecture's Netflix challenge, but most people agree it is a reasonable proxy for actual performance. For Amazon's average rating, there is no such consensus yet.

Given all these challenges, it might seem very difficult to come up with a good review system, and that simply averaging the entries by different reviewers is not going to work out. However, there are also well-documented stories that show the potential of reviews, even with something as simple as averaging. A famous example is an experiment Galton ran in 1906 at a farmers' fair in Plymouth, United Kingdom. There were many people, most of them not farmers themselves, and there was an ox on a stage (that is my artistic rendering of the ox). The people standing next to the ox each played a game of guessing how heavy the ox was. It turns out 787 of them entered a guess on a piece of paper and put it into a bin. Galton thought it would be impossible for these people, who were not real experts at guessing ox weights, to come up with an accurate estimate. But he took all these numbers and computed a simple average: 1,197 pounds. The actual weight was 1,198 pounds, almost exactly right. This 99.9 percent accuracy is just amazing.
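Before returning to the Galton story, here is the promised sketch of one common heuristic for adjusting an average rating by the size of the review population: a Bayesian-style shrinkage average. This is only an illustration of the idea, not necessarily the method this course will develop, and the prior mean and pseudo-count below are assumed values chosen for the example.

```python
# Minimal sketch of one common count-aware adjustment: shrink each product's
# average toward a global prior mean, with a weight that grows with the number
# of reviews. The prior mean (3.5) and pseudo-count (25) are assumed values
# chosen for illustration, not parameters from the lecture.

def adjusted_rating(avg, n, prior_mean=3.5, pseudo_count=25):
    """Bayesian-style shrinkage: with few reviews the result is pulled toward
    the prior; with many reviews the raw average dominates."""
    return (pseudo_count * prior_mean + n * avg) / (pseudo_count + n)

examples = [("4.5 stars,    2 reviews", 4.5, 2),
            ("4.0 stars, 1000 reviews", 4.0, 1000),
            ("4.3 stars,  100 reviews", 4.3, 100),
            ("4.1 stars,  200 reviews", 4.1, 200)]

for label, avg, n in examples:
    print(f"{label}: adjusted = {adjusted_rating(avg, n):.2f}")
```

With these assumed numbers, the 4.5-star product with only two reviews drops to the bottom of the adjusted ranking, matching the intuition that 1,000 reviews at 4 stars are more trustworthy; different choices of prior mean and pseudo-count shift where the harder 4.3-versus-4.1 comparison lands.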
And this has been told and retold many times as a story illustrating the potential of the wisdom of crowds. So our question is: how can we generalize this observation? And can this simple averaging carry over to Amazon?
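One way to read Galton's result is through a simple statistical lens: if each guess equals the true value plus independent, unbiased noise, then the error of the crowd's average shrinks roughly like one over the square root of the crowd size. The simulation below is a minimal sketch of that reading; the Gaussian noise model and its standard deviation are assumptions, not data from the lecture, and the point is only to hint at why the size of the review population matters.

```python
# Minimal simulation of the wisdom-of-crowds effect behind Galton's story.
# Assumption (not from the lecture): each guess is the true weight plus
# independent zero-mean Gaussian noise. Under that assumption the error of
# the crowd's average shrinks roughly like 1/sqrt(N).

import random

TRUE_WEIGHT = 1198   # pounds, the ox's actual weight
NOISE_STD = 150      # assumed spread of individual guesses

def crowd_average_error(n_guessers, trials=2000):
    """Average absolute error of the crowd mean, over many simulated crowds."""
    total = 0.0
    for _ in range(trials):
        guesses = [random.gauss(TRUE_WEIGHT, NOISE_STD) for _ in range(n_guessers)]
        total += abs(sum(guesses) / n_guessers - TRUE_WEIGHT)
    return total / trials

for n in [2, 10, 100, 787]:
    print(f"{n:4d} guessers -> average error of the crowd mean ≈ "
          f"{crowd_average_error(n):.1f} pounds")
```

Whether Amazon reviews behave like independent, unbiased guesses is exactly the question the rest of this lecture takes up.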