Finally, let's take a look at a complete numerical example. In this case, just like in the Google PageRank lecture, we are constrained to what fits on a slide, and therefore the example does not represent the challenges of scale or sparsity. If you recall, scale and sparsity were among the big challenges of the Netflix Prize. So this only really serves the purpose of illustrating the five steps that we just outlined for the neighborhood method. Here is our starting point. We are given, as constants, the rating information. Each entry of this big matrix R is r_ui. And you see that this is ten users by five movies, with users labeled one up to ten and movies A to E. And it is 80 percent full, okay? So, 40 out of the 50 possible entries are actually known. And this is, again, very atypical. Generally speaking, only about one percent, instead of 80 percent, of the entries are filled in typical real data. But if we only had one percent here, then we wouldn't even have a single entry of known data for training. Now, ten of these entries are represented by bars; this means that the rating simply does not exist for that user-movie pair. And now we're going to hide ten out of the forty known data points as test data, and use the remaining 30 as training data. So, 30 samples for training and ten to test our prediction. Somewhat arbitrarily, we just pick these circled ten entries as test data, okay? So we pretend we don't know them first, train with the remaining 30 points, and see how we do on these ten points. First of all, the very lazy average r bar is, in this case, 3.83; in the Netflix case, 3.6. All right. Step number one, the baseline predictor. Let's construct the baseline predictor, which is r hat_ui equals r bar, 3.83 in this example, plus a bias b_u for each user plus a bias b_i for each movie. And this holds for all the (u, i) pairs. Okay? So, how many parameters have we got?
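To make the setup concrete, here is a minimal sketch of the starting point in NumPy. The actual entries of the slide's rating matrix are not reproduced in the lecture, so the numbers below are purely illustrative stand-ins; only the shape (ten users by five movies), the ten missing entries, and the use of a global average follow the lecture.

```python
import numpy as np

# Hypothetical 10-user x 5-movie rating matrix; np.nan marks a rating
# that simply does not exist (the bars on the slide). These values are
# illustrative only -- they are NOT the actual entries from the lecture.
np.random.seed(0)
R = np.random.randint(1, 6, size=(10, 5)).astype(float)
missing = [(0, 3), (1, 2), (2, 1), (3, 0), (4, 4),
           (5, 0), (6, 2), (7, 3), (8, 1), (9, 4)]  # ten nonexistent ratings
for u, i in missing:
    R[u, i] = np.nan

known = ~np.isnan(R)        # 40 known entries out of 50, i.e. 80 percent full
r_bar = np.nanmean(R)       # the "very lazy" global average (3.83 on the slide)
```

From the 40 known entries one would then circle ten as hidden test data and train on the remaining 30, exactly as the lecture describes.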
Ten user parameters and five movie parameters, so we've got fifteen parameters in this case for the baseline predictor. And we're going to find the right baseline predictor parameters b_u* and b_i* by solving a least squares problem. The least squares problem is, as you'll recall: minimize the squared two-norm of Ab minus c, and the solution satisfies A transpose A times b* equals A transpose c. So, what are the A, b, and c? The b is easy: that's our fifteen optimization variables, the parameters under training, b_1 up to b_10 and then b_A up to b_E. What about the matrix A and the vector c? We simply construct those like we did in the small two-movie, one-user example before. In particular, this A matrix is going to be a 30 by fifteen matrix: 30 because we have 30 training points, from r_1A down to r_10E, okay? So basically, all the entries here that are not circled in red are our training data. So we have 30 equations and fifteen variables, corresponding to b_1 up to b_E. And this matrix A is 0-1 binary: it is zero everywhere except at the positions for that particular user and that particular movie. Say, in this row, the first equation out of 30, we're talking about the training point r_1A, so we write a one where it corresponds to b_1, the first position, and at the eleventh position, which corresponds to b_A. And the c vector is simply the known training data, r_1A down to r_10E, with r bar subtracted. So, now that we have defined A, b, and c, we can use this equation as the solution to the least squares problem, and the resulting solution vectors b_u* and b_i* are just partially shown here: ten entries and five entries. And then you can stick these back into the predictor, with the b_u, b_i optimized by training, and you've got the baseline predictor, okay? That's step number one.
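Step one can be sketched in a few lines of NumPy. Since the slide's 30 actual training triples are not listed in the lecture, the `train` list below uses made-up (user, movie, rating) triples just to show the construction; the 0-1 design matrix A and the shifted vector c follow the lecture's description exactly.

```python
import numpy as np

r_bar = 3.83
# Hypothetical training triples (user 0-9, movie 0-4, rating);
# the real slide has 30 of these. Illustrative values only.
train = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 3), (2, 3, 2),
         (3, 4, 1), (4, 0, 5), (5, 2, 3), (6, 1, 2), (8, 0, 4)]

n_users, n_movies = 10, 5
A = np.zeros((len(train), n_users + n_movies))  # 0-1 design matrix
c = np.zeros(len(train))
for row, (u, i, r) in enumerate(train):
    A[row, u] = 1.0                 # picks out the user bias b_u
    A[row, n_users + i] = 1.0       # picks out the movie bias b_i
    c[row] = r - r_bar              # rating shifted by the global average

# Solve min ||A b - c||_2^2; lstsq copes with a rank-deficient A^T A.
b_star, *_ = np.linalg.lstsq(A, c, rcond=None)
b_user, b_movie = b_star[:n_users], b_star[n_users:]
```

The baseline prediction for any pair is then `r_bar + b_user[u] + b_movie[i]`, and the solution returned by `lstsq` satisfies the normal equations A transpose A b* = A transpose c from the lecture.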
Now, as a sanity check, we can verify; in the textbook, you can see all the entries. But here you can see that indeed b_1 is quite positive, and b_A is also quite positive. That means user one tends to give high scores, and movie A is well-liked. And just to make sure that's the case: indeed, user one tends to give high scores, and movie A is well-liked, okay? A lot of fours and fives. So the bias computation is indeed making some sense. Step number two is prediction with the baseline. The baseline predictor, using the optimized b_u, b_i and the average r bar, gives you this prediction, okay? We call this R hat. If you look at this R hat, remember, some of these entries are training data and some are test data. These, for example, are the test data, okay? And let's compare with the actual data r_ui for, say, five out of the ten. What do we have here? Five, three, two, two, five. Those are the actual r_ui's that we were hiding from ourselves. Are the predictions close to five, three, two, two, five? Reasonably, I guess. 4.62, if you round up, becomes five, but there is still a difference of 0.38. 3.49 versus three is getting a little off; half a star off is quite off, actually. And depending on how you treat this 3.49, you may round it to four and actually incur a significant error. Even without rounding, the difference of 0.49 is also quite big. This difference is huge: between two and 2.78. This difference is also big, and this difference is quite small. So we get this one quite right, this one quite right, this one sort of reasonable, but these two predictions are not that good, okay? One way you can deal with this is to use regularization to avoid overfitting, covered in the advanced material.
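The error metric being used informally here is the root mean squared error, RMSE, that the Netflix Prize scored on. A minimal sketch, using only the three (prediction, actual) pairs that the lecture quotes explicitly:

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between two equal-length vectors."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Three of the (prediction, actual) pairs quoted in the lecture:
preds = [4.62, 3.49, 2.78]
truth = [5.0, 3.0, 2.0]
err = rmse(preds, truth)
```

On a five-star scale, an RMSE above half a star on the hidden entries is substantial, which is why the 2.78-versus-2 miss matters so much.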
But in general, the baseline predictor doesn't leverage any collaborative filtering of the global structure, so we're not doing very well. But we're not done yet. So, at the end of step two, we subtract R hat from R, and we get the shifted matrix R tilde. I'm just writing down two columns of R tilde out of five, corresponding to movies B and C respectively, because we're going to use these to illustrate step number three, which is to construct the similarity matrix D, entry by entry d_ij. Let's look at a particular example, the pair of movies B and C, okay? Out of the ten users, how many users rated both movies B and C? Well, user one rated both. User two rated both. But then you've got other users who either have not rated one of the movies, or have rated it but only in the test set, which we can't use for training. So we have to go all the way down to user nine, where we actually have data of a user rating both movies B and C. So there are three users, one, two, and nine, who rated both movies B and C, and can therefore be used to compute the distance between movies B and C, the cosine similarity coefficient. So, we are looking at basically two vectors: (0.91, -0.11, -0.87) for movie B and (-0.9, 0.31, 0.33) for movie C. Now we compare these two vectors by looking at their inner product: 0.91 times -0.9, plus -0.11 times 0.31, plus -0.87 times 0.33, and then we have to normalize, dividing by the square root of 0.91 squared plus 0.11 squared plus 0.87 squared, times the square root of 0.9 squared plus 0.31 squared plus 0.33 squared. This is the formula for the cosine similarity coefficient d_BC, which turns out to be -0.84. So, notice these entries can be negative, because after shifting the ratings by the baseline predictor, they are sometimes positive and sometimes negative relative to the baseline predictor.
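The cosine similarity computation just walked through can be sketched as follows. The three-component vectors are the shifted ratings of users one, two, and nine as read off the lecture slide; since the slide values are rounded, the result may differ slightly from the quoted -0.84.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors of shifted ratings."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Shifted ratings of users 1, 2, 9 (the users who rated both movies)
# for movies B and C, as read off the slide.
b = [0.91, -0.11, -0.87]
c = [-0.90, 0.31, 0.33]
d_BC = cosine_similarity(b, c)   # strongly negative: B and C are dissimilar
```

Only the users who rated both movies in the training set contribute components to the two vectors; that is why the vectors here have length three rather than ten.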
And these similarity coefficients are sometimes negative because the two movies are dissimilar from each other. And these two are indeed quite dissimilar, okay? This -0.84 is quite close to -1. So we can say movies B and C are quite dissimilar. Now, we can do this for all the pairs of movies to come up with this entire matrix D here. The diagonal doesn't matter, so we just cross it out. The off-diagonals are symmetric, because of the way we defined it: d_BC equals d_CB, okay? And indeed, this entry is d_BC, or d_CB, same thing, -0.84 as we just calculated, and you simply follow the same procedure for the rest of the entries. That's the end of step three: we have constructed the similarity matrix. Now, step four is to define a neighborhood. Let's say the neighborhood size for any movie is just two, okay? So the top two movies with the largest absolute value |d_ij| relative to movie j will be included in the neighborhood. Let's take an example: say, r hat 3D. We want to estimate user three's rating of movie D. Remember, we did not have that actual information for training; 3D is one of the entries that we were hiding from ourselves, so that now we can predict it and see if we can get to the true value of two, okay? So, we want to find out what r hat 3D is, based on the neighborhood method. Well, first of all, there is the baseline predictive component of r hat 3D, which happens to be 2.78, and then there is the neighborhood prediction component. Now, what is that? Since we decided the neighborhood consists of two movies, then for movie D, what are the top two most similar or most dissimilar movies? We can see that movies A and B have the largest absolute d_ij terms relative to D; actually, both are very dissimilar to D, okay? So far, we have used the information about different people's tastes on these movies, A down to E, to quantify the relationships between movie D and movies A and B.
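Step four, picking the neighborhood, is just a top-k selection by absolute similarity. In the sketch below, the row of D for movie D is hypothetical except for d_DA = -0.97 and d_DB = -0.73, which are the values quoted in the lecture; the entries for C and E are made up for illustration.

```python
# Similarity row for movie D; only the A and B entries come from the
# lecture, the C and E entries are hypothetical.
d_row = {"A": -0.97, "B": -0.73, "C": 0.20, "E": 0.41}

def neighborhood(similarities, L=2):
    """Return the L movie keys with the largest absolute similarity."""
    return sorted(similarities, key=lambda m: abs(similarities[m]),
                  reverse=True)[:L]

nbrs = neighborhood(d_row)   # movies A and B for L = 2
```

Note that the selection uses the absolute value: a strongly negative similarity is just as informative as a strongly positive one, which is exactly why the very dissimilar movies A and B end up in D's neighborhood.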
And then we take that information, okay? The fact that A and D are very dissimilar by people's record, and that B and D are also quite different by people's record, is the reason that A and B should be incorporated as we predict anyone's liking or disliking of movie D. So, following our formula, we have -0.97 times the shifted rating of user three on movie A, plus -0.73 times the shifted rating of user three on movie B. The first happens to be 0.9 in the shifted domain, and the second happens to be -0.19 in the shifted domain. So, in particular, we see that the same user three, whose taste on movie D we're trying to predict, really liked movie A. But since A and D are very different kinds of movies, they're so dissimilar, that means she will very likely not like movie D: this is a big positive number multiplied by a big negative number. The overall impact is to make this r hat 3D a smaller number. And then we normalize by the sum of the absolute values of the d_ij's, which are 0.97 and 0.73. The result happens to be 2.35. In other words, the net effect of the neighborhood's help is to reduce the prediction from 2.78 to 2.35. The actual answer is two, and indeed, the baseline predictor overshoots two by too much, coming very close to three stars. Whereas with the help of the neighborhood, knowing that A and B are very dissimilar from movie D, and that user three rather likes A and doesn't dislike B too much, which actually tells us that she is really not going to like D a whole lot, we push 2.78 down to 2.35, which is indeed much closer to the ground truth of two. If you compute the squared error between two and 2.78, you can see it's much bigger than that between two and 2.35. So, now we can compute this for all of the ten entries that we want to predict. And it turns out that we get a much improved performance over the baseline predictor R hat.
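The whole neighborhood prediction for user three on movie D, using exactly the numbers from the lecture, can be checked in a few lines:

```python
# Numbers from the lecture: baseline prediction 2.78, similarities
# d_DA = -0.97 and d_DB = -0.73, and user three's shifted ratings
# of the neighbor movies, 0.9 for A and -0.19 for B.
baseline = 2.78
d = [-0.97, -0.73]          # similarities of D to its neighbors A, B
r_shift = [0.90, -0.19]     # user three's shifted ratings of A, B

numerator = sum(dj * rj for dj, rj in zip(d, r_shift))
denominator = sum(abs(dj) for dj in d)
r_hat_3D = baseline + numerator / denominator   # about 2.35
```

The large positive shifted rating on A, multiplied by the large negative similarity, is what drags the prediction down from 2.78 toward the true answer of two.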
With the baseline predictor, we get errors of 0.51 and 0.58, respectively, for the 30 known training data and the ten test data. But this improves with the neighborhood predictor R hat N, where the overall root mean squared error for the training data is as small as 0.32, and for the ten test data it is also smaller, at 0.54. Again, it's the ten test predictions that really matter in the Netflix Prize. You may think these are very small numbers, but again, on a scale of up to five stars, half a star is a big deal, and even improvement in the next digit is often going to make a big difference in what we recommend; that's what the ten percent goal of the Netflix challenge was trying to achieve. This represents a seven percent enhancement of the neighborhood method over the baseline method. And it turns out that if you use the neighborhood method, plus temporal dynamics, plus the regularization idea for least squares, plus a couple of other tricks, you can easily get to almost nine percent, meaning you could actually have won the Progress Prize for the entire first year with that much enhancement over Cinematch. But getting the last little-over-one percent, to reach the ten percent target of enhancement over Cinematch, took two more years of effort by many smart people around the world: thousands of teams, tens of thousands of different submissions, and two years' time to gain that last one and a half percent. That's why a lot of people believe that had Netflix set the target at eleven or twelve percent, it might not even have been achievable using the known data; we will never know the exact answer. But what you have seen so far, with a few tricks, would have gotten you quite far in that prize, if you had been participating back in 2006. So, that's the end of the longest lecture in the entire course. We looked at the Netflix Prize, a very interesting scientific story, as a special case of recommendation systems.
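The roughly seven percent figure quoted for the test data is just the relative RMSE improvement of the neighborhood predictor over the baseline, which is easy to verify:

```python
# Test-set RMSEs quoted in the lecture for the ten hidden entries.
baseline_rmse = 0.58    # baseline predictor
neighbor_rmse = 0.54    # neighborhood predictor

improvement = 100 * (baseline_rmse - neighbor_rmse) / baseline_rmse
# improvement is about 6.9, i.e. roughly the seven percent quoted
```

The same arithmetic, applied to Cinematch's score on the real Netflix data, is how the ten percent prize target was measured.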
In particular, collaborative filtering can leverage the similarities among users or, as in our example, among movies, to enhance the accuracy of prediction, measured, say, by root mean squared error. And if you do minimize root mean squared error, which you don't have to, as we saw, this often leads to a least squares problem, which is a special case of convex optimization that we'll see over and over again in the course. Today was just a very brief terminology introduction to convexity. So, this lecture is long in part because we're not only talking about collaborative filtering as a learning idea, but also starting to introduce convex and nonlinear optimization. It's also long in part because the Netflix Prize is a very interesting story, and we showed the timeline and some details of that competition. Now, this special case of recommendation systems will lead us to the next lecture, Lecture five, on Amazon's recommendation system, in a different context and with a different question there. So, see you.