So, we're going to go into the neighborhood method now. But before we can do that, we first need to go through what's called a baseline predictor. And before we can go through the baseline predictor, we want to talk about the general notion of a parameterized model. We're going to see parameterized models in a very different context in the next chapter, with Bayesian analysis. In general, we want to build a model that is some mathematical representation of the problem we want to solve. Sometimes this model carries with it a bunch of parameters, and we'll see a lot of parameters today. These parameterized models first go through a so-called training phase, where you have some observations but you also have the ground truth. You take the observations, make a prediction, and compare it with the ground truth. The difference is something you want to minimize by tuning the model parameters. After this training step, you have optimized model parameters. Then you go into the actual action, the prediction step, where you take these parameters, optimized during the training phase with the known ground truth, and apply them to new observations to make predictions. Sometimes other people have the ground truth, like Netflix had the ground truth for the Netflix Prize, but you don't. So: training followed by prediction. The training phase tunes your parameters, just like a voice recognition system on your computer, phone, or car. You first train it by speaking certain known sentences. These sentences are the ground truth; the computer takes in the observation, which is the recording of your voice, and trains the parameters before going into the actual action of making predictions for you in the future.

All right. Having said that, let's go into the baseline predictor. This is step number one before we can afford to go to step number two, the neighborhood predictor. The baseline predictor does not utilize the global structure of the user-movie ratings. We're going to start with the baseline of the baseline, the laziest predictor you can think of: the average predictor. Let's call it r bar. r bar is simply the sum of the r_ui's, over all the (u, i) pairs where there is a rating, divided by the total number of such ratings. That's it, simply the average. It turns out that for the Netflix Prize training data, about 100 million data points, r bar is 3.6. That is, over all these 100 million ratings, the average is 3.6. And we could say, well, that is our prediction: we predict that every user-movie interaction will be 3.6 stars in the future. Clearly, this is very lazy, and we can do a lot better. For example, we know that certain movies are better and certain reviewers are harsher. So let's say the predictor r hat is r bar plus some bias terms: a bias term b_u for the user, and a bias term b_i for the movie. So r hat ui is the average predictor, the laziest one, plus a bias term for that user and a bias term for that movie. Now we can look at the root mean squared error. We can also drop the square root, the "root" part; just looking at the mean squared error doesn't change the answer for the optimal parameters. So we look at the sum of the squared differences between r_ui and r hat ui.
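To make the training setup concrete, here is a minimal sketch in Python (not from the lecture): it computes the average predictor r bar from a toy set of ratings and evaluates the squared-error objective for some placeholder bias values. The names `ratings`, `b_user`, `b_movie`, and all the numbers are illustrative assumptions.

```python
import numpy as np

# Toy ratings: (user, movie, rating) triples -- illustrative values only.
ratings = [(1, 'A', 5.0), (1, 'B', 2.0), (2, 'A', 4.0)]

# The laziest predictor: r_bar, the average over all observed ratings.
r_bar = np.mean([r for (_, _, r) in ratings])

# Baseline predictor: r_hat_ui = r_bar + b_u + b_i, one bias per user and per movie.
# These bias values are placeholders; training picks them to minimize the error.
b_user = {1: 0.5, 2: -0.5}
b_movie = {'A': 0.3, 'B': -0.3}

def predict(u, i):
    return r_bar + b_user[u] + b_movie[i]

# Training objective: sum of squared differences between ground truth and prediction.
squared_error = sum((r - predict(u, i)) ** 2 for (u, i, r) in ratings)
print(r_bar, squared_error)
```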
We can also drop the step of dividing by the total number of ratings, because that's just a constant; it doesn't change the answer. So this is our objective function for the minimization: minimize, over the set of b_u's and b_i's, the sum of the squared differences between the ground truth r_ui, which you have in the training phase, and the prediction r hat ui, where r hat ui takes the formula above. Okay? This is a function involving squares of the b_u and b_i terms. r bar is not a variable, because it is just the average, say 3.6 for Netflix; the b_u's and b_i's are your variables. That's an optimization problem: you've got an objective function which is a sum of squares. Definitely not a linear program anymore, but we'll see that it is a nice nonlinear program. And there are no constraints, okay? The variables are the b_u's and b_i's, and the given constants are the known truths, the r_ui's, and the simple scalar r bar. That describes the baseline predictor training optimization.

So let's take a look at a particular example of this. We can drop the R, the root, because we are effectively minimizing the mean squared error rather than its square root. Or rather the total squared error, since we're not dividing by the number of ratings either. Let's say we have just two movies, A and B, and just one user, user 1. This is a ridiculously small example. In fact, it's so small that we have more parameters than data points: three parameters, b_A and b_B for the bias terms of the two movies, and b_1 for the bias term of this single user, against only two known ratings from the ground truth, r_1A and r_1B. What is our minimization problem? It is to minimize the sum of squares of the differences between the prediction and the ground truth. Since we take the square, it doesn't matter in which order we write the difference; say we write r hat minus r. The first prediction is r hat 1A = b_1 + b_A + r bar, and the first error term is (r hat 1A minus the actual r_1A) squared. The other error term is (b_1 + b_B + r bar minus the actual r_1B) squared, which is (r hat 1B minus r_1B) squared. We take the sum of the squares of these differences, and that is the objective function. We want to minimize this objective function, subject to no constraints, over the three optimization variables, namely our three model parameters b_1, b_A, b_B.

So the question is: is this an easy optimization or not? It's clearly not linear, but it is an unconstrained quadratic minimization over the variables. It turns out it is a very easy problem, a so-called least squares problem, okay? We are minimizing a so-called convex quadratic function, and we'll come back to convexity in a minute, over no constraints, and this is a very nice nonlinear optimization. In general, a least squares problem is the minimization of the l2 norm of a certain vector. First of all, what is the l2 norm? Given a vector x, one standard way to measure its size, the l2 norm, is the square root of the sum of the squared entries. We often don't like the square root, so we take the square of the l2 norm, which is simply the sum of x_i squared over all i.
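As a concrete illustration (not from the lecture), here is a short sketch of that two-movie, one-user objective and of the squared l2 norm; the rating values r_1A, r_1B, and r_bar are made-up numbers.

```python
import numpy as np

# Made-up ground-truth ratings for the tiny example: one user, two movies A and B.
r_1A, r_1B, r_bar = 4.0, 2.0, 3.6

# Objective: (b1 + bA + r_bar - r_1A)^2 + (b1 + bB + r_bar - r_1B)^2,
# to be minimized over the three bias parameters b1, bA, bB.
def objective(b1, bA, bB):
    return (b1 + bA + r_bar - r_1A) ** 2 + (b1 + bB + r_bar - r_1B) ** 2

# The squared l2 norm of a vector x is just the sum of its squared entries.
def l2_norm_squared(x):
    return float(np.dot(x, x))

print(objective(0.0, 0.0, 0.0))               # error of the lazy average predictor alone
print(l2_norm_squared(np.array([3.0, 4.0])))  # 25.0
```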
It's a very standard way to measure the length of a vector, and we would like to minimize the square of the l2 norm, the length of the vector Ab minus c, where b is our variable, A is a known matrix, and c is a known vector. If A is n by m, then b is m by one, a column vector, and c is n by one, another column vector. In the particular case we just saw, we can write the minimization in this standard form. It looks a little strange at first, but you can check that it works. The matrix A has rows (1, 1, 0) and (1, 0, 1), the variable vector is b = (b_1, b_A, b_B), and the constant vector is c = (r_1A minus r bar, r_1B minus r bar). This is our A, this is our b vector, this is our c vector, just to conform to the standard expression that we minimize in the least squares problem. You can easily check that minimizing this l2 norm squared is the same as minimizing the earlier expression. For example, the first entry of Ab minus c is b_1 + b_A - r_1A + r bar, and that's exactly the first error term; you can write out the other term the same way. Then, just by the definition of the square of the l2 norm, we recover exactly the same objective function. So, in the baseline predictor problem, we are solving a least squares problem.

So how do we solve a least squares problem in general? Well, in general, we want to minimize the l2 norm squared of Ab minus c, which can also be written as (Ab minus c) transpose times (Ab minus c): the vector in inner product with itself. If you're not comfortable with the matrix notation, you can write this out in scalar form: take a little two-by-two matrix A with entries one, two, three, four, a little vector (b_1, b_2), and a little vector (c_1, c_2), and you'll see that the derivation coming up is no mystery. But to save time, let's use the compact matrix notation; it works much like the scalar operations. Expanding, this is (Ab) transpose (Ab), minus (Ab) transpose c, minus c transpose (Ab), which is the same scalar, plus c transpose c. By the transpose rule, this can in turn be written as b transpose A transpose A b, minus two times b transpose A transpose c, plus c transpose c. Okay? Now, to minimize this expression over the variable b, for a given A matrix and c vector, you can just take the derivative with respect to b. It's just like taking the derivative, with respect to x, of a quadratic scalar function: say x squared plus 2x minus three; take the derivative and you get 2x plus two, right? Here, we take the derivative with respect to b and get the following: two, because there are two b's in the quadratic term, times A transpose A times b, minus two, because of that factor of two, times A transpose c. All right. Now we set this derivative to zero. Why does setting the derivative to zero give the right answer? We'll come back to that in a minute; it sounds intuitive, but we'll soon see there's a convexity argument behind it. For now, let's just set it to zero. That means A transpose A times b equals A transpose c. This is the solution to our least squares problem: if you want to solve the problem by picking the right b, the right b is the answer to this equation. And what kind of equation is it in b? It is a linear equation: A transpose A, which is just another matrix, times b, equals another vector, A transpose c.
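Here is a hedged sketch of solving the tiny example in this standard least-squares form with NumPy. Note that in this particular toy problem there are more parameters than ratings, so A transpose A is singular and the minimizer is not unique; `np.linalg.lstsq` returns one valid minimizer, while solving the normal equations directly works whenever A transpose A is invertible. The numerical values are again illustrative.

```python
import numpy as np

# Standard least-squares form of the tiny example: minimize ||A b - c||^2
# over b = (b1, bA, bB). The rating values are illustrative.
r_1A, r_1B, r_bar = 4.0, 2.0, 3.6
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
c = np.array([r_1A - r_bar, r_1B - r_bar])

# The minimizers satisfy the normal equations A^T A b = A^T c. Here A^T A is
# singular (more parameters than ratings), so lstsq returns one minimizer,
# the minimum-norm one.
b_star, residual, rank, _ = np.linalg.lstsq(A, c, rcond=None)
print(b_star)

# When A^T A is invertible, you can solve the normal equations directly:
#   b_star = np.linalg.solve(A.T @ A, A.T @ c)
```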
So, as long as we can solve this linear system of equations, we get the right answer, the b star that solves the least squares problem. This is a general fact: minimizing a convex quadratic function boils down to solving a linear equation, because taking the derivative once brings the quadratic down to something linear. So everything we've been talking about eventually boils down to a simple least squares solution. Let's look at it one more time. Our baseline predictor is the lazy r bar plus bias terms for the user and the movie. You look at the difference between the actual r and this r hat in the root mean squared error sense, which means you're looking at the square of an l2 norm, and you minimize that function over the b_u's and b_i's. It turns out that picking the right b_u, b_i, and thereby completing this baseline model, boils down to solving this kind of problem, which in this format is called least squares, and the least squares solution is simply the solution to that linear equation. We'll go through a complete numerical example towards the end of the lecture, but this is a quick summary of where we have been, okay?

Now, you may wonder: this is kind of complicated. Why don't I just pick b_u to be the average rating of this user across all movies, and b_i to be the average rating of this movie across all users? That turns out to be a reasonable guess, but it does not necessarily minimize our error metric, the root mean squared error. All we're doing is saying: if you believe in this model, stick it into the error term. Then you get a least squares problem, which can be efficiently solved by solving a linear equation, just like Google PageRank solves a linear equation, though over there it's a very different set of equations. Once you get the optimal b_u star and b_i star, stick them back into the model and you get the predictor that minimizes RMSE.

It turns out that's almost a good idea, except you need something else. We often encounter what's called overfitting. This is in the advanced material of the textbook, but basically it says that you can add so many parameters, all these b_u's and b_i's, and fine-tune them so well to the ground truth in the training data that the model loses flexibility and predictive power for the actual prediction. So we want to somehow limit the weight of the parameters. We don't want to rely too much on hindsight perfection; we want to leave some room for foresight, because it's ultimately foresight that matters in predicting or recommending. One way to contain the weight of the parameters is to also minimize their size, say the l2 norm of the b_u's formed as a vector and the b_i's formed as a vector. That's called regularized least squares. If we use the squared l2 norm as the penalty, it is still a least squares problem, just a bigger one, but it does discourage too much reliance on parameters that are too finely tuned to hindsight.

So, after the baseline predictor, what we have is the following. We've got the actual r_ui, and we've got the baseline predictor r hat ui, which is r bar, the lazy average, plus the optimized b_u and the optimized b_i from solving the least squares problem. We call the difference, r_ui minus r hat ui, a shifted rating, r tilde ui, just a shorthand notation. In other words, in this r tilde ui we have removed the impact of the average and the impact of the per-user and per-movie biases.
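A minimal sketch of the regularized version, assuming the common quadratic (ridge-style) penalty, lambda times the squared l2 norm of the parameter vector; the function name and the choice of lambda here are illustrative, not the lecture's notation.

```python
import numpy as np

# Regularized least squares (ridge-style sketch): minimize
#   ||A b - c||^2 + lam * ||b||^2
# to keep the bias parameters small and discourage overfitting. Setting the
# derivative to zero gives the linear equation (A^T A + lam * I) b = A^T c,
# whose matrix is invertible for any lam > 0.
def regularized_least_squares(A, c, lam):
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ c)

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
c = np.array([0.4, -1.6])
print(regularized_least_squares(A, c, lam=0.1))

# After training, the shifted rating fed into the neighborhood method is
#   r_tilde_ui = r_ui - (r_bar + b_u_star + b_i_star).
```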
And we're going to take this as the starting point for our neighborhood predictor, which looks at similarities across movies and across users. But before we do that, a very quick detour, five minutes, on the notion of convex optimization. We've seen linear programming, which is a special case of convex optimization, where we minimize a linear objective function over linear constraints: linear equations and inequalities. Pictorially, that means the linear inequalities cut out the space of the variables into something called a polytope, with sharp corners. And then you slide a bar, a straight line, because it's a linear function representing your objective, across this constraint set; imagine an airplane flying above the constraint set, okay? And we want to identify the minimum value of the objective function over this kind of constraint set. It turns out what's really important is that your objective function is convex, like a convex quadratic function such as x squared, over a constraint set that is also convex, like an oval or a circle. It's convexity that marks the watershed between easy and hard optimization problems.

So, what is convex optimization? It refers to minimizing a convex objective function, and we'll have to define what a convex function is, subject to a convex constraint set. To be computationally useful, this should be a convex constraint set that can be easily represented by upper-bound inequalities on convex functions. So we have to define what a convex set is and what a convex function is. We will encounter other forms of convex optimization in network economics, in Internet pricing, in congestion control, in routing, and in many other topics; this is our first brief introduction to the subject. We'll see in this and future lectures that convex optimization is easy in both theory and in practice. But for now we just want three more minutes to introduce the basic notation before we bring it back, just in time, for the applications in later lectures.

So, what's a convex set? What's a convex function? You could take a whole course on nothing but convex analysis, and there are such courses in math departments. But for today's purpose, we just need to specify them with some pictures. A convex set is basically a set where you can pick any two elements in the set, draw the line segment in between, and the entire line segment also lies in the set. Then we call it a convex set. You might say, well, this sounds like all sets are convex. Well, no. For some sets, like this kidney-shaped one, you can pick two points in the set such that part of their line segment does not lie inside the set; that is a non-convex set. So, generally speaking, we say a set C is convex if, for any two points a and b in C, the entire line segment between a and b, which can be represented as theta times a plus (one minus theta) times b for theta between zero and one (for example, when theta is one half, this is the midpoint between a and b), also lies in C. For all such theta, that's the key phrase: not for some theta, but for all theta. This is the mathematical representation of the line segment between points a and b, where a and b are, in general, vectors in n-dimensional space. So, pictorially, a convex set is just a set where the whole line segment lies inside.
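As a numerical illustration of this definition (a sanity check on sampled points, not a proof), the following sketch tests whether the segment between two points of the unit disc stays inside the disc; the function names are made up for this example.

```python
import numpy as np

# A set C is convex if, for any a, b in C, every point theta*a + (1-theta)*b
# with 0 <= theta <= 1 is also in C. Here C is the unit disc, given by a
# membership test.
def in_unit_disc(x):
    return np.linalg.norm(x) <= 1.0

def segment_stays_inside(a, b, membership, num_points=100):
    thetas = np.linspace(0.0, 1.0, num_points)
    return all(membership(theta * a + (1 - theta) * b) for theta in thetas)

a = np.array([0.9, 0.0])
b = np.array([-0.5, 0.7])
print(segment_stays_inside(a, b, in_unit_disc))  # True: the disc is convex
```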
It turns out the most important geometric property of convex sets is that any two convex sets that don't touch each other can be separated by a straight line, so that one set lies on one side and the other lies on the other side. You may ask, would this be true for non-convex sets too? Not really. For example, take one convex set and one non-convex set: they may not touch each other, and yet you cannot separate them with a straight line. Of course, you can separate them one way or another, since they don't touch, but you can't do it with a straight line. This turns out to be the most useful property of convex sets, but we don't have time to go through its implications in this lecture.

Now, having specified what a convex set is, let's define a convex function. A convex function, intuitively, is a function that curves upward, okay? Some functions, for example, curve upward before they curve downward. We know that if you look at a function f, its first derivative tells us about monotonicity: is it an increasing or a decreasing function? And the second derivative tells us whether it curves upward or downward. Say the function is a single-variable function f of a scalar variable x, and say it's quadratic, since we're talking about quadratics today: x squared, plus a few other terms, say plus 2x plus three. If you plot it, it curves upward. Another function, minus x squared plus 2x plus three, if you plot it, curves downward. Both are quadratic functions; one is convex, the other is not. In fact, if you flip the second one, minus f of x is convex. We call a function whose negative is convex a concave function. So the first is a convex function and the second is a concave function. Of course, it doesn't have to be quadratic, and it doesn't have to be univariate; it could be a multivariate function. But the idea is that if it curves upward, it's convex; downward, it's concave. Some functions are first convex and then concave, like the sigmoidal function we'll encounter in social network influence models later. It starts out convex, and at some point, the so-called inflection point, it starts to bend downward. In the first part, the second derivative is positive; in the second part, the second derivative is negative. One part curves up and is convex, the other curves down and is concave. The whole function, therefore, is neither convex nor concave.

Now, what kind of function is both convex and concave? You'd need the second derivative to be both non-negative and non-positive. The only possibility is that it is exactly zero, meaning the function has no curvature. No curvature means it is a straight line; it could be pointing up or down depending on the first derivative, but the second derivative is zero. In other words, a linear, or affine, function is the only kind of function that is both convex and concave.

Now, these pictures are drawn in 2D, so they're talking about a single variable. For a multivariable function, the second derivative is slightly more involved. Recall your multivariable calculus course; assume you still remember it, or can quickly brush up. Take a function of just two variables (once you get the hang of that, many variables work the same way) that is quadratic: x squared plus xy plus y squared, okay? x and y are each scalar variables, but f is a function of both.
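Anticipating the Hessian test explained in the next paragraph, here is a minimal numerical sketch (using NumPy) of the second-derivative checks for these examples; the finite-difference helper is just an illustration, not part of the lecture.

```python
import numpy as np

# Second-derivative signs for the two scalar quadratics above, via central differences.
def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

print(second_derivative(lambda x: x**2 + 2*x + 3, 0.0))    # ~ +2 everywhere: convex
print(second_derivative(lambda x: -x**2 + 2*x + 3, 0.0))   # ~ -2 everywhere: concave

# For f(x, y) = x^2 + x*y + y^2, the second derivative is the Hessian matrix
# (explained next); for a quadratic it is constant. Convexity corresponds to
# all of its eigenvalues being non-negative.
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(np.linalg.eigvalsh(H))   # [1. 3.] -> both non-negative, so f is convex
```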
Now, the second derivative of a multivariable function is not a simple expression anymore. It's not a number, it's a matrix, built from partial derivatives: the function differentiated twice with respect to x, twice with respect to y, with respect to x and then y, and with respect to y and then x. That is the matrix. In general, given a function f of variables x1, x2, up to xn, the partial derivative with respect to xi and then with respect to xj is the ij-th entry of a matrix, and that matrix is called the Hessian matrix, or in short, the Hessian of the function. In this particular case, it is very simple: take the derivative twice with respect to x, you get two; twice with respect to y, you get two; with respect to x and then y, or y and then x, which is the same thing for us, you get one. So this matrix, two, one, one, two, is the Hessian of this particular function, and the general definition is right here: the ij-th entry of the matrix is the partial derivative of f with respect to xi and then xj.

So, how do we make sense of saying that a matrix is "positive"? It's not that the entries of the matrix are positive; it's about whether the matrix's eigenvalues are all non-negative. You can check that this matrix's eigenvalues are indeed non-negative, and therefore this function is a convex function. In general, we say a function is convex if it satisfies the second-order test: the Hessian matrix just defined is so-called positive semidefinite, and a matrix being positive semidefinite means all its eigenvalues are non-negative.

Now, you may wonder: what if my function does not have a second derivative, so the Hessian matrix doesn't exist? Some functions are not smooth enough to take derivatives twice, or even once. Well, it turns out that's fine too. You can have a function with a sharp corner, like the absolute value function; at that point there is no second derivative, but we can still define whether the function is convex or not, and that one is still a convex function. However, since all our functions throughout the entire course are smooth enough that the Hessian matrix exists, we will just follow this definition, okay? So, minimizing a convex function, which we now know how to detect, subject to a convex constraint set, is a convex optimization, and least squares is a special case, because least squares minimizes a convex quadratic function subject to no constraints. So, finally, we're done with the baseline predictor, and now we can move on to the proper subject of today's lecture: defining the neighborhood predictor.