So first look at the output which we already got. What is the output telling us? It is telling us what the fit is. First of all it gives you the model: the intercept is approximately 23.78, minus 0.387 times crim. So this is your equation of the line, which is what we expected, roughly 25 and minus a half. It also gives you a t-value and a p-value. Remember: the larger the t-value in an absolute sense, and the smaller the p-value, the more you can trust that these coefficients are different from zero. For example, a t-value of minus eight and a p-value of 10 to the power of minus 15 (that is what e-15 means) say that if the true slope were really zero, getting an estimate like minus 0.387 would be very, very unlikely. So you can say with reasonable confidence that the line is not flat: the slope is not zero, and the slope is negative. The second thing the output produces is something you might already know from your statistics courses, called R-squared. R-squared is basically the fraction of the variability in the data that is explained, and that is why we call it explanatory. What fraction of the variability is explained by the crime rate? It is saying: I am able to explain 16 percent of the variability in the data. The same thing is reported at the bottom, in a table fancifully called the analysis of variance. What it really means is that the total variability is the sum of these two pieces: 4,557 plus 23,366, so the total variability is about 27,900. What do I mean by variability? I mean the deviation from the mean in the response variable. That is, you take each response value, subtract the mean, square it, and add it all up; it is that variability around the mean. So that variability has a magnitude of about 27,900. Our model is able to explain 4,557 of it, and it is not able to explain 23,366 of it.
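The decomposition described above can be sketched in a few lines. This is a minimal illustration using the sums of squares quoted in the lecture (4,557 explained and 23,366 unexplained); it simply reproduces the arithmetic, not the regression itself.

```python
# Sketch of the analysis-of-variance decomposition from the lecture.
# The two sums of squares are the values read off the regression output.

ss_explained = 4557   # variability the crim model captures (regression SS)
ss_residual = 23366   # variability left unexplained (residual SS)

# Total variability around the mean is the sum of the two pieces.
ss_total = ss_explained + ss_residual

# R-squared is the explained fraction of that total.
r_squared = ss_explained / ss_total

print(ss_total)              # roughly the 27,900 quoted
print(round(r_squared, 4))   # about 16 percent
```

Note that the ratio comes out to about 0.1632, which matches the 16.32 percent R-squared quoted later for the training data.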
So therefore, we are able to explain 16 percent. Where do I get 16 percent? I get it by dividing 4,557 by about 27,900, and that is what it means. So what it is basically saying is that the line is able to capture just 16 percent of the variability. Is that good? Some people will say it is good enough: if I am buying lots of houses, this is great, because on average I am doing better than not explaining the value of the house at all. But if you are going to buy one house, you will say it is bad; I need more information to buy this house. So good or bad depends on what you want to do with it. Another useful way of evaluating this model: if you notice, I have clicked "Evaluate" there and selected the radio button that says Predicted Versus Observed. You will get a nice graph when I hit Execute, and that will be on the validation set; let's see what it does. So if you go to RStudio, here it is, and if you zoom in, it shows two lines, and I can make this bigger too, so it is easy to see. What are these? On the x-axis, we have the observed median prices. On the y-axis, we have the predicted median prices. The points you see show, for each observed value, what the predicted value was. For example, an observed value of 30 has a predicted value of maybe 27 or 28. There are two lines here; let me first explain the solid line going this way. The solid line is nothing but a line drawn through the points you see on the screen: a simple regression between the observed and the predicted values. You can see that even though it fits quite well, there are points that do not fall on the line. Ideally, you would like the points to fall evenly on either side of this line, indicating there is no bias. But what this plot seems to be saying is, "I am not able to predict very small values or very big values well." So that is one line. What is the other line?
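Producing the predicted values that go into such a plot is straightforward once you have the fitted line. Here is a minimal sketch using the coefficients from the lecture (23.78 and minus 0.387); the crim/observed pairs are made-up illustrative numbers, not the actual Boston housing data.

```python
# Fitted line from the lecture: median value ~ 23.78 - 0.387 * crim.
intercept, slope = 23.78, -0.387

def predict(crim):
    """Predicted median house value for a given crime rate."""
    return intercept + slope * crim

# Hypothetical (crim, observed value) pairs, for illustration only.
pairs = [(0.1, 30.0), (5.0, 22.0), (20.0, 12.0)]

# Each point on the predicted-versus-observed plot is one such pair.
for crim, observed in pairs:
    print(observed, round(predict(crim), 2))
```

Plotting predicted against observed for every validation point, plus the regression line through those points, gives exactly the graph being discussed.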
This line is what we would ideally like under a perfect situation: the predicted value equals the observed value. So ideally, we would have liked the blue line to lie along the dotted line, and the closer these two lines are, the better the fit. On top of that, if you look at the R-squared value, you can see an R-squared of 13.02. What does that R-squared say? It is the R-squared of this fit. In fact, if you go back and look at the predicted versus observed plot on the training data, the R-squared you get is 16.32, which is exactly the R-squared you got in your regression. So the R-squared of the line fitted to the predicted versus observed values gives you a measure of how well you are able to capture the variation in the data. What the 16.32 percent says is how closely the fitted line agrees with the 45-degree line, and it says only about 16 percent. So this gives you another way of understanding what this R-squared means, and it is very useful in multiple regression, as we will see in a minute. Last thing, and I will not spend too long on it. After you run the model, which you have run already, there is a button there that says Plot, which we could spend days talking about. When you plot, it gives you diagnostics we could spend a lifetime understanding. Basically, these diagnostics look at the residuals. One of the key questions we ask is: are they normally distributed? If you look at this plot, if they were normally distributed, the points you see there would also fall along the straight line, which they do not. At the end of this lecture I will refer you to a wonderful book that says more about this. Our purpose is more from a modeling perspective, and I will leave it to you to analyze the quality of the fit in different ways. So going back, what we have done so far is look at ways of examining the quality of the fit.
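The claim that the predicted-versus-observed fit on the training data reproduces the regression's R-squared rests on a standard identity: for simple linear regression with an intercept, R-squared equals the squared correlation between observed and predicted values. A sketch verifying this numerically, on synthetic stand-in data (not the Boston housing values):

```python
import math
import random

# Synthetic data loosely shaped like the lecture's example.
random.seed(0)
x = [random.uniform(0, 20) for _ in range(200)]
y = [23.78 - 0.387 * xi + random.gauss(0, 8) for xi in x]

def mean(v):
    return sum(v) / len(v)

# Ordinary least squares fit of y on x.
mx, my = mean(x), mean(y)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx
pred = [intercept + slope * xi for xi in x]

# R-squared from the variance decomposition...
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
ss_tot = sum((yi - my) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot

# ...equals the squared correlation of observed with predicted.
def corr(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / math.sqrt(sum((ai - ma) ** 2 for ai in a)
                           * sum((bi - mb) ** 2 for bi in b))

print(abs(r2 - corr(y, pred) ** 2) < 1e-6)  # the two agree
```

On a held-out validation set the two numbers need not coincide, which is why the validation plot reports a different (here, lower) value than the training fit.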
How much of the variability in the data am I able to explain? What is the line? How confident am I about the coefficients, the 23.78 and the minus 0.387? We also visually examined the fit by looking at the predicted versus the observed values and seeing whether the two lines match. As you get more experience and do much more of this, you can actually stare at the residuals and check whether they show no pattern, and that does become more important as we go forward. To close out the segment, we could spend hours running the same regression with other predictors. We could run the median value against the crime rate. We could run the median value against the level of industrialization. We could run the median value against the tax. If you run these, you will get the estimates. In a classroom, I would ask you which one you like. Basically, your answer should be that you don't know, because some people think R-squared is everything, right? Not necessarily. But just looking at it, you might think the degree of industrialization explains a lot of the variation, 23 percent of it. You might ask me: now, why don't you put them all together? So in the next segment, we are going to think about how to improve these individual models. We started with a very simple model; what are the things we can do to improve it?
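The comparison described above, fitting the response against each candidate predictor on its own and lining up the R-squared values, can be sketched as follows. The data here are synthetic stand-ins for crim, indus, and tax, so the printed numbers are illustrative only, not the 16 and 23 percent from the lecture.

```python
import random

# Synthetic stand-ins for the three predictors and the response.
random.seed(1)
n = 300
crim = [random.uniform(0, 20) for _ in range(n)]
indus = [random.uniform(0, 25) for _ in range(n)]
tax = [random.uniform(200, 700) for _ in range(n)]
medv = [30 - 0.4 * c - 0.5 * i - 0.01 * t + random.gauss(0, 6)
        for c, i, t in zip(crim, indus, tax)]

def r_squared(x, y):
    """R-squared of a simple regression of y on a single predictor x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# One single-predictor model per candidate; compare their R-squared.
for name, x in [("crim", crim), ("indus", indus), ("tax", tax)]:
    print(name, round(r_squared(x, medv), 3))
```

As the lecture warns, the model with the highest single-predictor R-squared is not automatically the one to pick, and combining the predictors is exactly what the next segment takes up.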