So we now come to a very interesting point in our investigation: how can I improve a model, not just this one? Obviously there are many ways, and a few of them are not covered here; we will talk about those soon. But immediately there are some easy things I can do, because I have a lot of data. I can ask, "Can I add some variables?" Or, of course, can I remove some? Another question I can ask is, can I transform some variables? Maybe the relationship is not linear, maybe the residuals are starting to look curved, and so maybe a quadratic fit works. Or maybe it's not a linear model at all; maybe the line I want to draw isn't a straight line. So I can change the nature of the fit. Well, let's leave options two and three out for now, because adding variables is the easy thing to do, so let's do that first and see what happens to the fit. What we're trying to show is what is involved as we add more and more features into a model: how does it help you explain better, and what are the pitfalls of doing so? So what I'm going to do is try to fit a model of this kind. We've fitted three individual models: median value as a function of crime, median value as a function of industrialization, and median value as a function of tax. How can I put it all together? The way I would do it, first of all, is to just write the model. What I am going to say is: median value is b0 plus b1 times crime, plus b2 times industrialization, plus b3 times tax. As before, there is also an error term, because we know we can't predict these things exactly. For a given data point I can write: median value in zone i is equal to b0, plus b1 times crime in zone i, plus b2 times the level of industrialization in zone i, plus b3 times the tax in zone i, plus error_i. So again, I want you to think of it this way: there is a model, and then you apply it to data.
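The model above can be sketched in code. This is a minimal illustration with made-up numbers, not the Boston housing data: it fits medv_i = b0 + b1*crime_i + b2*indus_i + b3*tax_i by least squares, solving the normal equations directly.

```python
# Minimal least-squares sketch of: medv = b0 + b1*crime + b2*indus + b3*tax + error
# The data rows below are hypothetical, for illustration only.

def fit_ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    # Augmented matrix, then elimination with partial pivoting
    A = [XtX[a][:] + [Xty[a]] for a in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= f * A[col][c]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (A[r][p] - sum(A[r][c] * b[c] for c in range(r + 1, p))) / A[r][r]
    return b

# Hypothetical zones: (crime, industrialization, tax) -> median value ($1000s)
rows = [(0.1, 2.0, 300), (0.5, 7.0, 400), (1.2, 8.0, 450),
        (2.0, 12.0, 500), (0.3, 4.0, 350), (0.8, 10.0, 420)]
medv = [30.0, 24.0, 21.0, 15.0, 28.0, 22.0]
X = [[1.0, c, i, t] for c, i, t in rows]   # leading 1 gives the intercept b0
b0, b1, b2, b3 = fit_ols(X, medv)
print(f"medv ~ {b0:.2f} + {b1:.3f}*crime + {b2:.3f}*indus + {b3:.4f}*tax")
```

The leading column of ones is what turns the intercept b0 into just another coefficient, so one solver handles the whole model.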
When you apply it to data you get an error, and the error we get at the i-th point we call error_i. As usual, we want to minimize the sum of squares of the error_i. That's the model as far as we're concerned, so there is no big difference between what we did before and what we're doing now. In Rattle, what we do is go back to the Data tab, and this time make sure that we select all three variables. We've got crime, we've got the degree of industrialization, and we've got tax; we've got three variables now in the model, and we say Execute. So it knows I'm trying to fit a model with three variables. No big difference: go to Model, go to the linear model, Execute, and it gives an output. Just remember that if you use the 70/15/15 partition and the random seed 42, your output should look the same if you do it yourself. So let's go back to the output itself. The output you got should be exactly this. How is it any different? Well, there are two ways in which it is slightly different that I want to emphasize. First, this is your model. This model says that median value is 30.811 (remember, it's in thousands), minus 0.186 times crime, minus 0.349 times industrialization, minus 0.009 times tax. So that's your linear model: approximately 30.8, minus about 0.19 times crime, minus about 0.35 times industrialization, minus 0.009 times tax. You may be wondering: is there anything that increases the price of a house? Everything seems to be dropping it. The second thing, as before, is to look at the P-values. Remember, a small P-value means you can believe the coefficient is significantly different from zero; a large P-value says there is a real chance the true coefficient was actually zero and you saw these values by accident. So small P-values mean you can trust these estimates more, and likewise large t-values mean you can trust these estimates more.
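Once the coefficients are in hand, prediction is just arithmetic. Here is a small sketch using the coefficients reported in the output above (median value in thousands of dollars); the example zone is hypothetical.

```python
# Coefficients from the fitted Rattle model quoted above (median value in $1000s)
b0, b_crime, b_indus, b_tax = 30.811, -0.186, -0.349, -0.009

def predict_medv(crime, indus, tax):
    """Predicted median house value (in $1000s) for a zone."""
    return b0 + b_crime * crime + b_indus * indus + b_tax * tax

# Hypothetical zone: crime rate 1.5, industrialization level 8, tax rate 400
print(round(predict_medv(1.5, 8.0, 400), 2))  # -> 24.14, i.e. about $24,140
```

Notice that every slope is negative, which is exactly the observation made above: in this three-variable model, nothing pushes the predicted price up from the intercept.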
Since these P-values are pretty good, you can probably trust the estimates quite a lot. Now, the big difference is here: the three variables together explain 29 percent of the variability in the data. Remember, each of those variables alone was explaining only 16 or 20 percent; together they explain 29 percent. If you think about it, the more variables we add, the better the explanation will be. So as we add more and more variables, the R-squared value you see is going to go up. Is that good? Well, there is a way of penalizing more and more variables, and that is the adjusted R-squared. What the adjusted R-squared does is penalize the fact that you now have three variables; if you add a fourth, it will penalize that too. So the adjusted R-squared will be smaller than the multiple R-squared, and it's a penalty because we want a model which is simple and elegant, and we want to avoid overfitting the data. I want to show you this table out here again; it actually breaks it up for you. Remember, the total variability in the data, the total squared deviation from the mean, was about 28,000. It says that what I can't explain is 19,700, but each of the variables is explaining some part of the variability in the data. You will see that crime is explaining about 4,557, industrialization 3,233, and tax 339. The amount explained, as a fraction of the total variation in the data, is what we call R-squared. It's the same idea; no big difference. One last point. Go back and look at the coefficients you got, which is interesting. These are the coefficients you got when you ran the individual models: -0.38, -0.64, and -0.02. When you run them together, the coefficients you get are about -0.19, -0.35, and -0.009; note how they have become smaller in magnitude. What it means is that when these variables were acting alone, they were also acting on behalf of the other variables, so the coefficients were larger.
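The R-squared and adjusted R-squared can be reproduced from the sums of squares quoted above. The sums of squares are from the table in the text; the training-set size n below is an assumption (roughly 70 percent of the 506 Boston zones), since the transcript does not state it.

```python
# Sums of squares quoted in the table above
ss_crime, ss_indus, ss_tax = 4557, 3233, 339   # variability explained by each variable
ss_resid = 19700                               # residual (unexplained) variability
ss_total = ss_crime + ss_indus + ss_tax + ss_resid

# R-squared: fraction of total variability that the model explains
r2 = (ss_crime + ss_indus + ss_tax) / ss_total

# Adjusted R-squared penalizes the number of predictors p relative to n rows.
n, p = 354, 3                                  # n is an assumed training-set size
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R-squared          = {r2:.4f}")        # about 0.29, matching the output
print(f"adjusted R-squared = {adj_r2:.4f}")    # slightly smaller, as expected
```

The adjusted value is always below the plain R-squared whenever p > 0, which is exactly the penalty for extra variables described above.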
But now that the variables are together, each variable is still acting, but you're separating the effects into three different pieces. This is called a partial regression coefficient: given the other variables, how much does crime explain? Given crime and tax, how much does industrialization explain? This is an interesting point, because it basically means that as we add more and more variables, in the presence of many variables a particular coefficient may become so small that that variable stops explaining much of the variation in the response. I don't want to go beyond that here. There are ways to decide whether to drop a variable or add a variable, which we'll get to further on. So just note that the big difference here is the adjusted R-squared. That's one thing I want you to take away from this: it is a way of penalizing more and more variables, as we agreed. We'll see other ways of handling the same problem. Okay? One last thing. So you run the model. After you run the model, Rattle has a nice feature called Evaluate. We'll evaluate on the training set first, and then do the same on the validation set. So let's evaluate it on the training set, press Execute, and see what plot we get. If you notice, you're now getting a better fit on the predicted values: the line going through them is closer to the 45-degree line. You're also seeing, to some extent, points on both sides of the line, which probably tells us we are not under- or over-fitting as much as before. On top of that, you can see the R-squared value of 29.12 percent here, which is the R-squared value you got in the model. We could do the same thing on the validation set instead of the training set: press Execute, go back, and see that it shows an R-squared on the validation set of about 22 percent.
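The shrinking of coefficients can be seen in a tiny sketch with made-up numbers. Below, x2 moves almost in lockstep with x1, so when x1 is regressed alone it also "acts on behalf of" x2, exactly as described above, and its slope roughly doubles; fitting both together recovers the separate partial effects.

```python
# Illustration of partial regression coefficients on synthetic data.
def centered(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 8.0, 10.1, 11.8, 14.2, 15.9]   # roughly 2 * x1 (correlated)
y  = [-a - 0.5 * b for a, b in zip(x1, x2)]          # depends on both variables

cx1, cx2, cy = centered(x1), centered(x2), centered(y)
dot = lambda u, v: sum(a * b for a, b in zip(u, v))

# Simple regression of y on x1 alone
slope_alone = dot(cx1, cy) / dot(cx1, cx1)

# Multiple regression of y on x1 and x2: 2x2 normal equations via Cramer's rule
s11, s22, s12 = dot(cx1, cx1), dot(cx2, cx2), dot(cx1, cx2)
s1y, s2y = dot(cx1, cy), dot(cx2, cy)
det = s11 * s22 - s12 * s12
b1_partial = (s1y * s22 - s2y * s12) / det
b2_partial = (s11 * s2y - s12 * s1y) / det

print(f"x1 alone:   slope = {slope_alone:.2f}")   # close to -2: x1 carries x2's effect too
print(f"x1 with x2: slope = {b1_partial:.2f}")    # close to -1: only x1's own partial effect
```

This is the same phenomenon as the Boston coefficients shrinking from -0.38, -0.64, -0.02 in the individual models to roughly -0.19, -0.35, -0.009 in the joint model.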
You can see the lines have moved closer. In fact, compared to the single-variable model, you can see in this plot the predicted versus observed values when we were regressing on only one variable, and here we were doing a regression against three variables; side by side, you can see that things have improved. One could spend a lifetime understanding these regression models, but what I want you to take away from this is very simple. Look at the data. Look at simple models; understand whether they make sense; understand whether your explanatory variables correlate well with the response variable. Try to add more variables cautiously, making sure that every time we add one, we are actually improving the fit. We're going to see this in the next segment.