In this scatter plot, I plotted homes with and without a garage in different colors.
Homes with garages are blue, and homes without garages are red.
There are many more homes with a parking garage.
Given the heavy snowfall from time to time in Boulder, this is not too surprising.
This graph also confirms that homes with parking garages tend to be more expensive,
as the red circles, the homes without garages, are concentrated in the lower left corner of the graph, where both square footage and list price are low.
To add PARKING.TYPE to the regression model, the regression equation gets
one more term on the right-hand side, which measures the impact of PARKING.TYPE:
y-hat = b0 + b1 * square footage + b2 * PARKING.TYPE.
As before, y-hat represents the predicted value of the list price.
b0 is the intercept, and b1 is the slope for square footage.
We also have b2, the slope for PARKING.TYPE.
For the data set we have, we obtain that b0 is about -43,
b1 is about 0.43, and b2 is about -99, with list price measured in thousands of dollars.
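As a rough sketch of how such a model could be fit, here is a Python/statsmodels version with made-up numbers standing in for the Boulder listings; the column PARKING_TYPE plays the role of PARKING.TYPE (dots are not valid in Python formulas), and this toy data will not reproduce the estimates quoted above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the Boulder listings: list price in $1,000s,
# square footage, and a 0/1 garage indicator (PARKING_TYPE ~ PARKING.TYPE).
homes = pd.DataFrame({
    "PRICE":        [350, 420, 510, 610, 730, 880, 300, 395, 560, 700],
    "SQFT":         [900, 1100, 1400, 1700, 2100, 2600, 850, 1050, 1600, 2000],
    "PARKING_TYPE": [0, 0, 1, 1, 1, 1, 0, 0, 1, 1],  # 1 = garage, 0 = none
})

# Multiple regression: PRICE = b0 + b1*SQFT + b2*PARKING_TYPE
fit = smf.ols("PRICE ~ SQFT + PARKING_TYPE", data=homes).fit()
print(fit.params)  # Intercept (b0), SQFT slope (b1), PARKING_TYPE effect (b2)
```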
Note that b2 essentially changes the value of the intercept.
When PARKING.TYPE equals 1, in other words, when there is a garage,
we deduct 99 from the intercept.
When PARKING.TYPE is 0, the last term on the right-hand side vanishes,
and the intercept stays at about -43.
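To make the intercept shift concrete, here is a tiny illustrative check using the estimates quoted above (prices in $1,000s); the predicted_price helper and the 2,000-square-foot input are just for illustration, not from the lecture.

```python
b0, b1, b2 = -43, 0.43, -99  # estimates quoted above, list price in $1,000s

def predicted_price(sqft, parking_type):
    # y-hat = b0 + b1*SQFT + b2*PARKING.TYPE
    return b0 + b1 * sqft + b2 * parking_type

# The garage dummy only shifts the intercept, e.g. for a 2,000 sq ft home:
print(predicted_price(2000, parking_type=0))  # -43 + 0.43*2000        = about 817
print(predicted_price(2000, parking_type=1))  # (-43 - 99) + 0.43*2000 = about 718
```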
Another way to interpret this result is that
garage parking is worth about negative $99,000.
This is quite surprising; intuitively, we expect houses
with garage parking to be worth more, as we demonstrated visually before.
This cautions us in interpreting regression results. The counterintuitive
result here can be attributed to what we call multicollinearity,
which is a very important concept in multiple regression.
Multicollinearity, also called collinearity, is a phenomenon in which two
or more predictor variables in a multiple regression model are highly correlated.
In this particular example, PARKING.TYPE and square footage are correlated
in the sense that larger houses are more likely to have a garage.
When you add PARKING.TYPE to the model, you can see that the estimates for
the other predictor variables change considerably.
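One way to see both points, continuing the hypothetical homes DataFrame and smf import from the sketch above, is to check the correlation between the two predictors and compare the SQFT slope with and without the garage dummy in the model.

```python
# Correlation between the two predictors (PARKING_TYPE is 0/1, so this is
# a point-biserial correlation): larger homes tend to have garages.
print(homes["SQFT"].corr(homes["PARKING_TYPE"]))

# Compare the SQFT slope with and without the garage dummy in the model.
simple   = smf.ols("PRICE ~ SQFT", data=homes).fit()
multiple = smf.ols("PRICE ~ SQFT + PARKING_TYPE", data=homes).fit()
print(simple.params["SQFT"], multiple.params["SQFT"])
```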
This creates two issues. First, in the extreme case,
when two predictors are highly correlated, the numerical procedure that we use to
estimate the regression model becomes unstable, in the sense that
a small change in the data can cause a huge swing in the coefficient estimates.
That is certainly an undesirable feature of the linear regression model.
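A small simulation can illustrate this instability; none of the numbers here come from the lecture, it simply shows that with two nearly collinear predictors, nudging a single observation can move the individual coefficient estimates noticeably.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
y = 2 * x1 + 3 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
before = sm.OLS(y, X).fit().params

# Nudge a single observation slightly and refit.
y_perturbed = y.copy()
y_perturbed[0] += 0.5
after = sm.OLS(y_perturbed, X).fit().params

# With highly correlated predictors, the individual coefficients can swing
# noticeably even though the perturbation is tiny.
print(before)
print(after)
```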
The second issue is that
multicollinearity causes difficulty in interpreting modeling results.
We would like to attach an intuitive interpretation to the coefficient
estimates; however,
the coefficient estimates change depending on which predictors are included in the model.
Therefore, certain interpretations may not be as reliable
as we are willing to believe.
This is not as serious an issue
if we are only interested in the predictive accuracy of the models.