In this section, we'll move on to discussing our next machine learning model: decision trees. So let's go over the learning goals for this section. We're going to start with a brief review of the classification problems that we discussed earlier. We're going to go over the actual decision tree classification algorithm and how it works. We're then going to dive deeper into how the splits are actually performed in decision trees, using both entropy and information gain. And then, finally, we're going to discuss how to regularize our tree, using something called pruning, in order to address overfitting. Let's start with a brief review of some classification algorithms we've discussed thus far and a brief overview of their pros and cons. With K-Nearest Neighbors, fitting is fast, since with KNN the training data is the model, so there's essentially no computation needed at fit time. But predicting the class for a new record can be very slow since, as we discussed, it involves identifying the K nearest neighbors of that record, which means we have a lot of distances to compute, and that can really slow down the algorithm. And the decision boundary is very flexible, so it's most likely not some easily recognizable shape, such as the straight line we'd get with a linear support vector machine or with logistic regression; it's probably something a bit curvier, like we see here to the left. Now, moving to logistic regression: logistic regression learns a set of parameter values, beta-naught, beta-1, and so on. So fitting involves solving equations, running iterations, and so forth to find those parameters, and that fitting process can be very slow. 
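As a small sketch of the KNN trade-off just described, here's what fast fitting and slower prediction look like with scikit-learn's KNeighborsClassifier. Note the dataset, its size, and the choice of k = 5 are illustrative assumptions, not part of the lecture's example.

```python
# Minimal sketch (assumed setup): KNN "fitting" mostly just stores the
# training data, while prediction must search it for nearest neighbors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 20))    # 10,000 records, 20 features
y_train = (X_train[:, 0] > 0).astype(int)  # a simple synthetic label

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)   # fast: little computation happens here

X_new = rng.normal(size=(5, 20))
preds = knn.predict(X_new)  # slower: distances to all 10,000 points
print(preds.shape)          # (5,)
```

For large training sets this is exactly the asymmetry described above: calling fit is nearly free, while every predict call pays the neighbor-search cost.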
But predicting, on the other hand, is just a short sequence of multiplications, additions, and exponentiations, depending on what your parameters are, so it's pretty fast to actually predict a value. And the decision boundary is going to be linear. We also introduced support vector machines, whose decision boundaries are either simple and linear, so fairly fast to compute, or require the kernel trick to come up with a nonlinear classifier, which takes a lot longer to actually fit. Now, to motivate decision trees, let's set up an example. We have data on customer turnout at a tennis facility. For our upcoming discussion, suppose you are the administrator of this facility, and in addition to predicting the number of customers, you want to evaluate the impact of various drivers on player turnout. These include the weather outlook, an ordinal variable where sunny is greater than overcast, which is greater than rainy; temperature, with the ordinal values cool, mild, and hot, with hot being the greatest and cool the least; humidity levels; and strong or weak wind. Using these features, we want to predict whether or not customers will play tennis at our courts. A decision tree seeks to split the dataset into two datasets at every step, for each of which the decision is easier, and then continues to iterate. So this purple node asks the question: is temperature greater than or equal to mild? Remember, we're working with an ordinal variable, so mild or greater means mild or hot. That question divides the original training dataset into two subsets, and hopefully these two sub-datasets contain more information about the target variable we want to predict, so that we can say: if temperature is less than mild, then it's most likely that customers will not play tennis. 
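To make this kind of split concrete, here's a hedged sketch using scikit-learn's DecisionTreeClassifier on a tiny, invented version of the tennis data. The records, their ordinal integer encodings, and the labels below are all assumptions for illustration, not the lecture's actual dataset.

```python
# Sketch (assumed toy data): encode the ordinal weather features as
# integers and fit a shallow decision tree.
from sklearn.tree import DecisionTreeClassifier

# Ordinal encodings: outlook rainy=0 < overcast=1 < sunny=2;
# temperature cool=0 < mild=1 < hot=2; wind weak=0, strong=1.
X = [
    [2, 2, 0],  # sunny, hot, weak
    [2, 1, 0],  # sunny, mild, weak
    [1, 1, 1],  # overcast, mild, strong
    [0, 0, 0],  # rainy, cool, weak
    [0, 0, 1],  # rainy, cool, strong
    [1, 2, 0],  # overcast, hot, weak
]
y = [1, 1, 1, 0, 0, 1]  # 1 = played tennis, 0 = did not

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[2, 1, 0]]))  # a sunny, mild, weak-wind day
```

In this toy data, a single threshold on an ordinal feature already separates the classes, which is exactly the "ask a yes/no question, split the data" idea being described.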
And if temperature is greater than or equal to mild, then customers will be more likely to play tennis. So that's the idea: we split up our data, and hopefully we get a cleaner split between those who play tennis and those who do not. Now, the circles where we ask each of our questions are going to be our nodes, and the circles where we reach a decision, yes or no, are going to be called the leaves; those are our leaves here below. There's no reason to ask just one question. We can keep asking question after question to further segment our dataset. The depth of this tree is now two, since we've split our tree twice. A decision tree has categories in the leaves, the categories here being will or will not play tennis, and it uses the majority class left at each final leaf to predict the class of a new record that follows the steps down the decision tree. So it's a classification algorithm, classifying according to the majority class here: either played tennis or did not play tennis. However, this same idea can be used to predict quantities instead of classes, and rather than decision trees, these are called regression trees. So here's an example of a regression tree. Here the inputs are numerical values for slope and elevation within the Himalayas, and we want to predict the average precipitation, which is a continuous value. So we're no longer trying to come up with a classification but rather a regression, trying to predict some continuous value. We'll use the same idea as we did with classification. At each node, we ask a yes or no question. Here we see the question: is the elevation less than 7,900 feet? We use that to split our dataset into true or false, which splits our original dataset into two smaller datasets. And we can again ask another question or make a decision. Here we have: is slope less than or equal to 2.5? 
Then, at each one of these leaves, we're going to have the average outcome value for the records remaining in that subset. So we keep splitting up our dataset into smaller subsets; each subset has its own continuous outcome values, and within each subset we can average those values and use that average as our prediction for the precipitation. To see how this looks on a two-dimensional graph, suppose we have just one feature. Here our x ranges from 0 to 5 and our y from -2 to 2, and we're trying to predict, using the continuous values on the x axis, what the values will be on the y axis. With binary splits, there can only be a certain number of leaves, depending on the depth of our tree; we saw the depth of our tree increase from one to two before. So the possible outputs of a regression tree are bounded by the number of leaves: the leaf averages are the only values it can output. We won't get a linear function that can spit out essentially any value; instead, we're limited to however many leaves we have at the end of our tree. So here is a regression tree of depth two. It spits out four different values, and projecting those onto the y axis, ignoring each of the jumps, those are the four values that we predict for our different values of x. Now, increasing the depth of the tree allows for more possible values: the greater the depth, the more subsets, and the more distinct averages you're going to be working with. And that can be a good thing or a bad thing. This new tree of depth five seems fit to each one of our individual data points, so we may be overfitting here. So we want to find the right balance of depth so we don't overfit our data. In the next video, we'll begin to dive under the hood into how a decision tree is actually built. 
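Before that, here's a small sketch of the depth trade-off just discussed, using scikit-learn's DecisionTreeRegressor on a one-feature toy problem. The noisy sine data below is an assumption for illustration, standing in for the precipitation example.

```python
# Sketch (assumed toy data): a depth-2 regression tree can output at
# most 2**2 = 4 distinct leaf averages; a depth-5 tree can output up
# to 2**5 = 32 and starts tracking the noise.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 5, size=80)).reshape(-1, 1)  # one feature
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)  # noisy target

shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)

print(len(np.unique(shallow.predict(X))))  # at most 4 distinct values
print(len(np.unique(deep.predict(X))))     # up to 32 distinct values
```

Plotting the two prediction curves would show the step functions described above: a few flat segments for the shallow tree, and many small steps hugging individual points for the deep one.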
All right, I'll see you there.