[MUSIC] Okay, so where are we? We're talking about classification problems, and supervised learning. We talked about rules, we talked about decision trees. And then we said a challenge of decision trees is that they are prone to overfitting. And so there's methods for avoiding overfitting in individual decision trees that we didn't really talk about, such as building the full tree and then pruning out or merging parts of it so that it becomes a little more general. But, instead, we shifted to another more general, and more powerful, technique to avoid overfitting, which is to use ensembles of models. And so within the context of ensembles, we talked about boosting, and in particular mentioned this algorithm AdaBoost, which is a meta-algorithm, meaning it works with any underlying machine learning method. It doesn't have to be about trees, but we talked about it in the context of trees. Then we talked about bagging, which is about selecting subsets of your data to work on. And bagging itself relies on the bootstrap, which is a technique for re-sampling the same data set in order to derive estimates of important statistics. Then we put some of that together and talked about random forests, which is an ensemble method for decision trees that has a couple of unique characteristics. Now, I went through this rules and decision trees and ensembles approach sort of sidestepping, well, postponing, the regression-oriented machine learning models. And there's a couple reasons for this. One is random forests do turn out to be a pretty powerful method, a pretty general technique, for a lot of things, not just in their statistical power, in their efficacy in practice, but also the fact that they work on categorical attributes. The fact that they can handle many attributes, so big-p, smaller-n cases. The fact that there's different variations of them that you can experiment with. 
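As a quick refresher on the bootstrap step that bagging relies on, here is a minimal sketch: resample the same data set with replacement, recompute a statistic each time, and use the spread of those recomputed values as an estimate of its variability. The function name and the toy data are just illustrative, not from the lecture:

```python
import random
import statistics

def bootstrap_estimate(data, statistic, n_resamples=1000, seed=0):
    """Resample `data` WITH replacement many times, apply `statistic`
    to each resample, and return the mean and standard deviation of
    the resulting estimates (the std is a bootstrap standard error)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # A bootstrap sample is the same size as the original data,
        # drawn with replacement, so some points repeat and some drop out.
        sample = rng.choices(data, k=len(data))
        estimates.append(statistic(sample))
    return statistics.mean(estimates), statistics.stdev(estimates)

# Toy example: bootstrap the mean of a small numeric data set.
data = [2.1, 3.5, 4.0, 4.4, 5.2, 6.1, 7.3, 8.0]
center, spread = bootstrap_estimate(data, statistics.mean)
```

Bagging takes this same resampling idea but, instead of computing a statistic on each bootstrap sample, trains a separate model on each one and averages (or votes) their predictions.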
The fact that, frankly, they're kind of simple to interpret the results of, and also maybe even simple to implement yourself. And hopefully you're armed with enough knowledge that you think you could make a go at this. So I think they're actually a pretty fantastic general-purpose approach. All right, so let's consider a different method that's not rule- or tree-based. So that's Nearest Neighbor classification. So here, this is actually one of the simpler ones. It would be reasonable to actually present this first, but the reason I didn't do that is that it assumes only numeric attributes. And I think in a lot of the intuitive data sets you work with, especially when you're coming from kind of a data processing background, a data science sort of scenario, as opposed to a pure machine learning, pure statistical background, categorical data is quite common. And so defining nearness on categorical data is sometimes possible, but it's not always very natural. But now that we've sort of covered the more general techniques that can handle categorical data, we can think about these. So Nearest Neighbor is dead simple, right? This is to say, plot your points in a space like this, where the red squares represent one class and the blue circles represent another, maybe survived or not survived in the Titanic example. And when you have a new point that you want to classify, just drop it into the space at whatever point its attributes say it should go, and choose the class of the nearest point to it, okay? So what's the class of this green triangle? Well, the closest point to it is a blue circle, so we call it blue. And what's the class of this second green triangle? Well, the closest point to that one is a red square, so we call it red. Now, as you can see in the second case, maybe it's not that reasonable to call it red, because perhaps the class boundary is sort of something like this, and this one's an outlier, this one's a mistake. And so perhaps we would misclassify this second green triangle. 
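That picture translates almost directly into code. Here is a minimal 1-Nearest-Neighbor sketch, assuming numeric attributes and Euclidean distance; the coordinates and labels are made-up stand-ins for the red squares and blue circles in the figure:

```python
import math

def nearest_neighbor_classify(train, query):
    """Predict the class of `query` as the class of the single
    closest training point under Euclidean distance.
    `train` is a list of (point, label) pairs; points are
    equal-length tuples of numbers."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # min() over the training pairs, keyed by distance to the query.
    _, label = min(train, key=lambda pair: dist(pair[0], query))
    return label

# Toy version of the plot: two blue circles, two red squares.
train = [((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"),
         ((5.0, 5.0), "red"),  ((5.5, 4.0), "red")]
print(nearest_neighbor_classify(train, (2.0, 2.0)))  # → blue
print(nearest_neighbor_classify(train, (5.0, 4.5)))  # → red
```

Note how this captures the outlier problem from the lecture: a single mislabeled or noisy training point near the query decides the prediction all by itself, which is exactly why k-Nearest-Neighbor variants vote over several neighbors instead of one.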
All right. [MUSIC]