Welcome back to the final video for this notebook. In this video, we're going to show how high dimensionality can end up affecting model performance. Before we do, I want to quickly touch again on how we can fight the curse of dimensionality. Two methods that should immediately come to mind, as we discussed them in the intro to this course, are feature selection, where you use domain knowledge to reduce the number of features, keeping the ones you believe are most informative, and feature extraction, where you use dimensionality reduction techniques such as PCA, which we'll learn later on in this course, to transform our raw data into lower-dimensional data that hopefully preserves the majority of the variability in the original data. Again, we will touch on this later in the course.

So here we're going to show, using toy datasets that we create ourselves, how high dimensionality ends up affecting model performance. To do so, we import several libraries. We're doing a classification problem, so we need our train_test_split and our StandardScaler. Something new that we haven't seen yet is the make_classification function, available in sklearn.datasets, which creates a toy dataset with a certain number of classes; I'll show you how it works in practice in just a second. We'll then use a DecisionTreeClassifier to predict the class.

The first thing we do is create our classification dataset using the make_classification function I just introduced. To show you a bit about how its arguments work, I'll create a cell above. First we import the function, then we use it to create our X and y. X is a two-dimensional dataset: the default is 100 samples, so if we run X.shape we see 100 rows and 2 features. We get two features because we set the number of features equal to 2. We set the number of redundant features, the ones that don't give any extra information, equal to 0. In practice we often do have redundant features, such as age versus whether or not someone is a senior, as we discussed earlier; there is a bit of redundancy built in there. The number of informative features will be the rest, so here all of our features are informative. Finally, the number of clusters per class lets us spread out the data within each class.

Now I'll plot each of our classes. Along with X we have y, telling us which class each of those samples belongs to. To look at both, we use a scatter plot: we scatter the points where y equals 1, plotting the first feature against the second feature, and then create another scatter plot for the points where y equals 0, with everything else the same. You can see our two classes are differentiated fairly clearly.
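Here is a minimal sketch of the setup just described. The argument values follow the walkthrough, but the random seed and plotting details are assumptions and may differ from the notebook:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Toy dataset: 100 samples (the default), 2 features, both informative,
# no redundant features, and 2 clusters per class to spread the data out.
X, y = make_classification(
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=2,
    random_state=42,  # assumed seed; the notebook's seed may differ
)

print(X.shape)  # (100, 2)

# Scatter each class separately: feature 0 on the x-axis, feature 1 on the y-axis.
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="class 1")
plt.scatter(X[y == 0, 0], X[y == 0, 1], label="class 0")
plt.legend()
plt.show()
```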
Just to show how the different arguments change things: if we set the number of clusters per class to 1, so that there are no separate clusters within each class, you see the two classes are very clearly separated. Adding the extra cluster per class lets the classes sit a bit closer together, since each class is split across separate clusters. Along the same lines, if only one of our two features is informative instead of both, then the other is redundant and everything lines up along one axis. We don't get that clean separation, and one of the features essentially adds no extra value, either on its own or in combination. So that's how make_classification works.

We saw the original plot of what our data looks like when we're working with two features. We're then going to add a bit of noise to it. We create a random state object with a seed of 2, and with that object we call its uniform method, adding to X an array of random values of the same shape as X, so that something is added to each individual data point in our 100-by-2 array. The default for the uniform draw is values between 0 and 1, and we multiply by 2, so we're adding values between 0 and 2. We then scale our data with the StandardScaler so that each feature has mean 0 and standard deviation 1. Now that we have our data, resetting X to the standardized version of itself, we can split it into X_train, X_test, y_train, and y_test.

So we have our toy dataset. We can then fit our DecisionTreeClassifier on X_train and y_train and check its score on X_test and y_test. We see that the score for this two-feature classifier is 0.875.

Now we're going to run all the same steps, but with the number of features going up a hundredfold, to 200. What's important to note is that we're also ensuring every one of those features is informative; we're not allowing any redundant features here. Otherwise everything is the same: we set our random state again, add on that extra noise of 2 times the uniform values, run through the steps of setting up the training and test sets, and then fit our DecisionTreeClassifier on the standardized data and check the score on the test set after fitting on the training set. We see that the score drops all the way down to 0.425.

So adding additional features, even informative ones, can end up leading to worse model performance, because it greatly increases how much the model overfits to each of those features. Something to note along with this, as we mentioned during the lectures: if you're going to have more features, you should try to also have more rows of data. If we had enough rows, maybe we could counteract this problem. But generally, for a fixed number of rows, the fewer features you have, the more informative each of those features can be, and the less likely you are to overfit.
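The pipeline just described, wrapped in a small helper so we can compare 2 features against 200, might look roughly like this. The helper name, seeds, and train/test split settings are assumptions, not the notebook's exact code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def score_with_n_features(n_features, seed=42):
    """Toy-data pipeline: make data, add noise, standardize, split,
    fit a decision tree, and return the test-set accuracy."""
    # Every feature carries signal, none are redundant.
    X, y = make_classification(
        n_features=n_features,
        n_informative=n_features,
        n_redundant=0,
        random_state=seed,
    )

    # Add uniform noise in [0, 2) to every value, as in the walkthrough.
    rng = np.random.RandomState(2)
    X = X + 2 * rng.uniform(size=X.shape)

    # Standardize so each feature has mean 0 and standard deviation 1.
    X = StandardScaler().fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

    tree = DecisionTreeClassifier(random_state=seed)
    tree.fit(X_train, y_train)
    return tree.score(X_test, y_test)


print(score_with_n_features(2))    # roughly 0.875 in the video; exact value depends on seeds
print(score_with_n_features(200))  # much lower, roughly 0.425 in the video
```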
Then, rather than just looking at 2 and 200 features, we loop through values between 50 and 4,000 and run through the same steps for each. All the steps are the same: for each num in np.linspace, starting at 50 and going up to 4,000 in steps of 50, we pass num in as the number of features, set the number of redundant features to 0, and keep all of the features informative; everything else stays the same. Each score is appended onto an empty list. We run this, which takes a little while, and then we plot the results, looking at the classification accuracy as we increase the number of features (a rough sketch of this sweep is included at the end of this section).

By chance, some runs can come out a bit more accurate, but adding features can, in general, very much lead to reductions in accuracy. Not all the time, but it very easily can. In this example the accuracy is highly volatile, and increasing the number of features again tends to reduce it. Additionally, in our example we specified that none of the features are redundant; in practice, when you have this many features, you will almost certainly have redundant ones. For example, if we were predicting customer churn, as we've discussed throughout these courses, using a variety of customer characteristics, we may have collected extensive data for each customer across many dimensions. That would be a real-world example of a high-dimensional space, which can make it difficult to apply unsupervised learning methods directly and can lead to issues with the curse of dimensionality as we try to create these groupings.

So that closes out our video here on the curse of dimensionality. With that, we're going to go back to discussing different types of groupings, different types of clustering algorithms, starting off with agglomerative hierarchical clustering, and I look forward to seeing you there.
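For reference, here is a rough, self-contained sketch of the sweep described above. The feature-count grid is built with np.arange here for simplicity (the notebook uses np.linspace), and the seeds are assumptions; larger feature counts make this loop slow to run:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Feature counts from 50 up to 4,000 in steps of 50.
feature_counts = np.arange(50, 4001, 50)
scores = []

for num in feature_counts:
    num = int(num)
    # Same pipeline as before: every feature informative, none redundant.
    X, y = make_classification(
        n_features=num,
        n_informative=num,
        n_redundant=0,
        random_state=42,  # assumed seed
    )
    rng = np.random.RandomState(2)
    X = X + 2 * rng.uniform(size=X.shape)  # add the same uniform noise in [0, 2)
    X = StandardScaler().fit_transform(X)  # mean 0, std 1 per feature
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    scores.append(tree.score(X_test, y_test))

plt.plot(feature_counts, scores)
plt.xlabel("Number of features")
plt.ylabel("Test-set accuracy")
plt.show()
```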