[MUSIC] We talked about different ways to measure the performance of our models, and we've talked about the inherent trade-off between bias and variance. In other words, allowing models to be complex enough that they can learn to predict accurately, without letting them overfit the training data. In this video, we'll talk about another set of tools that help us find the right complexity for our model and avoid overfitting: learning curves. We'll also talk about using performance plots to evaluate some kinds of models. By the end, you'll be able to use the appropriate tools to assess and improve model performance.

In previous videos we saw how to use held-out test data to assess the performance of regression and classification models in predicting labels for unseen data. We build models expecting them to perform well on that unseen data, so we cannot test the performance of our model on just any data, like the data we used for training, and expect the performance number to be a good reflection of the model's performance on unseen data. In other words, that number is not a good indication of the ability of our model to generalize. So we partition the data set we have into training, test, and possibly validation parts, and we evaluate on the test part.

Two key factors affect the generalization ability of models: the complexity of the model the learning algorithm produces and the size of the training data set. If our model is too simple, we could be underfitting. This means the prediction error on both the training and test data will be high, as the model is too simple to explain the variance, or complexities, in the data. On the other hand, if the model is too complex, it may overfit the data used to train it. This means our model will explain all the variance and complexity in the data it's trained on far too well, showing a low training error. However, the generalization performance can be very low, because the model also explains the variance and complexity of the particular random noise that happened to be in the training data.

Generalization, then, is a nuanced thing. We want to be able to identify the right complexity for our model to avoid both underfitting and overfitting. We can change the complexity of the model by altering various hyperparameters of the learning algorithm during training. For instance, increasing the degree of the polynomial features is one way to increase the complexity of the model. If we plot the errors on training and test data for models of different complexity, we see both errors go down up to some point. But in many cases, after a certain point the training error continues going down while the error on the test data goes up. This point is where the model starts overfitting the training data due to excessive complexity. Stopping just before the test error starts increasing would be ideal; the model at that point likely has the right complexity.

The same thing can happen for fixed model complexity but increasing training time, particularly in the case of neural networks, which run over the training data many, many times. As you train more and more on the same data, your learning algorithm tunes its model more and more closely to the training data. You want to stop before overfitting becomes a problem, and you can use a curve like this to diagnose it. The good news is you don't have to wait until testing to use these tools. That's where your validation data set comes in.
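To make this concrete, here is a minimal sketch, not part of the original lesson, of how such a complexity curve might be drawn with scikit-learn: training and validation error plotted against polynomial degree. The synthetic data set and the specific parameter values are placeholder assumptions; with real data you would substitute your own features and labels.

```python
# Sketch: training vs. validation error as model complexity (polynomial degree) grows.
# Assumes scikit-learn and matplotlib are available; the data below is synthetic.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic regression data standing in for a real problem
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Hold out a validation set as a proxy for unseen test data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

degrees = range(1, 15)
train_err, val_err = [], []
for d in degrees:
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X_train, y_train)
    train_err.append(mean_squared_error(y_train, model.predict(X_train)))
    val_err.append(mean_squared_error(y_val, model.predict(X_val)))

# Training error keeps falling; validation error turns back up once the
# model starts fitting noise -- that turning point is where to stop.
plt.plot(degrees, train_err, label="training error")
plt.plot(degrees, val_err, label="validation error")
plt.xlabel("polynomial degree (model complexity)")
plt.ylabel("mean squared error")
plt.legend()
plt.show()
```

The degree at which the validation error bottoms out is the stopping point described above; for a neural network you would plot error against training epochs instead of degree.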
You can plot the performance on training data along with the performance on validation data versus different model complexities or time spent in training. The performance on the validation set acts as a proxy for performance on the test data set, so you can stop at the correct point in complexity and the correct point in training time.

On the other hand, the size of the training data plays a major role in model performance as well. If you have more data for training, the ability of your model to generalize almost always becomes better. If we plot learning curves of the training and test errors for different training data set sizes, at fixed model complexity, we can see this. The model will have low training error when we use a small training data set. However, the test error will be really high, because the model hasn't learned much about the variations in the data distribution and so cannot predict well for out-of-sample instances, in other words, the data points the model hasn't seen before. When more training instances are used, the training error increases, because the variance in the training data has increased while the model complexity stayed fixed. The test error, however, goes down, because the model can now generalize better.

Another type of performance curve, one that specifically measures the performance of classification models, is the ROC, or Receiver Operating Characteristic, curve. For binary classification, when you have a choice of threshold, as with classifiers that output probabilities, the ROC curve lets you see the trade-off between recall and false alarms. It illustrates the capability of the classifier to distinguish between the two classes as the class discrimination threshold varies. This is achieved by plotting the true positive rate against the false positive rate. We saw in previous videos that recall is the true positive rate, and the false positive rate can be considered the rate of false alarms. A point on the ROC curve represents the classification model with a specific threshold setting for determining the class, and the ROC curve is the collection of such points.

The best possible classification model, with the right threshold setting, would give a point in the upper-left corner, at coordinates (0, 1). This essentially means no false positives and no false negatives. On the other hand, a random guess, such as a fair coin flip, would give points along the diagonal line that goes from the bottom left to the top right. Any point that falls above this diagonal represents classification better than random guessing, while points below it represent classification worse than random guessing. AUC, the Area Under the Curve of the ROC curve, represents the probability that a classification model will rank a randomly chosen positive instance higher than a randomly chosen negative one; in other words, that it gets the ranking right. AUC values vary between zero and one, where one means the classifier has excellent performance in separating the two classes, zero means the worst possible performance, and 0.5 means the model doesn't actually separate the classes at all. Some of these curves are shown on the screen.

So now you've seen several ways in which plotting learning performance can be used to evaluate your models. Contrasting training error with test error lets you see the cost of complexity and know when a decrease in training error is actually hurting you. For classification, you can use the ROC curve and its AUC as a more nuanced tool to see the trade-off between recall and false alarms. Happy model evaluation.
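As a companion to the ROC discussion above, here is a minimal sketch, not from the video itself, of computing and plotting an ROC curve and its AUC with scikit-learn. The synthetic classification data and the logistic regression classifier are stand-ins; any binary classifier that outputs probabilities or scores would work the same way.

```python
# Sketch: ROC curve and AUC for a binary classifier.
# Assumes scikit-learn and matplotlib are available; the data below is synthetic.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probabilities for the positive class; each threshold on these
# probabilities gives one (false positive rate, true positive rate) point.
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label=f"classifier (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")  # diagonal baseline
plt.xlabel("false positive rate")
plt.ylabel("true positive rate (recall)")
plt.legend()
plt.show()
```

The dashed diagonal is the random-guessing baseline from the discussion above; the further the curve bows toward the upper-left corner at (0, 1), the closer the AUC gets to one.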