Hi, I'm Trent Buskirk. Welcome to our example, where we're going to be looking at measuring the quality of an analysis of gathered data. Hopefully, you were just able to see the lecture on how we measure the quality of an analysis of gathered data, where we introduced metrics such as sensitivity, the true positive rate, and specificity, the true negative rate. We also talked a little bit about how machine learning models can be useful with gathered data. Now we're going to walk through a specific example where we calculate confusion matrices and then derive some of the metrics that we discussed in the previous video. For this particular example, we're going to make use of data that has already been modeled. We have three different machine learning models that have been applied to a dataset to predict whether or not a student is going to drop out of a course. Our outcome measure is 0 or 1: 0 represents no dropout from a course and 1 represents dropout from a course. We have a binary outcome variable, so we're going to be looking at classification models to predict whether or not a student has dropped out of a course. The models that we're using are based on a training dataset of 2,000 cases using administrative record data from the university that included the demographic and economic variables you see here: things like gender, race, class standing (whether you're a freshman, sophomore, etc.), the current credit hours one is taking, your cumulative GPA, the college of your major, whether you've had any prior drops (that is, whether you've dropped a class before), the number of credit hours completed with a B or better in your degree program, and whether the student lives on campus or elsewhere. These variables went into the machine learning models, so we're not going to talk about how to build those models; we're really going to talk about how to measure the quality of them.
We have three different machine learning models that were used with this 2,000-student dataset, and we have developed predicted outcomes for each of those models; those are under our model-predicted test data tab. We also have the actual value of whether or not each student dropped out. We created these models using the 2,000 cases as part of our training data, and we're now going to evaluate the model quality using a test dataset of 100 students that we held out. Essentially, we have data from 100 students in our test dataset. We've applied each of the three machine learning models to create predicted values for each of these 100 students, and the predicted values are either 0 or 1: we predicted they do not drop out, or we predicted they do drop out. Let's take a look at the data under the ModelPredsTestData tab. You'll see here that these are just the ID numbers of the students; there are 100 of them in this file, representing the 100 students in our test dataset. Column B is labeled "DroppedOut". That's the actual true value: a 0 means that the student did not drop out of the course, and a 1 means that the student did drop out of the course, according to the administrative records on file at the school. Model 1 predicted values are included in column C. Again, a 0 here means the model predicts that the student did not drop out, and a 1 in this column means the model predicts that the student did drop out. The same goes for model 2 and model 3. Remember, in the previous lecture we talked about how analysts might be advised to see how consistent results are across many different models using the same data. This is an example in that regard. Here, I've got three different models that I've run predictions with, and I'm interested to see if there's a clear winner among the three.
I'm also interested to see how consistent the model performance is across these three different models. I might look to choose the model that has the highest level of accuracy, or the highest level of sensitivity or specificity, among the three models available to me. Being able to assess the quality of an analysis with gathered data is really important. In this case, the analysis has been completed. The goal of that analysis was to predict whether or not a student is going to drop out. Now we have these predictions, and using them along with the true values that we know to be accurate from the administrative records, we can assess the quality of the model analysis. Specifically, let's go back to the first tab. What we want to do in this example is use the predicted classes for each of the three models, along with the true values in the DroppedOut column, to estimate the overall accuracy of each of the models, as well as the true positive and true negative rates, based on the 100-student test data. We're going to do this by computing the confusion matrix for each of these models, and then we'll use these matrices to derive the model quality metrics. I'm going to walk you through how to do this for model 1, and you'll be able to reproduce what I'm doing for models 2 and 3. I'll go over some solutions with you, and we'll talk about how to interpret these metrics as well as how to use them to decide which model we might want to use. Let's begin. Now that we've familiarized ourselves with the data, we're going to create these confusion matrices. Remember that the confusion matrix for a binary outcome is a two-by-two table: the rows of the table are the predicted values from a particular model, and the columns of the table correspond to the actual values that we have from the administrative records.
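If you prefer code to a spreadsheet, the two-by-two table just described can be built by counting (predicted, actual) pairs. Here is a minimal sketch using toy 0/1 values, not the course file:

```python
# A minimal sketch (toy data, not the course file) of building the two-by-two
# confusion matrix described above: rows are the predicted class, columns are
# the actual class.

def confusion_matrix(predicted, actual):
    """Count (predicted, actual) pairs for a 0/1 outcome."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for p, a in zip(predicted, actual):
        counts[(p, a)] += 1
    return counts

# Hypothetical predictions and true values for six students:
pred = [0, 0, 1, 1, 1, 0]
true = [0, 1, 1, 1, 0, 0]
cm = confusion_matrix(pred, true)
```

Each key is a (predicted, actual) cell of the table, so `cm[(1, 1)]` counts students predicted to drop out who actually did.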
In order to do this, we're going to insert a pivot table, and from that pivot table we're going to create the confusion matrix; from there, we'll be able to calculate the metrics we've been talking about for this example. First, I'd like to label our spreadsheet a little bit so everybody knows where we are. I'm going to call this little area Model 1; this is going to be our area to work on metrics for Model 1. Now I'm going to insert a pivot table, and I'm going to go back to my original data to pull in all of these variables again. I'll click "OK" and insert. Now there's a pivot table here, and what I need to do is fill the pivot table to create the confusion matrix. This is where we're going to make the confusion matrix; it will be constructed from the pivot table. The rows are going to be our predicted values for Model 1. Remember, this is the Model 1 area. I'm going to go over to the pivot table fields, grab the Model 1 predicted values, and pull them down to be the rows of the table. Then for the columns of the table, I want to use the DroppedOut variable; that's the gold-standard variable that we have, so I'm going to pull that in as the columns. The rows are the Model 1 predicted values, and I'm going to remove the blanks. I'm not going to include blanks; we don't have any blanks, so this just simplifies things here. Now I have to populate this table. I'm going to pull in the Model 1 predicted values as the values category, and I want to choose count, so I'm just counting up the records that correspond to each of these cells. Now you will see that there are 41 cases that the model predicts are zero, or non-dropouts, that are actually non-dropouts, and there are 29 cases where the model predicts they are dropouts and they are actually dropouts.
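The same pivot-table step has a one-line equivalent in pandas. This sketch assumes the test data were exported with columns named "DroppedOut" and "Model1Pred" (my own hypothetical names) and uses toy values, not the real 100-student file:

```python
# pd.crosstab plays the role of the pivot table: rows are the Model 1
# predictions, columns are the actual outcomes from the records.
import pandas as pd

df = pd.DataFrame({
    "DroppedOut": [0, 1, 1, 0, 1, 0],   # actual outcome (toy data)
    "Model1Pred": [0, 1, 0, 0, 1, 1],   # Model 1 predicted outcome (toy data)
})
cm = pd.crosstab(df["Model1Pred"], df["DroppedOut"])
```

The resulting `cm` is the two-by-two confusion matrix, with the main diagonal holding the correct classifications.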
Remember, the rows of this table are the predicted values and the columns are the actual values. Now I want to create the metrics. Let me set this up. We will calculate the sensitivity, which is the true positive rate; the specificity, which is the true negative rate; and finally the overall accuracy, which is the overall correct classification rate. One thing to note, from the notes from the last lecture: if we label the cells of the confusion matrix a, b, c, and d, then the d cell holds the correct classifications of dropouts; the model predicts them to be a one and they are actually a one. The sensitivity is the ratio of d, the number of actual ones that the model correctly predicts to be one, over the total number of actual ones in the dataset, which is b plus d. The specificity is calculated similarly, but we're focused now on the negative values, the non-dropouts. The a cell holds the correct decisions, where the model predicts non-dropout among those who did not drop out. The specificity is the ratio of the correctly classified non-dropouts divided by the total number of non-dropouts, a plus c. Knowing this, we're now going to plug in these formulas to calculate these values. Let's come back over here. We'll create a cell that will be equal to the d element, which is right here, this 29, divided by the sum of the column of ones, this value plus this value, and hit return. The specificity is calculated similarly, but focusing on the column of 0s: it equals the a cell here, divided by the sum of the a cell and the c cell. The accuracy is the overall number of cases that were correctly classified divided by the total number of cases in our test set. Correct classifications are, again, along the main diagonal.
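The three formulas just described can be written out directly from the cell labels. This sketch uses the a, b, c, d labeling from the lecture with made-up illustrative counts:

```python
# Cell labels from the lecture (rows = predicted, columns = actual):
# a = predicted 0 and actual 0, b = predicted 0 and actual 1,
# c = predicted 1 and actual 0, d = predicted 1 and actual 1.

def metrics(a, b, c, d):
    sensitivity = d / (b + d)              # true positive rate
    specificity = a / (a + c)              # true negative rate
    accuracy = (a + d) / (a + b + c + d)   # overall correct classification rate
    return sensitivity, specificity, accuracy

# Illustrative counts (hypothetical, not from the spreadsheet):
sens, spec, acc = metrics(40, 10, 10, 40)
```

With these toy counts, each metric comes out to 0.8, since 40 of the 50 actual ones and 40 of the 50 actual zeros are classified correctly.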
Forty-one is a correct classification; we'll highlight that in green. Twenty-nine is a correct classification; we'll highlight that in green as well. The accuracy is simply the sum of the correctly classified units, whether classified as a non-dropout or as a dropout, divided by the number of elements in the test dataset, which in this case is 100. We get an overall accuracy of 70 percent. The true positive rate is about 70 percent: roughly 70 percent of the actual dropouts were correctly identified by this model as dropouts. Almost 70 percent of the non-dropouts were also correctly identified as non-dropouts by this model, by virtue of a specificity score of just shy of 70 percent. This gives you an idea of how to set this up for the first model. You can do the exact same calculations for models 2 and 3 using the same procedures I've just walked through here. If you take a moment to do that, you can check your work against the next tab, which is confusion matrices and metrics. You will see that I have the confusion matrices for the first, second, and third models here, and that these metrics are also calculated for each of the models. One thing to note here is that you'll be able to see the formulas displayed, so if you get stuck along the way, the formulas are provided for each of these steps. You can recreate what I've done for model 1, for models 2 and 3, on your own. Finally, I've organized the solutions tab with the exact calculations that I would do by hand, showing you exactly how these confusion matrices were used to identify the elements of each of the formulas that we've already gone over. If we take a step back, the accuracy of the first model says that about 70 percent of the cases in our test dataset were correctly classified as either a dropout or a non-dropout.
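As a quick sanity check, the model 1 accuracy follows directly from the two diagonal counts reported above:

```python
# Model 1 accuracy from the lecture: 41 correctly classified non-dropouts plus
# 29 correctly classified dropouts on the main diagonal, out of 100 students.
correct = 41 + 29
total = 100
accuracy = correct / total   # 0.70, i.e., 70 percent
```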
Eighty-three percent of the cases in the test dataset under model 2 were correctly classified as either a dropout or a non-dropout. Finally, 71 percent of the cases in our test dataset under model 3 were predicted accurately to be either a dropout or a non-dropout. If we look at the sensitivity, again the true positive rate: among the dropouts, model 1 was able to identify 71 percent of them as dropouts, model 2 was able to do that for 78 percent of the dropouts, and model 3 was able to do that for only 59 percent of the dropouts. Finally, if we look at specificity, this tells us that model 1 was able to identify 69 percent of the non-dropouts as non-dropouts, model 2 was able to do that for 86 percent of the non-dropouts, and model 3 was able to do it for 80 percent of the non-dropouts. If we think about these three measures of data analysis quality for gathered data and we take a rather holistic view, rather than focusing on a single metric, I might want to look at whether multiple metrics are giving me the same message. It looks very clear to me that on every single metric (accuracy, sensitivity, and specificity), model 2 is better than either model 1 or model 3. Given that scenario, I might very quickly say that model 2 has the highest level of measured quality for the analysis of this gathered data compared to either model 1 or model 3, and I might choose model 2 as the final model to move forward with in reporting. I hope this example has clarified some of the calculations around the metrics we can use for measuring the quality of an analysis of gathered data. I encourage you to work through this example yourself and make use of the formulas and solutions provided here to help you make sense of these metrics and compute some of them on your own. Thanks for joining us through this example. I hope you learned a lot, and I look forward to seeing you in our next segment.
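The holistic comparison across the three models can also be sketched in code. The percentages below are the ones reported in the example; the dictionary layout and the idea of checking which model is best per metric are my own framing:

```python
# Metric values (in percent) as reported in the solutions for the example.
results = {
    "model 1": {"accuracy": 70, "sensitivity": 71, "specificity": 69},
    "model 2": {"accuracy": 83, "sensitivity": 78, "specificity": 86},
    "model 3": {"accuracy": 71, "sensitivity": 59, "specificity": 80},
}

# For each metric, find the model with the highest value.
best_per_metric = {
    metric: max(results, key=lambda name: results[name][metric])
    for metric in ["accuracy", "sensitivity", "specificity"]
}
# Model 2 is best on every metric, so it dominates the other two.
```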