I’d like to talk a little bit more about how you’re going to do your final project. You’ve got six possible inputs for your model: each applicant’s age, years at their current employer, years at their current address, current credit card debt, current automobile debt, and current income. What you’re going to want to do is try different combinations of these inputs. If you just think of them as a, b, c, d, e, f, you can try each one individually and calculate the area under the curve. If an input has a very low area under the curve, close to 0.5, then you should probably just discard it at that point. If the inputs seem to have some positive discriminatory power, then you can try combining them: a plus b, a plus c, a plus d, and so on. You could also try combining three of them. Just remember that if you do a plus b, you may want to divide by two, or for a plus b plus c, divide by three, so the results stay within the negative 3.5 to 3.5 range for the standardized values. You can also try ratios of things, and of course you can try weighted combinations: 0.7 times one input plus 0.3 times another. This is up to you. I’ve left this problem very open so that you can experiment manually: try cutting and pasting, try different things, and see what works. I feel this is a very good way to build intuition for what’s going on when you’re developing a binary classification model. Once you’ve got a model that seems to work fairly well on your training set, take that model, without changing anything, and try it on the next 200 individuals, the test set. If your model performs much worse in terms of its area under the curve, you’ve probably overfit the data in the initial training set, and you’ll want to go back and adjust your model so that the difference between how it performs on the training set and how it performs on the test set is not so dramatic.
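To make the manual experimentation concrete, here is a minimal Python sketch of trying input combinations and scoring each one by area under the ROC curve, computed with the rank-sum (Mann-Whitney) method. The columns a, b and the random data are placeholders standing in for the real standardized inputs and default labels from the assignment.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) method.
    scores: higher score = more likely positive (defaulter).
    labels: 1 for a defaulter, 0 for a non-defaulter."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    # Tied scores get the average of their ranks.
    for s in np.unique(scores):
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Placeholder data standing in for two standardized inputs and the labels:
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
a = rng.normal(size=200) + y          # an input with some signal
b = rng.normal(size=200)              # an input with no signal

# Try single inputs and simple combinations; dividing a sum by the
# number of terms keeps scores roughly in the -3.5 to 3.5 range.
candidates = {"a": a, "b": b, "(a+b)/2": (a + b) / 2,
              "0.7a+0.3b": 0.7 * a + 0.3 * b}
for name, s in candidates.items():
    print(name, round(auc(s, y), 3))
```

An AUC near 0.5 means the combination has essentially no discriminatory power and can be discarded, exactly as described above.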
Once you've got a model that's fairly robust, meaning its performance does not fall off dramatically when you move to the second set, the test set, I'd like you to find the threshold that minimizes total cost. You may remember that total cost is the number of false negatives times the cost per false negative, plus the number of false positives times the cost per false positive. In the problem we'll give you those costs, so you can find the threshold that minimizes total cost and report your average cost per event at that minimizing threshold. Then you're going to run the model on the second set, the test set, without changing the parameters, which means using the same threshold. If your threshold is 2.6 or 1.4, you need to use that same threshold, not the threshold that would minimize costs on the new data. Remember, this simulates a forecast, where you wouldn't already know the outcomes, so you have to use the old threshold, and that gives you an actual measure of your costs at that threshold. If that cost is dramatically higher than the cost on your training set, then again you have probably overfit the training set data with an overly complex model, and you'll want to go back and simplify your model. It's often the case that problems don't emerge at the area-under-the-curve stage; they only become really apparent at the cost-minimization stage. So that's the first part of part one of our course project. You are then going to look at various metrics at your threshold: the true positive and false positive rates, of course, but also the positive predictive value and negative predictive value, and the information gain of your model over the base rate. The base rate is, of course, what happens with the initial group, where everyone is granted a credit card, so you can think of this initial group as one where everyone tests negative.
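The threshold search above can be sketched as follows. The cost figures here are illustrative placeholders; the real costs per false negative and false positive come from the problem statement.

```python
import numpy as np

def total_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Total cost at a threshold: score >= threshold => classified
    positive (predicted defaulter). FN = defaulter classified negative;
    FP = non-defaulter classified positive."""
    pred = np.asarray(scores) >= threshold
    labels = np.asarray(labels).astype(bool)
    fn = np.sum(labels & ~pred)
    fp = np.sum(~labels & pred)
    return fn * cost_fn + fp * cost_fp

def best_threshold(scores, labels, cost_fn, cost_fp):
    """Sweep every observed score as a candidate threshold on the
    training set and return the one that minimizes total cost."""
    # The extra candidate above the max covers classifying everyone negative.
    candidates = np.append(np.unique(scores), np.max(scores) + 1.0)
    costs = [total_cost(scores, labels, t, cost_fn, cost_fp) for t in candidates]
    i = int(np.argmin(costs))
    return candidates[i], costs[i]

# Placeholder scores, labels, and costs:
scores = np.array([-1.0, 0.2, 0.5, 1.4, 2.6])
labels = np.array([0, 0, 1, 1, 1])
t, c = best_threshold(scores, labels, cost_fn=5000, cost_fp=500)
print("threshold", t, "total cost", c, "avg cost/event", c / len(labels))
# On the test set, reuse t unchanged -- do NOT re-optimize:
# test_cost = total_cost(test_scores, test_labels, t, 5000, 500)
```

The commented-out last line is the key discipline from the lecture: the test-set cost is computed at the training-set threshold, simulating a real forecast.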
In other words, no one is classified as a potential defaulter, and you have a large number of false negatives: those are the defaulters. You can calculate the cost per event of using no model at all. We want to compare the information gain of your model against the base rate, and the cost savings of your model against the base-rate cost. Then, to make it a little more interesting, we're going to introduce an alternative source of data: Agritopia, a predictive analytics company that provides very accurate credit scores. These scores allegedly have an area under the curve of about 0.84 to 0.85, they are for sale for a lot of money, and your boss wants to know: should the bank buy these scores? What you're going to do is use a sample of scores calculated on the test set to find the optimal threshold and minimum cost using a very simple model: simply the ranked Agritopia scores. This goes back to our cancer diagnostic example, where you have a single ranked set of scores, use them to predict the outcomes, and set an optimal threshold. You will then calculate the metrics, positive predictive value, negative predictive value, and information gain, for the Agritopia scores. I think what makes this problem really interesting is that you can figure out how much the bank should be willing to pay for the Agritopia scores: either if your model did not exist, in which case the alternative would simply be the base rate, or if the bank already has your model and the data that feeds it, the six input variables. This problem is intended to make a particular point, which is that often when we do data analytics, we don't acknowledge that the current state of knowledge is the thing we need to improve upon. Agritopia would probably like you to believe that they should get credit for all the reduction in uncertainty over the base rate.
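Here is one way those metrics could be computed. Note that the information-gain calculation below assumes the standard entropy-based definition: the uncertainty about default before classification minus the average uncertainty remaining within each predicted group, in bits. Under this definition the base rate, where everyone tests negative, yields zero information gain. If the course defines information gain differently, substitute that formula.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a binary outcome with probability p."""
    p = float(p)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def metrics(pred, labels):
    """TPR, FPR, PPV, NPV, and information gain (bits) over the base rate."""
    pred = np.asarray(pred, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    tp = np.sum(pred & labels);  fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels); tn = np.sum(~pred & ~labels)
    n = len(labels)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    npv = tn / (tn + fn) if tn + fn else 0.0
    # Base rate: everyone tests negative, so uncertainty = H(default rate).
    base = entropy(labels.mean())
    # Remaining uncertainty: entropy within each predicted group, weighted
    # by group size. P(default | pred+) = PPV; P(default | pred-) = 1 - NPV.
    remaining = ((tp + fp) / n * entropy(ppv)
                 + (tn + fn) / n * entropy(1 - npv))
    return {"TPR": tpr, "FPR": fpr, "PPV": ppv, "NPV": npv,
            "info_gain_bits": base - remaining}
```

Running `metrics` on both your model's classifications and the Agritopia-based classifications puts them on a common footing, which is what lets you value the scores against either the base rate or your existing model.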
But in reality, they should be rewarded and compensated only for the reduction in uncertainty over the model you've built, which I'm sure is pretty good. Part two of our final project is based on multivariate linear regression. It will require you to create a regression model for forecasting a continuous variable, and to include in your explanation and your model a measure of your error and the confidence intervals of your estimate. We're going to look at an interesting phenomenon in the credit card industry, which is that often the customers who are very profitable for the banks are those that tend to run up large debts, sometimes even more than they can afford, who make late payments, struggle, and often ultimately become defaulters. Yet it is possible over time for a customer who defaults to nevertheless be net profitable for the bank. Of course, the opposite is also true: if a customer scrupulously pays on time, pays minimal interest, and maintains a very small balance, the bank is providing the service of the credit card while receiving almost no interest payments. But we're focused on the former scenario: profitable customers who may nevertheless default. What we're going to do is look at the bank's records over the three-year period for the 400 individuals we've studied in terms of default versus no default, but now in terms of a new output or target variable: the actual net profits or net losses associated with each individual. What we'll find is that forecasting the actual profitability of an individual customer requires quite a different model from a default model. Of course, this model will not be perfect. So we're going to end up with a model where you'll be able to say, for a particular input: this customer has a forecast present value to the bank of, say, $2,000, plus or minus x thousand dollars at the 90% confidence interval.
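As a sketch of what that final deliverable might look like, here is an ordinary least squares fit with an approximate 90% prediction interval. It assumes roughly normal residuals and uses the normal quantile 1.645 rather than the exact t quantile; the simulated inputs and coefficients are placeholders, not the assignment's data.

```python
import numpy as np

def fit_with_interval(X, y, x_new):
    """OLS fit plus an approximate 90% prediction interval for one new
    customer. Assumes roughly normal residuals; uses the normal quantile
    1.645 instead of the exact t-distribution quantile."""
    X = np.column_stack([np.ones(len(X)), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    s = np.sqrt(resid @ resid / dof)               # residual std error
    z = 1.645                                      # ~90% two-sided quantile
    x_new = np.concatenate([[1.0], np.atleast_1d(x_new)])
    pred = x_new @ beta
    # Prediction variance combines residual noise and parameter uncertainty.
    XtX_inv = np.linalg.inv(X.T @ X)
    half_width = z * s * np.sqrt(1 + x_new @ XtX_inv @ x_new)
    return pred, half_width

# Placeholder data: 400 customers, two of the six standardized inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = 2000 + 800 * X[:, 0] - 300 * X[:, 1] + rng.normal(0, 500, size=400)
pred, hw = fit_with_interval(X, y, [0.5, -1.0])
print(f"Forecast present value: ${pred:,.0f} +/- ${hw:,.0f} (90% interval)")
```

The printed sentence is exactly the shape of the final answer described above: a point forecast plus a quantified band of uncertainty.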
That's how we'll structure the final answers, and this should tie together many of the important themes of our course: the need to make decisions while uncertainty remains in a problem, and the responsibility of the data analyst and data scientist to quantify that uncertainty when presenting their recommendations to decision makers.