Building an exploratory training set. In this lesson, our goal is to practice bias awareness while cleaning and parsing our training data. How do we approach a predictive model from scratch? Remember that we handle fairness in three different acts.

The first act is pre-processing. We need to keep bias from entering our training data as much as possible. We do this by de-biasing our collection of data, by parsing our data while watching for exclusion bias, which means we do not want to remove attributes that could be useful later on, and by selecting attributes for modeling with fairness in mind. We will basically take our best guess at how the selected attributes will work in the domain and with the algorithm, and then we can adjust later on.

Our second act is in-processing, where we will start to build adversarial models and auditing systems to make sure our attribute selection actually did a good job. Then finally, post-processing: we will judge our fairness score, try to build a model that is both accurate and fair based on that score, and adjust our results as needed.

Let's take a look at an example. Fairness Bank wants a fair set of training data and attributes to increase its overall fairness scores, so it has brought us on as a research team to build a better predictive loan model. In a real-world scenario, it is more likely that we, as researchers, would have to be the ones to champion these efforts, as most major companies are not aware of the rampant bias present in machine learning.

Let's go through this example together, starting with collecting data. The first thing to note is that we each bring different cognitive biases to the table, as we learned in a previous lesson. So we need to de-bias our team as much as possible; the more perspectives we have, the better our data will be. One example is to bring in one domain expert in loans, one fairness expert who is focused only on having the lowest amount of bias, and one machine learning scientist who is focused mostly on getting the model accurate and correct.

The next thing we are going to do is review potential data sources. Let's take a look at three different options the bank has provided, keeping an eye on the potential for bias.

The first is historical credit score mapped to loan repayment odds. The immediate red flag is that, while credit score is an important and familiar measure in our society, it could introduce sample bias. When we rely on another calculation, such as a credit score, we inherit the biases present in that scoring process. So unless we, as a team, are willing to unpack all of the different biases present in credit scoring, we are probably better off moving on to a different data source.

The next is historical income mapped to loan repayment odds. Income would be a less biased measure, and you might say it is a direct attribute: money you have is money that could go toward the loan. This is where our domain expert might tell us that historical income is great, but point out that income after expenses, such as rent and bills, would be a more accurate view of what the applicant actually has available to pay the loan with. So this is potentially a good attribute, but it could use some work.

Then finally, historical occupation data mapped to loan repayment odds. Occupation data seems like it could be a good attribute on its face, because if we have someone's occupation, we can look at how others with that occupation have paid back loans. But it might introduce some sample bias as well, because it does not reflect the complete landscape of jobs. For example, doctors and lawyers have low variance in their fields; those professions have looked very similar for decades and decades. Other fields, though, have massive swings based on societal shifts. Take the example of a professional video gamer: there might not be much data on this profession, and what data exists looks very different in, say, the year 2000, when it was just an emerging hobby, versus 2020, when professional gamers are making quite a bit of money in a very different environment. We need to take all of these things into account, and our outcome here is that we will focus on income and some of the attributes around income.
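To make the domain expert's "income after expenses" suggestion a little more concrete, here is a minimal sketch of deriving that attribute. This is only an illustration under assumed names: the columns monthly_income, monthly_rent, and monthly_bills are hypothetical placeholders, not Fairness Bank's actual schema.

```python
import pandas as pd

# Hypothetical applicant records; column names and values are made up
# for illustration and are not the bank's actual schema.
applicants = pd.DataFrame({
    "applicant_id":   [101, 102, 103],
    "monthly_income": [5200.0, 3100.0, 7800.0],
    "monthly_rent":   [1500.0, 900.0, 2400.0],
    "monthly_bills":  [600.0, 450.0, 900.0],
})

# Income after expenses: a rough view of what is actually left each month
# to service a loan payment, per the domain expert's suggestion.
applicants["income_after_expenses"] = (
    applicants["monthly_income"]
    - applicants["monthly_rent"]
    - applicants["monthly_bills"]
)

print(applicants[["applicant_id", "income_after_expenses"]])
```

The derived column, rather than raw income, is what we would carry forward as a candidate attribute.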
Now that we have de-biased some of our data collection, we are going to move on to selecting attributes. Remember, our goal here is to select attributes that boost accuracy while reducing bias. This is where the domain expert and the fairness expert will square off quite a bit.

Let's take a look at a sample row in the database, with the different columns we could potentially use as attributes. In machine learning, without getting into too much detail, we are essentially looking at a database and picking the column headers that our model should focus on in order to train effectively. Let's look at them one by one.

The first is age, which could be useful because we could weigh years in the current job more heavily based on assumed years in the workforce. We could program in that the average college graduate, if that is a checkbox on the application, graduates at 22 or 23 years of age, so that age minus the graduation age gives an estimated number of years in the workforce. That could work as an attribute, but age could also be a protected class. Our model might discriminate against young people even if they have high income. Say someone has just graduated from college, worked really hard, and landed their dream job at a big tech company; all of a sudden, we could be skewing our model against them, so we might want to treat that group as a protected class.

Next, yearly income. Yearly income looks like a solid marker at first, but if we dig into why that is, it largely comes down to observer bias: in our society, yearly income is how we tend to think about money and salary, so we have a preference for using it. But if we think about how loans actually get paid back, and this is where the domain expert can work well with the fairness researcher, we realize that loans get paid back monthly. What if this person is a freelancer or seasonal worker who earns 90 percent of their income during a three-to-four-month period? In that case, yearly income might not be the best marker, because there would be some months where the loan is paid off easily and other months where we would be in danger of defaulting. So let's potentially skip that attribute, or at least rework it into a monthly view; a rough sketch of both of these derived attributes follows this paragraph.
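Here is a small sketch of what deriving those two attributes might look like. Everything in it is an assumption for illustration: the column names (age, college_graduate, income_by_month), the fixed graduation age of 22, and the choice of "lowest monthly income" as one possible monthly view of earnings; none of this comes from the bank's real data.

```python
import pandas as pd

# Hypothetical application data; column names and values are illustrative only.
applications = pd.DataFrame({
    "age":              [23, 35, 41],
    "college_graduate": [True, True, False],
    "yearly_income":    [60000.0, 48000.0, 52000.0],
    "income_by_month": [                                   # 12 monthly totals per applicant
        [5000.0] * 12,                                      # salaried
        [0.0, 0.0, 14000.0, 16000.0, 13000.0, 0.0,
         0.0, 0.0, 0.0, 5000.0, 0.0, 0.0],                  # seasonal worker
        [4000.0, 4500.0, 4200.0, 4300.0, 4600.0, 4100.0,
         4400.0, 4500.0, 4200.0, 4300.0, 4400.0, 4500.0],   # freelancer
    ],
})

ASSUMED_GRADUATION_AGE = 22  # assumption from the lesson: roughly 22-23 for college grads

# Derived attribute 1: estimated years in the workforce (age minus graduation age),
# only where the college-graduate checkbox gives us a basis for the estimate.
applications["est_years_in_workforce"] = (
    (applications["age"] - ASSUMED_GRADUATION_AGE)
    .clip(lower=0)
    .where(applications["college_graduate"])   # NaN when we cannot estimate
)

# Derived attribute 2: the lowest month of income, which matters more than the
# yearly total when loan payments come due monthly.
applications["min_monthly_income"] = applications["income_by_month"].apply(min)

# Keep raw age out of the feature set and flag it as protected instead.
protected_attributes = ["age"]
candidate_features = ["est_years_in_workforce", "min_monthly_income"]

print(applications[candidate_features])
```

The point is not these particular formulas; it is that we replace the biased or misleading raw columns with derived attributes that better reflect how the loan actually gets repaid, while flagging age as protected rather than training on it directly.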
The next attribute is average bank account balance. This is where the domain expert might say it seems like a great attribute to focus on, because it turns out that people with this amount in their bank account tend to have a lower ratio of expenses to income. This is a good attribute and it seems relatively unbiased, so let's keep it in play.

Next, we'll move on to years in current job. This is another standard marker of stability in society, but it also introduces some prejudice bias. It is a cultural norm that people historically stay at a job for five, ten, or twenty years, while newer trends toward the gig economy and freelancing can have people in new roles every four to five months. We do not want to discriminate against people in the gig economy, so it may make more sense to remove this attribute, and with exclusion bias in mind, we might simply flag it rather than include it in our model.

That's a brief overview of some attributes we could select for this model. What are the next steps? Now that we have selected our attributes and we have our data, we move to the in-processing step, where we build an auditing model that tests how well our attributes actually work when presented with all of this training data. We decide that we really want to protect that age group, we do not want to use years in current job, and we want to focus on bank balance; so how well do we do? The adversarial model could give us a heads-up that we need to move back to the pre-processing step.

Then, once we feel good about our attributes, we move to post-processing, where we judge the fairness score, pick the model that actually makes the best decisions based on fairness, and even go to the bank and say: Fairness Bank, we will let you pick which point on the Pareto efficiency line makes the most sense for your business; that is, again, the trade-off between accuracy and fairness.

The goal here is to recognize that we have these three steps, but we need to be fluid between them. If we get a result we do not like in post-processing, we can move back to in-processing or even pre-processing. We want to re-tune the model the way you would tune the dials in your car, and while the model is never going to be perfect, we need to make it as fair as possible.
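To make that post-processing trade-off a bit more concrete, here is a small, hypothetical sketch of comparing candidate models on accuracy versus a fairness score. Everything here is assumed for illustration: the predictions are random placeholders standing in for real candidate models, and the demographic parity gap is just one possible fairness score (the lesson does not prescribe a specific one); the point where Fairness Bank picks on the Pareto line would come from exactly this kind of table.

```python
import numpy as np

# Hypothetical held-out evaluation data: true repayment outcomes and a
# protected-group indicator (e.g. the younger age group we chose to protect).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)       # 1 = repaid, 0 = defaulted
protected = rng.integers(0, 2, size=200)    # 1 = member of protected group

# Predictions from three hypothetical candidate models produced during in-processing.
candidates = {
    "model_a": rng.integers(0, 2, size=200),
    "model_b": rng.integers(0, 2, size=200),
    "model_c": rng.integers(0, 2, size=200),
}

def accuracy(y_pred):
    return float(np.mean(y_pred == y_true))

def demographic_parity_gap(y_pred):
    # Difference in approval rates between the protected group and everyone else;
    # closer to zero is "fairer" under this deliberately simple definition.
    rate_protected = y_pred[protected == 1].mean()
    rate_other = y_pred[protected == 0].mean()
    return float(abs(rate_protected - rate_other))

scores = {
    name: (accuracy(pred), demographic_parity_gap(pred))
    for name, pred in candidates.items()
}

# A candidate is Pareto-efficient if no other candidate is at least as accurate
# AND at least as fair while being strictly better on one of the two.
def pareto_efficient(name):
    acc, gap = scores[name]
    return not any(
        other_acc >= acc and other_gap <= gap and (other_acc, other_gap) != (acc, gap)
        for other_name, (other_acc, other_gap) in scores.items()
        if other_name != name
    )

for name, (acc, gap) in scores.items():
    tag = "pareto-efficient" if pareto_efficient(name) else "dominated"
    print(f"{name}: accuracy={acc:.2f}, parity_gap={gap:.2f} -> {tag}")
```

Only the Pareto-efficient candidates would be worth presenting to the bank; the dominated ones lose on both accuracy and fairness, so there is no business reason to pick them.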