So, let's just run the code. Once again, I'm going to take a shortcut: the script is sitting right there for me as "NaiveBayes", and you'll download it into your folder. The good thing is that the data comes preloaded. Since there's old data from before, it's good practice to clear out all the previous objects I had, which is to say, just wipe the workspace.

So, we're back here. I'm going to start with the install, since we're installing the package. When it asks whether you want to restart R, you say "yes," and then it installs the package. It's already installed on my machine, but I'm showing you what to do on yours. You then invoke the library called "pacman." With pacman installed and loaded, you can use p_load to both install and load many packages in one go. It's a nice way of loading; here they're already there, so nothing has to be installed. (A minimal sketch of this setup follows this walkthrough.)

Now, it gives you the dataset: 1470 rows and 31 variables, you can see that. You can also ask where it's coming from. It's available because you said "library"; the data ships with the rsample library. If the package were not loaded, you would not be able to access this data. Then the script subsets it right here, choosing just nine variables, remember, including Attrition. Then we set a random seed, and we convert variables from numerical to categorical, which makes things easier for Naïve Bayes. We split as usual: create a sample index, use it to create a training set, and use the rest to create a test set. Then, the same thing for the attrition labels: we create the training set and the test set of attrition values. (This is sketched below as well.)

This is the only place where we fit the Naïve Bayes model. The first time, we run it on just four variables. Then we simply call the fitted model and ask, "what does it contain?" The call prints a set of tables. Together with the class priors, these tables are exactly the conditional probabilities we calculated by hand earlier. Look at EducationField: "no" and "yes" are the two classes, and the table tells you, given "no," what the Education Field distribution is. If you look at that row, it's 1 percent human resources, 41 percent life sciences, 10 percent marketing, 32 percent medical, 5 percent other, and 8 percent technical degree. So, it is just telling you, given that the person is not going to leave, the conditional probability that they belong to each of these classes. Every value in these tables, which you can examine at leisure, is nothing but one of the probabilities you computed before.

Now it computes the accuracy. You can see the accuracy it shows is 68.59 percent. It also produces something called a confusion matrix, and we will look at the confusion matrix properly in Session 8, but briefly: one axis is the actual value and the other is the predicted value. Out of the 678 people who would not leave, it predicts most of them correctly, but it also makes mistakes: it says that some of the people who will not leave will leave. We will study these issues in Session 8; at the moment, we are only looking at accuracy.

Next, we compute the accuracy on the test set, and you will see it produces the same kind of output: it gives you an accuracy of 68.2 percent. Look at the confusion matrix; the numbers here are smaller because your test set is smaller than your training set. (The model fitting and accuracy steps are also sketched below.) So, you will ask me, "What is accuracy?"
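Here is a minimal sketch of the setup step. The transcript names only pacman; the other package names (e1071 for naiveBayes(), rsample for the attrition data) are my assumptions, since the script isn't shown.

```r
# One-time install of pacman itself (say "yes" if R asks to restart)
install.packages("pacman")
library(pacman)

# p_load() installs any package that is missing, then loads them all in one go
p_load(e1071, rsample)   # assumed: e1071 for naiveBayes(), rsample for the data
```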
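And a sketch of the subsetting, conversion, and splitting steps. The particular nine variables, the bin counts, and the split proportion are assumptions; the proportion is chosen so the test set has roughly 330 rows, matching the walkthrough.

```r
library(pacman)
p_load(e1071, rsample)

data("attrition", package = "rsample")   # 1470 rows, 31 variables
# (newer versions moved this dataset to the modeldata package)

# Keep nine variables including the label; this particular set is illustrative
vars <- c("Attrition", "Age", "EducationField", "Department", "JobRole",
          "MaritalStatus", "MonthlyIncome", "OverTime", "YearsAtCompany")
dat <- attrition[, vars]

# Convert the numeric predictors to categorical bins, as the script does
dat$Age            <- cut(dat$Age,            breaks = 4)
dat$MonthlyIncome  <- cut(dat$MonthlyIncome,  breaks = 4)
dat$YearsAtCompany <- cut(dat$YearsAtCompany, breaks = 4)

# Reproducible train/test split (seed and proportion assumed)
set.seed(123)
idx   <- sample(nrow(dat), size = round(0.775 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]
train_y <- train$Attrition
test_y  <- test$Attrition
```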
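Finally, a sketch of fitting the model and reading off the accuracy, continuing from the code above and assuming e1071's naiveBayes(). Which four variables go into the first model is an assumption, so the exact numbers won't reproduce the walkthrough's 68.59 and 68.2 percent.

```r
# Model 1: four predictors only
fit4 <- naiveBayes(Attrition ~ Age + EducationField + Department + JobRole,
                   data = train)
fit4   # printing shows the class priors and the conditional-probability tables

# Training accuracy and confusion matrix
pred_tr <- predict(fit4, train)
table(predicted = pred_tr, actual = train_y)   # confusion matrix
mean(pred_tr == train_y)                       # accuracy (~0.686 in the walkthrough)

# Same thing on the test set
pred_te <- predict(fit4, test)
table(predicted = pred_te, actual = test_y)
mean(pred_te == test_y)

# Model 2: all eight predictors
fit8 <- naiveBayes(Attrition ~ ., data = train)
mean(predict(fit8, test) == test_y)
```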
Well, think of the errors first: "no" being classified as "yes," or "yes" being classified as "no," the mistakes it's making; accuracy is what's left. So, think of it: it has made 77 plus 27, that is 104 mistakes, out of approximately 330. That is a bit over 30 percent, and therefore the accuracy is about 68 percent. That's the way to think of it: the two off-diagonal values are what it's misclassifying. We will put proper labels on them later, but as far as we're concerned, the model predicts 68 percent correctly. Is it good or bad? I'll leave that to you, and we will study more about it. (A quick arithmetic check follows below.)

So, now in this part, we use all the attributes for training, and we find the accuracy again. You can see the accuracy has gone up, to 72.7 percent, which means the additional variables are giving you only another 4 percent or so; using just these eight variables, out of the 30 available, is good enough as far as you're concerned. On the test set it is even better: it's able to predict 78 percent correctly. You can see in this little table, if you read it carefully, only 72 elements are misclassified out of 330, and therefore your accuracy is really quite sharp.

Coming back, I will let you play with it, and I will actually give you a dataset to work with. Hopefully, this gives you confidence in doing similar projects. I'm going to give you a slightly more difficult exercise than before, because you have to go modify the script, so be careful: save this script somewhere before you do this. The first question is: can you use the Real_estate data and perform Naïve Bayes, yes or no? If yes, do you need to transform your input variables from numerical to categorical, or not? Then we ask you to perform Naïve Bayes to predict the house prices, and obviously the answer to the first question is yes, otherwise we wouldn't be asking; the answer to the second is also yes, you have to transform. So, you just have to replicate the transformation steps we did in the script (a sketch of that step follows below). Use the Real_estate data, do Naïve Bayes, and maybe compare it with the KNN method. Then flip it for employee attrition, where we did Naïve Bayes: you do the nearest-neighbor, find the best k, transform the data, and compare it with how Naïve Bayes performed. So, basically, we are asking you to flip: for the Real_estate data, do Naïve Bayes; for the attrition data, do KNN; compare the results; and report.

So, in summary: simple rules are often quite powerful, at least for human beings. But frankly, simple rules can sometimes lead to bias, so it's better to use data to infer the rules. Instead of only saying, "based on my experience, we think such-and-such," we can seek the rules from past data, or from past experience, or better still, a combination of both. That's the idea: develop these rules, and once you develop them, you can use them for predicting, for the new customer, for the new student, for a new email. Of course, if you have too many features, most of these methods run into problems on any data, because distance becomes a problem and the data becomes very sparse; finding a similar person, or a similar object, or a similar student becomes very difficult when the feature vectors are so long that I can't match one person to another and they all look equally far apart. I have given you some readings, and if you're interested, you can read a lot more to understand this subject.
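Here is the quick arithmetic check on the accuracy, using the counts from the walkthrough:

```r
# Accuracy from the test-set confusion matrix in the walkthrough:
# 77 + 27 = 104 misclassified out of roughly 330 test rows
errors <- 77 + 27
n_test <- 330
1 - errors / n_test        # ~0.685, i.e. about 68 percent accuracy

# Equivalently, for any confusion matrix cm:
# sum(diag(cm)) / sum(cm)  # correct predictions over all predictions
```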
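And a small hint for the Real_estate exercise, showing the transformation step only. The file path and the column name "house_price" are hypothetical; use whatever names your Real_estate file actually has.

```r
library(pacman)
p_load(e1071)

# Naive Bayes, as used here, wants a categorical target, so bin the numeric
# house price first. Column name "house_price" is hypothetical.
real_estate <- read.csv("Real_estate.csv")   # path assumed
real_estate$price_band <- cut(real_estate$house_price,
                              breaks = 3,
                              labels = c("low", "medium", "high"))

# Then fit exactly as before, predicting the band instead of the number
fit <- naiveBayes(price_band ~ .,
                  data = subset(real_estate, select = -house_price))
```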