Welcome to the demo. This is 3.1.1, the demo on imputing missing values. Our goal here is to demonstrate imputing with the median of a feature column. So what we're going to do is basically find a column that has missing values, find the median of that column, and then fill in those missing values with the median. Okay, let's first start off by running our %run cell. What this is going to do is bring all of our helper functions into this notebook, as well as go fetch our data. Great. Now that that's finished, we can go ahead and prepare the data. What we want to be doing in this notebook specifically is predicting a customer's lifestyle based on their recorded metrics. So we are interested in a table that has user-level data. In order to prepare a table for this, we're going to aggregate our ht_user_metrics table. Let's go ahead and run this cell, and we'll see we have the %sql magic. That just lets Databricks know that we're about to write SQL within this notebook instead of Python. Great. Okay, so we're going to select our columns and take the average based on a grouping of device ID, so that we have a user-level table, and then let's do a SELECT * so we know what our table looks like. Okay, great. So we've got columns like average resting heart rate, average active heart rate, BMI, average VO2, average workout minutes, number of steps, and then lifestyle, which is the column we're going to be predicting. So we can think of all of these as our features, and lifestyle as our target. Okay, we're going to convert our Spark DataFrame over to pandas, and we're doing that because this dataset is small enough that it can fit in memory. With larger datasets, we'd want to keep using a Spark DataFrame, utilizing Spark's commands to spread the workload out across our cluster.
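The aggregation step described above can be sketched in pandas (the demo does it in Spark SQL before converting with toPandas; the table and column names here are tiny stand-ins based on the transcript, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for the ht_user_metrics table
raw = pd.DataFrame({
    "device_id": [1, 1, 2, 2],
    "resting_heartrate": [60.0, 62.0, 70.0, 72.0],
    "lifestyle": ["Athlete", "Athlete", "Sedentary", "Sedentary"],
})

# Equivalent of: SELECT device_id, lifestyle, AVG(resting_heartrate)
#                FROM ht_user_metrics GROUP BY device_id, lifestyle
# -- one aggregated row per user/device
user_level = (
    raw.groupby(["device_id", "lifestyle"], as_index=False)
       .agg(avg_resting_heartrate=("resting_heartrate", "mean"))
)
print(user_level)
```

With a real Spark DataFrame, the equivalent would be `spark.sql(...)` followed by `.toPandas()` once the result is small enough to fit in memory.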
Okay, also a quick side note: our dataset is actually synthetic, which in most cases throughout these lessons makes a lot of sense. However, our task here is to fill in missing values, and our dataset doesn't have any, because nobody created it with missing data. So what we're going to do is artificially inject some missing values into our dataset. In practice, though, you wouldn't do this; you'd probably already have some missing data. We're going to use NumPy to achieve that, setting 18% of our data to NaN values. Okay, cool. So now when we look at our data, we can see that every now and then we have some NaN values in our average resting heart rate. Okay. Now we want to know: how many missing values do we actually have? We know that we have 18%, because we just injected that, but in practice we wouldn't actually know how many missing values we have. So, there are a couple of different ways to go about this. The first method is to chain together the isna method with the sum method. The reason this works is that in Python, True is equal to 1 and False is equal to 0, so since NaNs show up as True in the isna result, when we call sum we're asking pandas to count them up. Okay, so you see here that we have 540 missing values we'll need to deal with. Another way to do this is to think about the totality of our dataset: we can calculate how many rows, or observations, our dataset has, and then count up how many non-nulls we have. This can be helpful if you're thinking about how many total observations you have. So what we're going to do is put together a print statement that lets me know the number of observations I have, along with the .info method, and the .info method will put together several different statistics about our dataset.
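The inject-then-count pattern looks roughly like this (the column name and the seed are assumptions for illustration; the demo's actual injection code may differ):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(921)  # seed is an assumption, for reproducibility
df = pd.DataFrame({"avg_resting_heartrate": rng.normal(62, 8, size=1000)})

# Artificially set ~18% of the column to NaN, mirroring the demo
mask = rng.random(len(df)) < 0.18
df.loc[mask, "avg_resting_heartrate"] = np.nan

# Method 1: chain isna() with sum() -- True counts as 1, False as 0
n_missing = df["avg_resting_heartrate"].isna().sum()
print(f"{len(df)} observations, {n_missing} missing")

# Method 2: total observations plus per-column non-null counts
df.info()
```

The exact count of NaNs will vary with the seed, which is the point of the demo: in practice you count them, you don't already know.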
First off we have the column name, and next we have our non-null counts, so these are the values that are non-null. And again, just to reiterate, we had 540 null values, so this is kind of the flip of that: these are the non-null values. And then we have the data type, which is also helpful if we need to think through how we're going to impute a value. In this case we have a float, so we'll need to impute this value with some sort of numerical value. Okay, great. So, method number three would be to filter down the dataset, and this is helpful if we need to know which rows in particular are missing, maybe because we want to examine some sort of pattern. So in this case, I can see that, say, row 29 has a missing value. If I started to group that together with all the missing values, I might find that the missingness has to do with a particular device ID that was spitting out NaN values, so this is kind of helpful in our EDA process. Now that we know which values are missing and how many, we can fill in those missing values with an appropriate value. The first thing we need to know is what our DataFrame's features look like, and of course there are a couple of different methods to go about this. Let's get some summary stats by first running describe. When we run describe in pandas, we get back the columns that are numeric, and for those numeric columns we get a number of different stats. So we have the mean, the standard deviation, the minimum value, the maximum value, the median value, and then some count information too. In this case, we're most interested in this average resting heart rate column, and we can see that our mean value is 62.33 and our median value is 58.61. Okay, great. So, method 2: a Spark DataFrame also has a version of this, right?
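The filtering method and the describe output can be sketched like this (a toy column standing in for the real data; note that in describe's output the median appears as the 50% row, and count excludes NaNs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"avg_resting_heartrate": [58.0, np.nan, 61.0, 95.0, np.nan, 59.0]})

# Method 3: filter down to the rows that are missing, to inspect patterns
missing_rows = df[df["avg_resting_heartrate"].isna()]
print("missing row indices:", missing_rows.index.tolist())

# Summary stats for the numeric column; "50%" is the median
stats = df["avg_resting_heartrate"].describe()
print(stats)
```

Here the count (4) plus the number of missing rows (2) adds back up to the total observations, the same flip-side relationship .info shows.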
So if we're using a Spark DataFrame, we can use Spark SQL to do a SELECT * from our table and then call summary on that, and we should get back something similar. And then over here on the far right, we can see that the Spark DataFrame version, different from our pandas version, includes the lifestyle column by default. So we can see that we have a null here where we'd have a mean value, because this is a string; there's no such thing as a mean string. If we wanted to do that with pandas, we'd add include='all', and that would also give us the lifestyle column with some summary stats on it, while leaving nulls for the stats that don't apply to non-numeric columns. Okay, cool. Regardless of which method we use, we see that we have some missing values, right? And we see specifically here that we have a mean and a median that are different. This usually indicates skew, and so in this case we're going to impute the median, because it might be more like a common value, right? But before we even get into that, we'd want to do a train test split on our dataset, because if we take the median from the entire column of our dataset, then we might be leaking information about our test set into our training set. So what we do is take the median from the training set, and then fill in the missing values in the test set with that median obtained from the training set. Okay, so scikit-learn has a built-in method for that: train_test_split. We're going to create our X, which is all columns except for our lifestyle column, and then our target column, our y, which is just the lifestyle column. We'll create X train, X test, y train, y test variables, and use the train_test_split function to split up our data. I'm going to use a test size of 30% and then just set a random state of 921. Okay, great.
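The split described above can be sketched with scikit-learn's train_test_split (the feature values here are placeholders; only the shape of the call matters):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "avg_resting_heartrate": np.arange(100, dtype=float),
    "lifestyle": ["Athlete", "Sedentary"] * 50,
})

X = df.drop(columns=["lifestyle"])  # all columns except the target
y = df["lifestyle"]                 # the target column

# 30% test split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=921)
print(len(X_train), len(X_test))
```

Splitting before computing the median is the leakage guard: every statistic used for imputation comes from X_train only.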
So let's check out the median now. I can see that I have a median value of roughly 58, about the same as the full DataFrame's. What I'm going to do is save that out to a variable, and I'm saving it to a variable so that I can recall it later; I'll show how that works. Okay, as for filling in missing values, there are of course several different methods. The first method is probably the simplest, both from a coding perspective and for just wrapping our heads around things. What we're going to do is overwrite the average resting heart rate column with the same column where we've filled in the missing values with that variable we just created, the average resting heart rate median. We saved out that variable right here, so that when it came time to do the fillna, we didn't have to write out all of this code, which makes it a little easier on the eyes. Okay, we're also going to fill in the test set, filling with that value we obtained from the training set, like we said a moment ago. And when we run this, we see that we get a SettingWithCopyWarning. That lets us know that our DataFrame is a copy of something, and sometimes that can throw errors; but it's a warning, a non-fatal error, so we can suppress it. Okay, the second method is to use a .loc lookup. What we're going to do is save out the index, or in other words the row numbers, where we had missing values. So we're looking for anywhere where isnull is True, and then we're saving that index as an array, and we're doing that for both the train and the test set. Then we're going to use .loc with that index: we look up those row numbers in the average resting heart rate column, and fill those in with the median value. Okay, great.
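Both imputation methods can be sketched on toy train/test frames (column and variable names are my approximations of the notebook's):

```python
import numpy as np
import pandas as pd

X_train = pd.DataFrame({"avg_resting_heartrate": [58.0, np.nan, 60.0, 62.0]})
X_test = pd.DataFrame({"avg_resting_heartrate": [np.nan, 64.0]})

# Median computed on the TRAINING set only, saved to a variable for reuse
train_median = X_train["avg_resting_heartrate"].median()

# Method 1: fillna, overwriting the column with its filled-in version
X_train["avg_resting_heartrate"] = (
    X_train["avg_resting_heartrate"].fillna(train_median)
)

# Method 2: save the index of missing rows, then fill via .loc
na_idx = X_test[X_test["avg_resting_heartrate"].isna()].index
X_test.loc[na_idx, "avg_resting_heartrate"] = train_median
```

Note the test set is filled with train_median, never its own median, matching the leakage argument above.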
Regardless of which method we choose, we want to make sure that we examine our missing data and double-check our work. So I'm going to run that sum, and I can see here that I have no missing values. I can also do the same thing on the test set: if I run it, it should also come back zero. Okay, great. There are a couple of other methods we could use. Like we mentioned at the very top of this notebook, we could fill in the mean if we wanted to; we'd just swap out mean instead of median, chained together with our missing data, and then fill that in with the same fillna pattern we have above. If you want to practice, I've left a cell in here; you could try that out with the lifestyle column where there's missing data. Okay, excellent. At the end of this, you should now progress over to the knowledge check.
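The verification step and the mean-imputation alternative can be sketched together (toy values; swapping .median() for .mean() is the only change to the pattern above):

```python
import numpy as np
import pandas as pd

s = pd.Series([50.0, np.nan, 70.0, np.nan], name="avg_resting_heartrate")

# Mean imputation: same fillna pattern, mean swapped in for median
filled = s.fillna(s.mean())

# Double-check the work: no missing values should remain
print(filled.isna().sum())
```

For a categorical column like lifestyle, neither mean nor median applies; the analogous fill would be something like the mode, but that's beyond what this demo covers.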