0:10

Hello. This lesson is going to

Â introduce some more advanced features in the Pandas module.

Â Specifically, we're going to look at how to select rows based on features and columns.

Â We do that by masking.

Â We're going to look at how to take

Â different columns from a data frame or even a single column,

Â and separate the analysis of the rows in the data frame based on those groups.

Â We're also going to quickly look at how do you combine or

Â stack data frames together either by extending,

Â or adding new columns to an existing data frame,

Â or by adding new rows to an existing data frame.

Â And lastly, we're going to look at a powerful concept known as "pivot

Â tables" which are prevalent in the spreadsheet world.

Â We're going to look at how to do those within Pandas.

Â This is all contained in the advanced Pandas notebook.

Â The start of this notebook repeats

Â some previous work in the introduction to Panda's notebook.

Â In this case, we're going to start by grabbing data from this website,

Â and we're going to specify an index column out of the data set itself,

Â it is the airport code.

Â We can then take a look at this data frame.

Â And as you can see, our index column is now the airport code.

Â And this data set has several columns.

Â And we can select based on those.

Â We can also look at what the data types for each column are.

Â The first four are objects.

Â They're holding string values which are variable length.

Â The last two are actual numerical columns.

Â We can also provide a summary of the numerical information in the data frame.

Â This data frame was created with a labeled row index,

Â so we can actually slice the data frame

Â by selecting the rows corresponding to those labels.

Â And, of course, we can use implicit labels,

Â and we can then select columns and rows by using

Â either explicit or implicit row and column indices.

Â Now, the first thing we want to do is talk about selecting rows based on column values.

Â And the way we do, this is with masking.

Â This is exactly the same way we did it with NumPy

Â when we were selecting rows based on values of columns.

Â And we call it a "mask, " because we're saying for

Â all rows where this case the state is equal to DE,

Â which is the abbreviation for the "State of Delaware,

Â " select those rows.

Â And we could see that when we execute this code, that's what we get.

Â And we can build more complex mask's.

Â Here we're saying if the latitude column is greater than 48 and

Â the longitude is less than minus 170, select the rows.

Â There are other functions,

Â one of them is very useful called the "Sample," which just randomly selects and rows.

Â And we can sort a data frame based on the index or by a particular column.

Â And these cells show how to do that.

Â Now, the next thing that I want to look at is actually applying

Â computations to a data frame based on the value in a particular column.

Â This is called "Group By."

Â Because what we're going to do is take a data frame and effectively divide it

Â into two or more chunks based on the value of a column.

Â To demonstrate this, we're going to use the tips data set again.

Â And the first thing we're going to do is let's group by time.

Â If you recall, time can take two values, dinner and lunch.

Â So, we create a group by object simply by specifying the column that we want to use,

Â and then we operate on this new data frame group by

Â object just the same way we did as if it was a data frame.

Â So, when we take the first two rows we get the first two rows for each of the groups.

Â If we take the last three rows,

Â we get the last three rows for each of the two groups.

Â We can also operate on that.

Â We can take statistical measurements such as the median or the standard deviation.

Â And as you can see, we get the summary representation for both of them.

Â Now, notice that it's actually applying it to all the columns,

Â all of the numerical columns not just one.

Â That's because we have applied it to the entire data frame group by object.

Â We can also apply a list of functions in this manner where we specify

Â a list of the function names and the aggregate

Â applies these two together to the data frame.

Â So, we have total bill, the mean standard,

Â and then for the tip,

Â the mean and standard and then for the size, the mean and standard.

Â We can also just call the described method,

Â and of course, it breaks it down,

Â the statistics by either the lunch or the dinner.

Â So, that's a powerful way to analyze a data set and see

Â the differences between the rows based on the values of the column.

Â And so, there's some other things that you can do

Â that are talked about in that particular section.

Â The next one is stacking,

Â where we might want to extend the data frame in

Â the column direction or in the road direction.

Â And this section talks about how to do that.

Â That's somewhat straightforward.

Â Here, what we do is we take our tips data set and we split it into two chunks,

Â what we call t r one and t r two,

Â and all we've done is said the first 200 rows go into the

Â first one and the remaining rows go into the second one.

Â And we could see that here with these two tail and heads.

Â And then we can call concat to do a vertical stacking,

Â where we're saying "take

Â these two data frames which are identical in terms of columns and stack them,

Â " and you can see that result.

Â We can do the same thing in the horizontal section.

Â In this case, we split based on the column into two new column data sets.

Â And we can see how they look and then we can concat those.

Â The only difference is we specify this axis equals one to specify.

Â Now, we're going to extend it in the column direction not the row.

Â X is equal zero would be row.

Â The last thing that I want to talk about is pivot tables.

Â They're a popular technique for generating summary information.

Â The simple thing that you need to do here to create

Â a pivot table is to call the pivot table method.

Â The first thing is the column that you're going to be summarizing.

Â The next one is index,

Â this is what becomes the index of your pivot table,

Â and then what are the columns you're going to include.

Â So, here you can see that our columns are

Â the four different types of days that are in our data set.

Â And the rows are the index column.

Â And it only takes two values male or female, because it's gender.

Â And we get a summary representation in here.

Â We can also reorder those,

Â so that gender is across these columns,

Â and day is now rows.

Â We can also do aggregate information,

Â so we can have the same data as before.

Â But now, we have what are called "margins" where we have the value

Â for the entire day of the week not broken out by gender,

Â the average of all of them, or again,

Â what's the average of all the males not broken out by day of the week.

Â The pivot table can include multiple columns.

Â So, we can do this on either the index where we now have Thursday.

Â And Thursday is now broken out by is the person a smoker or not.

Â We can also do it in the column by doing this similar thing,

Â but now we specify multiple columns,

Â such that we have the gender and then we have is a person a smoker or not.

Â We can also apply aggregate functions.

Â So, here we're going to take the total bill column,

Â and we're going to say we want the max, median,

Â and mean for the total bill displayed and you can see that's what we have.

Â We have the total bill column where the max median and mean.

Â We also then specify the gender and then the rows are day,

Â because that's what the index was.

Â You see that there's a very powerful thing going on here where we've built

Â this summary information or visualization,

Â if you will, of our full Pandas data frame.

Â Is the data changed?

Â The information in the pivot table will also change and provide that summary information.

Â Hopefully, you've seen the power of

Â this data frame and of the more advanced features in Pandas.

Â It really is something you want to learn and be able to apply,

Â because it can simplify a lot of task that you might want to be doing.

Â If you have any questions,

Â let us know in the course forums. And good luck.

Â