In this video, we will learn how to find out whether there is a relationship between two categorical variables. When dealing with two categorical variables, we can't use the correlation method that we used for continuous variables; instead, we use the chi-square test for association. The chi-square test is intended to test how likely it is that an observed distribution is due to chance. It measures how well the observed distribution of data fits the distribution that is expected if the variables are independent.

Before we go into an example, let's go through some important points. The chi-square test's null hypothesis is that the variables are independent. The test compares the observed data to the values the model expects if the data were distributed across the categories by chance. The further the observed data departs from the expected values, the stronger the evidence that the variables are dependent, and so the stronger the evidence against the null hypothesis. The chi-square test does not tell you the type of relationship that exists between the two variables, only that a relationship exists.

We will use the cars dataset. Suppose we want to test the relationship between fuel-type and aspiration. These are categorical variables: the fuel-type of the car is either gas or diesel, and the aspiration is either standard or turbo. To do this, we first find the observed counts of cars in each category. This can be done by creating a crosstab using the pandas library. A crosstab is a table showing the relationship between two or more variables. When the table shows the relationship between only two categorical variables, it is also known as a contingency table. In our case, the crosstab or contingency table shows us the counts in each category: a standard car with diesel fuel, a standard car with gas fuel, a turbo car with diesel fuel, or a turbo car with gas fuel.
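As a rough sketch of the crosstab step, the snippet below builds a small stand-in for the cars dataset and tabulates it with pandas. The transcript does not state the individual cell counts, so the counts used here (7, 13, 161, 24) are an assumption, chosen only to agree with the totals quoted in the video (20 diesel, 185 gas, 168 standard, 37 turbo, 205 cars). The value labels `std` and `turbo` and the helper names are also assumptions.

```python
import pandas as pd

# ASSUMPTION: cell counts are not given in the transcript; these are
# illustrative values consistent with the totals it quotes.
assumed_counts = {
    ("diesel", "std"): 7,
    ("diesel", "turbo"): 13,
    ("gas", "std"): 161,
    ("gas", "turbo"): 24,
}

# Expand the counts into one row per car, mimicking the raw dataset.
rows = [
    {"fuel-type": fuel, "aspiration": asp}
    for (fuel, asp), n in assumed_counts.items()
    for _ in range(n)
]
df = pd.DataFrame(rows)

# Contingency table: rows are fuel-type, columns are aspiration.
observed = pd.crosstab(df["fuel-type"], df["aspiration"])
print(observed)

# margins=True appends an "All" row and column holding the totals.
with_totals = pd.crosstab(df["fuel-type"], df["aspiration"], margins=True)
print(with_totals)
```

With the real dataset you would skip the construction step and pass the two columns of the loaded DataFrame straight to `pd.crosstab`.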
The formula for chi-square is given as follows: the summation, over all cells, of the observed value (that is, the count in each group) minus the expected value, all squared, divided by the expected value. Expected values are based on the given totals; that is, what would we expect the individual cells to be if we did not know the observed values?

To calculate the expected value for a standard car with diesel fuel, we take the row total, which is 20, multiplied by the column total, 168, divided by the grand total of 205. This gives 16.39. If we do the same thing for turbo cars with gas fuel, we take the row total, 185, multiplied by the column total, 37, and divide by the grand total, 205, and we get 33.39. If we repeat the same procedure for all of the cells, we get these values. If we take the row totals, column totals, and grand total of the expected values, we get the same totals as for the observed values.

Now, going back to this formula, if we take the summation of the observed minus the expected values, all squared, divided by the expected value, we get a chi-square value of 29.6. On the chi-square table, we look at the row for degrees of freedom equal to 1 and find where 29.6 falls. Here we can see that 29.6 is larger than the critical value for a p-value of 0.05, which is 3.84 at one degree of freedom. Therefore, we can say the p-value is less than 0.05. Since the p-value is less than 0.05, we reject the null hypothesis that the two variables are independent, and we conclude that there is an association between fuel type and aspiration.
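The hand calculation can be sketched in a few lines of NumPy. The observed counts below are an assumption (the transcript gives only the totals), chosen to match the quoted row totals of 20 and 185 and column totals of 168 and 37. Under that assumption, the plain formula gives a statistic of roughly 33.0, while the 29.6 quoted in the video is what you get after applying Yates' continuity correction, which shrinks each absolute difference by 0.5 in a two-by-two table.

```python
import numpy as np

# ASSUMPTION: illustrative observed counts, consistent with the quoted totals.
# Rows: diesel, gas. Columns: standard, turbo.
observed = np.array([[7.0, 13.0],
                     [161.0, 24.0]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1): 20 and 185
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2): 168 and 37
grand_total = observed.sum()                      # 205

# Expected count per cell = row total * column total / grand total.
expected = row_totals * col_totals / grand_total

# Plain chi-square: sum of (observed - expected)^2 / expected.
chi2_plain = ((observed - expected) ** 2 / expected).sum()

# Yates' continuity correction for a 2x2 table: reduce |O - E| by 0.5.
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# Degrees of freedom = (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(np.round(expected, 2))
print(round(chi2_plain, 2), round(chi2_yates, 2), dof)
```

The expected table has the same row, column, and grand totals as the observed table, which is a quick sanity check on the arithmetic.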
To do this in Python, we will use the chi2_contingency function in the scipy.stats package. The function returns the chi-square test value, 29.6; the second value is the p-value, which is very close to 0; and the third is the degrees of freedom, which is 1. If you remember, the chi-square table did not give an exact p-value but a range in which it falls; Python gives the exact p-value. We see the same results as on our previous slides. It also returns the expected values, which we calculated by hand. Since the p-value is close to zero, we reject the null hypothesis that the two variables are independent and conclude that there is evidence of association between fuel-type and aspiration.
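A minimal sketch of that call follows. The observed counts are again an assumption, since the transcript gives only the totals; in practice you would pass the crosstab built from the dataset. Note that `chi2_contingency` applies Yates' continuity correction by default when there is one degree of freedom, which is made explicit here with `correction=True`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# ASSUMPTION: illustrative observed counts, consistent with the quoted totals.
# Rows: diesel, gas. Columns: standard, turbo.
observed = np.array([[7, 13],
                     [161, 24]])

# Returns the test statistic, the p-value, the degrees of freedom,
# and the table of expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=True)

print("chi-square:", round(chi2, 2))
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts:")
print(np.round(expected, 2))

# Decision at the conventional 0.05 significance level.
if p_value < 0.05:
    print("Reject the null hypothesis of independence.")
else:
    print("Fail to reject the null hypothesis of independence.")
```

Passing `correction=False` would instead give the uncorrected statistic from the plain formula, which is somewhat larger for these counts.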