0:08

Now we're going to use the chi-square test of independence to

test the hypothesis I proposed about smoking frequency and nicotine dependence,

from working with NESARC data.

Specifically, is how often a person smokes related to

nicotine dependence among current young adult smokers?

0:26

Or in hypothesis testing terms, are smoking frequency and

nicotine dependence independent or dependent?

That is, are the rates of nicotine dependence equal or

not equal among individuals from my different smoking frequency categories?

0:44

For this analysis,

I'm going to use a categorical explanatory variable with six levels.

The number of days smoked per month, which you may remember I called USFREQMO,

with the following categorical values.

Smoking approximately 1 day per month, 2.5 days per month, 5 days per

month, 14 days per month, 22 days per month and 30 days per month.

1:23

To run this in Python, we'll import the SciPy stats library.

Next we will request our contingency table of observed counts, which I am calling

ct1, and we'll use the Pandas crosstab function to generate these.
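As a rough sketch of what that request could look like, here is a crosstab built on a small made-up data frame. The name sub and the ten rows in it are hypothetical stand-ins for the NESARC subset; only the variable names USFREQMO and TAB12MDX come from the lecture.

```python
import pandas as pd

# Hypothetical toy data standing in for the NESARC subset of young adult smokers.
sub = pd.DataFrame({
    "USFREQMO": [1, 1, 2.5, 5, 14, 22, 30, 30, 30, 30],
    "TAB12MDX": [0, 0, 0, 1, 0, 1, 1, 1, 0, 1],
})

# Response variable down the side, explanatory variable across the top.
ct1 = pd.crosstab(sub["TAB12MDX"], sub["USFREQMO"])
print(ct1)
```

Putting the response variable first places its categories down the side of the table, which is the layout the rest of this lecture assumes.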

2:54

As an extra note, in Python, the object ct1 here is a two-dimensional

table, where the rows run along the first dimension, called axis = 0,

and the columns run along the second dimension, called axis = 1.

Summing over axis = 0 therefore adds down the rows and gives one total per column.

Finally, I request chi-square calculations,

which include the chi-square value, the associated p-value, and

a table of expected counts that are used in these calculations.

I call these calculations cs1 and ask Python to print them.
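To make the chi-square step concrete, here is a minimal sketch using SciPy's chi2_contingency. It uses only the four observed counts the lecture reads out for the lightest and heaviest smoking categories, so it is a two-column illustration and will not reproduce the six-category result. The by-hand comparison and the correction=False setting are my additions, not part of the lecture's code.

```python
import numpy as np
import scipy.stats

# Observed counts quoted in the lecture, for two of the six categories only.
# Rows: TAB12MDX = 0 (no dependence), TAB12MDX = 1 (dependence)
# Columns: about 1 day per month, 30 days per month
observed = np.array([[64, 521],
                     [7, 799]])

# correction=False turns off the Yates continuity correction that SciPy
# applies by default to 2x2 tables, so the result matches the plain formula.
chi2, p, dof, expected = scipy.stats.chi2_contingency(observed, correction=False)

# Expected counts by hand: row total * column total / grand total.
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected_by_hand = row_tot * col_tot / observed.sum()

# Chi-square by hand: sum of (observed - expected)^2 / expected over all cells.
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()

print("chi-square value:", chi2)
print("p-value:", p)
print("degrees of freedom:", dof)
print("expected counts:")
print(expected)
```

The function returns the chi-square value, the p-value, the degrees of freedom, and the table of expected counts, in that order.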

3:27

My results first include the table of counts

of the response variable by the explanatory variable.

You can see that there were 64 participants who smoked approximately

one day a month without nicotine dependence.

And seven participants who smoked once a month with nicotine dependence.

3:47

At the other end of the table, among those smoking daily, that is 30 days a month,

521 participants do not have nicotine dependence.

And 799 do have nicotine dependence.

4:12

Examining these column percents for those with nicotine dependence,

that is, TAB12MDX = 1, we see that as smoking frequency increases,

the rate of nicotine dependence also increases.
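The column percents can be sketched as follows, again using only the two columns of counts read out in the lecture (about 1 day per month and 30 days per month). The names ct, colsum, and colpct are illustrative. With these counts, roughly 10% of the once-a-month smokers and roughly 61% of the daily smokers have nicotine dependence.

```python
import pandas as pd

# Two of the six columns from the lecture's table of observed counts.
ct = pd.DataFrame(
    [[64, 521], [7, 799]],
    index=pd.Index([0, 1], name="TAB12MDX"),
    columns=pd.Index([1.0, 30.0], name="USFREQMO"),
)

# Summing over axis=0 adds down the rows, giving one total per column.
colsum = ct.sum(axis=0)

# Dividing by the column totals turns each count into a column proportion.
colpct = ct / colsum
print(colpct.round(3))
```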

Now, looking at the chi-square results, the chi-square value is large, 165.

And the P value, shown in scientific notation, is quite small.

Approximately 7.4e-34.

Which clearly tells us that smoking frequency and

nicotine dependence are significantly associated.

So why did we calculate the column percents?

To better understand this choice, let's look at three different tables that pull

apart the different numbers represented in a cross-tabs contingency table.

For example, we're gonna use percentages from a chi-square table examining

the distribution of insured and uninsured individuals by geographic region.

5:09

Table A shows row percentages.

Each cell includes the percent of observations within each row,

that is, within each region (Northeast, Midwest, South and West),

that are either insured or uninsured.

5:26

As you can see,

adding across the rows gives us 100% of the observations within region.

Table B includes the total percent of observations in each cell.

Here, the percentages across all cells of the table add up to 100%.

Finally table C shows column percentages.

Each cell includes the percent of observations within each column,

that is, within each group, either insured or uninsured.
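The three kinds of percentages can be compared side by side in a short sketch. The counts below are invented purely for illustration and are not the figures shown in Tables A, B, and C; the names counts, row_pct, total_pct, and col_pct are likewise hypothetical.

```python
import pandas as pd

# Invented counts: regions down the side, insurance status across the top.
counts = pd.DataFrame(
    [[80, 20], [70, 30], [60, 40], [90, 10]],
    index=pd.Index(["Northeast", "Midwest", "South", "West"], name="region"),
    columns=pd.Index(["insured", "uninsured"], name="coverage"),
)

# Row percentages (like Table A): each row adds up to 100.
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100

# Total percentages (like Table B): all cells together add up to 100.
total_pct = counts / counts.values.sum() * 100

# Column percentages (like Table C): each column adds up to 100.
col_pct = counts.div(counts.sum(axis=0), axis=1) * 100

print(row_pct.round(1))
print(total_pct.round(1))
print(col_pct.round(1))
```

The only difference among the three is the denominator: the row total, the grand total, or the column total.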

6:00

So which of these percentage types should we calculate when trying to interpret

the chi-square results for smoking frequency and nicotine dependence?

If the output is set with the explanatory variable categories across the top of

the table, and response variable categories down the side,

it will be the column percent that we want to interpret.

6:20

In other words, we're interested in whether the rate of nicotine dependence

differs according to which explanatory group the observations belong to, that is,

which smoking frequency group.

Notice that we are not interested in the column percentages for

those observations without nicotine dependence,

indicated with a dummy code of 0.

Instead, we're interested in describing the presence of nicotine dependence within

the smoking frequency groups; that is, these column percentages circled in blue.

If I want to graph the percent of young adult smokers with nicotine dependence

within each smoking frequency category, I would first import the seaborn and

matplotlib.pyplot libraries and then add the following code.

First, setting our explanatory variable to categorical and

our response variable to numeric.

And then requesting a bivariate bar chart,

with smoking frequency categories on the x-axis, and the mean for

nicotine dependence, which is the proportion of ones, on the y-axis.
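The data preparation and the quantity that the bar chart displays can be sketched without any plotting library. This is a pandas-only illustration on made-up rows, not the lecture's seaborn code; the data frame sub and the series rates are hypothetical names. It shows why the mean of a 0/1 response is the proportion of ones in each group.

```python
import pandas as pd

# Hypothetical toy data; the response is stored as text to show the conversion.
sub = pd.DataFrame({
    "USFREQMO": [1, 1, 2.5, 5, 14, 22, 30, 30, 30, 30],
    "TAB12MDX": ["0", "0", "0", "1", "0", "1", "1", "1", "0", "1"],
})

# Explanatory variable to categorical, response variable to numeric.
sub["USFREQMO"] = sub["USFREQMO"].astype("category")
sub["TAB12MDX"] = pd.to_numeric(sub["TAB12MDX"], errors="coerce")

# The mean of a 0/1 variable within each group is the proportion of ones,
# which is the bar height for that smoking frequency category.
rates = sub.groupby("USFREQMO", observed=True)["TAB12MDX"].mean()
print(rates)
```

A bivariate bar chart of TAB12MDX by USFREQMO draws one bar per category at these heights.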

Now I can visualize the association, and see even more clearly that there seems

to be a positive relationship, that is, the more days per month a young adult

smokes, the more likely they are to have nicotine dependence.

I know from looking at the significant P value

that I will reject the null hypothesis and accept the alternate hypothesis

that not all nicotine dependence rates are equal across smoking frequency categories.

If my explanatory variable had only two levels,

I could interpret the two corresponding column percentages and be able to say

which group had a significantly higher rate of nicotine dependence.

But my explanatory variable has six categories.

So I know that not all are equal.

But I don't know which are different and which are not.