This method is very similar to the post hoc pairwise comparisons

that you may have conducted as a follow-up to running an analysis of variance

in the second course of this specialization, Data Analysis Tools.

That is, reference group coding allows us to compare

each group of our explanatory variable, referred to as the comparison groups,

to another group, which is referred to as the reference group.

For example, if our response variable is the number of nicotine dependence

symptoms, reference coding allows us to compare number of nicotine dependence

symptoms for each group of our categorical variable to a designated reference group.

However, unlike an analysis of variance post hoc test, for

which we conduct the comparisons after testing the ANOVA,

the comparisons are part of the estimation of the multiple regression model.

This allows us to examine explanatory variable group differences on the response

variable after adjusting for the other explanatory variables in the model.
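
To make the idea of reference group coding concrete, here is a minimal sketch of what the coding does behind the scenes. The helper name treatment_code and the six example values are made up for illustration; they are not part of the course code. Each comparison group gets its own 0/1 indicator column, and the reference group is the one with zeros everywhere.

```python
def treatment_code(values, reference):
    """Build one 0/1 indicator per non-reference level (hypothetical helper)."""
    # Every level except the reference becomes a comparison group.
    levels = sorted(set(values) - {reference})
    # A row is all zeros when the observation belongs to the reference group.
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Made-up category codes for six observations.
ethrace = [0, 1, 2, 3, 1, 0]
levels, rows = treatment_code(ethrace, reference=0)

for value, row in zip(ethrace, rows):
    print(value, row)
```

Because these indicator columns enter the regression alongside the other explanatory variables, each indicator's coefficient is a comparison to the reference group that is already adjusted for everything else in the model.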

To demonstrate how to analyze a categorical explanatory variable with

three or more categories, we will return to our NESARC data multiple regression

analysis, predicting number of nicotine dependence symptoms from

multiple explanatory variables.

We could also add an ethnicity-race explanatory variable.

Our ethnicity-race variable has four categories coded 0 = Hispanic,

1 = non-Hispanic White, 2 = non-Hispanic Black, and

3 = non-Hispanic Other ethnic or racial group.

In this example, what we want to know is whether Hispanic individuals have more or

fewer nicotine dependence symptoms compared to individuals from the other

three racial or ethnic groups.

That is, we want to compare Hispanic individuals, the reference group,

to individuals from the other racial and ethnic groups, the comparison groups,

on the number of nicotine dependence symptoms after controlling for

the other explanatory variables in the model.

To do this, we will use the same smf.ols function

that we used to test our earlier multiple regression model.

So we have our regression equation for which our NDsymptoms response variable

is being predicted by the explanatory variables DYSLIFE, MAJORDEPLIFE,

numbercigsmoked_c, age_c, SEX.

We add our ethnicity race variable,

ETHRACE, to the list of explanatory variables.

But to tell Python that it is a categorical variable,

we need to type a capital C and then put the name of the categorical variable in

parentheses after the capital C.

In this example,

we want to compare the Hispanic group to the three other ethnicity race groups.

So this will be our reference group.

If you remember, our ethnicity race variable is coded 0 for Hispanic.

The default in

Python is reference group coding, which in Python is called treatment coding.

And the default reference category is the lowest-coded group, here the one with a value equal to 0,

which is Hispanic in this case.

Since this is what we're looking for

in this example, we do not need to add any code to change the default.

If we hadn't added a capital C with the ETHRACE variable in parentheses, Python

would have assumed that our ethnicity race variable was a quantitative variable, so

the regression coefficient would make no sense.
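
As a rough sketch of the difference the capital C makes, the code below fits the same model twice on synthetic data. The data frame here is randomly generated and only borrows the variable names from the lecture, so the numbers mean nothing; what matters is how many coefficients each model estimates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption for illustration, not NESARC).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "ETHRACE": np.arange(n) % 4,            # codes 0, 1, 2, 3 in equal numbers
    "age_c": rng.normal(0, 10, size=n),     # centered age
})
df["NDsymptoms"] = 2 + 0.02 * df["age_c"] + rng.normal(0, 1, size=n)

# With C(): three comparison coefficients, none for the reference group (code 0).
categorical = smf.ols("NDsymptoms ~ age_c + C(ETHRACE)", data=df).fit()

# Without C(): the codes 0-3 are treated as a quantity, giving a single slope.
numeric = smf.ols("NDsymptoms ~ age_c + ETHRACE", data=df).fit()

print(list(categorical.params.index))
print(list(numeric.params.index))
```

The single slope in the second model would claim that moving from code 0 to 1, 1 to 2, and 2 to 3 each changes the response by the same amount, which is not a meaningful statement for unordered group labels.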

Here's the output.

Basically it is the same output that we see with the smf.ols function.

But, if we look at our table of parameter estimates, we see that there are three

regression coefficients for our categorical ethnicity race variable.

Note that there is no estimate for the Hispanic reference group.

The T. and the number after it tell us that the treatment,

that is, reference group, parameterization was used, and

the number is the categorical variable code for the group.

For example, the non-Hispanic white group in our ETHRACE variable was coded 1.

So the T.1 indicates that it is the regression coefficient for the comparison

of the non-Hispanic White ethnic race group to our Hispanic reference group.

The three regression coefficients compare each of our ethnicity race groups

to the Hispanic group.
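
One way to check our reading of these coefficients is a small sketch with the categorical variable as the only explanatory variable. In that special case, the intercept is the mean of the reference group, and each T coefficient is the difference between a comparison group's mean and the reference group's mean. The data below are synthetic and used only for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption for illustration, not NESARC).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({"ETHRACE": np.arange(n) % 4})
df["NDsymptoms"] = rng.poisson(2, size=n).astype(float)

model = smf.ols("NDsymptoms ~ C(ETHRACE)", data=df).fit()
group_means = df.groupby("ETHRACE")["NDsymptoms"].mean()

# Coefficient for group 1 versus the reference group (code 0).
coef_1 = model.params["C(ETHRACE)[T.1]"]
diff_1_vs_0 = group_means.loc[1] - group_means.loc[0]

print(round(coef_1, 6), round(diff_1_vs_0, 6))
```

Once other explanatory variables are in the model, the coefficients no longer equal the raw differences in means. They are the group differences after adjusting for those other variables.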

We can see that none of these three groups were significantly different from

the Hispanic group in number of nicotine dependence symptoms

because the p values all exceed our alpha level of .05.

As with the previous regression analysis, we see that lifetime major depression and

number of cigarettes smoked are positively associated with number of nicotine

dependence symptoms.

If we wanted to make other comparisons, for example, to compare non-Hispanic White

to non-Hispanic Black, then we would need to override the default reference group so

that the value of 1 in the ETHRACE variable,

which indicates the non-Hispanic White group, is used as the reference group.

The code here shows how to do it.

It's mostly the same code, but now because we are not using the default,

we need to add some code to tell Python to continue to use the treatment or

reference group coding and designate the reference group.

We do this by adding a comma after the name of our ETHRACE variable in

parentheses.

Then treatment with a capital T.

And then within another set of parentheses, reference=1.

This additional Python code provides a comparison of

the three other ethnicity race groups to the non-Hispanic White group.
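
Here is a minimal sketch of that syntax. It again uses synthetic data that only borrows the lecture's variable names, and it leaves out most of the explanatory variables to keep the example short.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption for illustration, not NESARC).
rng = np.random.default_rng(7)
n = 200
df = pd.DataFrame({
    "ETHRACE": np.arange(n) % 4,
    "age_c": rng.normal(0, 10, size=n),
})
df["NDsymptoms"] = 2 + 0.02 * df["age_c"] + rng.normal(0, 1, size=n)

# Default: the group coded 0 is the reference group.
default_fit = smf.ols("NDsymptoms ~ age_c + C(ETHRACE)", data=df).fit()

# Override: the group coded 1 becomes the reference group.
white_ref_fit = smf.ols(
    "NDsymptoms ~ age_c + C(ETHRACE, Treatment(reference=1))", data=df
).fit()

ethrace_terms = [name for name in white_ref_fit.params.index if "ETHRACE" in name]
print(ethrace_terms)
```

Changing the reference group does not change the fit of the model, only which comparisons are reported. For instance, the T.0 coefficient in the second model is the T.1 coefficient from the first model with its sign reversed, since both describe the same pair of groups.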

Here's the output.

Now the group coded 1 no longer has a parameter estimate, and

the coefficients for T.0, T.2, and T.3 compare each of the other three

racial and ethnic groups to the non-Hispanic White group.

Participants in the non-Hispanic Other ethnic or racial group

had a significantly greater number of nicotine dependence symptoms compared to

non-Hispanic White participants.

There are no significant differences for Hispanic and

non-Hispanic Black participants compared to non-Hispanic White participants.