Statistical interaction describes a relationship between two variables that is dependent upon, or moderated by, a third variable. For instance, do you prefer ketchup or soy sauce? Obviously, your answer depends on what food you're eating. If you're eating sushi, you probably prefer soy sauce. If you're having a burger and fries, you're probably going to want ketchup. In this case, the third variable is referred to as the moderating variable, or simply the moderator. The effect of a moderating variable is often characterized statistically as an interaction: a third variable that affects the direction and/or strength of the relationship between your explanatory (x) variable and your response (y) variable.

What if the population we're studying has different subgroups? Could it be that, like the soy sauce and ketchup example, different subgroups have a moderating effect on our association of interest? To explore this idea, we're going to use a hypothetical study and some made-up data. In our imaginary study, we're looking at two diets and their effects on weight loss. Diet A is a low-carbohydrate plan, and Diet B is a low-fat plan. Our hypothetical study also recorded which exercise program participants chose: cardiovascular exercise or weight training. Our variables of interest are diet and weight loss. We've added this third variable, exercise program, to help us understand moderation, or statistical interaction.

What's the association between diet plan (A or B), our explanatory variable, and weight loss, our quantitative response variable? This table shows our hypothetical data on diet, weight loss, and exercise program. Since we have a categorical explanatory variable, diet plan A or B, and a quantitative response variable, weight loss, we will of course need to use analysis of variance to evaluate the association.

This model Python syntax should look familiar to you: I name my model, include the equal sign, and call the ols function from the statsmodels formula API library. Within parentheses, I write my formula, including the name of my quantitative response variable, followed by a tilde, and then the name of my categorical explanatory variable. I indicate to Python that this is a categorical variable by adding a capital C and putting the variable name within parentheses. Then I print the model results using the summary function. In this diet and exercise example, the syntax would look like this.

The resulting output from my analysis is shown here. As you can see, weight loss is our response, or dependent, variable, and there are 40 observations in the data set. The F value is 12, and it's associated with a significant p-value, that is, a p-value less than 0.05. This tells us that there is an association between diet type and weight loss. To understand that association, we need to look at output from the groupby function. Here, I create a new data frame with the variables of interest and request means and standard deviations for weight loss by type of diet. As you can see, the average one-month weight loss for Diet A is about 14.7 pounds, and the average one-month weight loss for Diet B is about 9.3 pounds. In conjunction with the significant p-value, we can say that diet plan A is associated with significantly greater weight loss than diet plan B. Here we show the finding graphically as a bar chart, with diet, the explanatory variable, on the x-axis, and mean weight loss, our response variable, on the y-axis.

What about a third variable, exercise program?
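A minimal sketch of this workflow is shown below. The column names (WeightLoss, Diet, Exercise) and the file name are placeholders I'm assuming for illustration; they are not taken from the actual hypothetical data set.

```python
import pandas as pd
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data: 'WeightLoss' (quantitative response),
# 'Diet' ('A' or 'B', categorical explanatory), 'Exercise' (potential moderator)
data = pd.read_csv('diet_exercise.csv')  # assumed file name

# ANOVA via OLS: quantitative response ~ categorical explanatory variable
model1 = smf.ols(formula='WeightLoss ~ C(Diet)', data=data).fit()
print(model1.summary())

# Means and standard deviations for weight loss by type of diet
sub1 = data[['WeightLoss', 'Diet']]
print('means for WeightLoss by Diet')
print(sub1.groupby('Diet').mean())
print('standard deviations for WeightLoss by Diet')
print(sub1.groupby('Diet').std())

# Bar chart: diet on the x-axis, mean weight loss on the y-axis
sns.barplot(x='Diet', y='WeightLoss', data=data)
plt.xlabel('Diet plan')
plt.ylabel('Mean weight loss (pounds, one month)')
plt.show()
```

The summary output reports the F statistic and its p-value, and the groupby output gives the subgroup means that let us interpret the direction of the association.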
Would we get the same results in terms of the association between diet and weight loss for those participants using cardio, and those participants using weight training?
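One way to explore that question, continuing with the assumed column names from the sketch above, is to split the data by exercise program and rerun the same model within each subgroup:

```python
import statsmodels.formula.api as smf

# 'data' is the hypothetical DataFrame loaded in the previous sketch
for exercise_type, subgroup in data.groupby('Exercise'):
    # Same ANOVA model, fit separately within each exercise subgroup
    model = smf.ols(formula='WeightLoss ~ C(Diet)', data=subgroup).fit()
    print('association between diet and weight loss for', exercise_type)
    print(model.summary())
    print(subgroup.groupby('Diet')['WeightLoss'].mean())
```

If the diet effect looks different across the two subgroups, that would suggest exercise program moderates the association.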