You've heard about logistic regression in theory.

So how do you do it in practice?

I'm going to talk you through the things to check and

the basics of how to do it using our package of choice,

R. The first thing sounds obvious,

but like many obvious things,

it needs to be said.

First, you need to check whether your outcome is in fact binary.

Just tabulate it, for instance,

by using the table command.
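As a rough sketch of that check, here is what tabulating an outcome might look like. The vector `diabetes` and its values are made up for illustration, including one stray value and one missing value.

```r
# Hypothetical outcome vector: mostly "yes"/"no", plus one data-entry
# error ("maybe") and one missing value. All values are made up.
diabetes <- c("yes", "no", "no", "yes", "no", "maybe", NA, "no")

# useNA = "ifany" also counts missing values, which table() drops by default
outcome_counts <- table(diabetes, useNA = "ifany")
print(outcome_counts)

# Number of distinct non-missing values; a binary outcome should give 2
n_values <- length(unique(na.omit(diabetes)))
print(n_values)
```

If `n_values` comes back as anything other than 2, the table shows which values are responsible and how many patients have them.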

If it has more than two values,

then you've got a problem and a decision to make.

Why does it have more than two?

Maybe it's just a handful of patients with values that have been entered wrongly.

If so, you can safely just exclude the affected patients.

If there are a lot of values,

then you can consider combining two of the values but only if it makes sense to do so.

If it doesn't make sense to combine groups or if you don't want to lose information,

which always happens when you combine categories,

then consider something like ordinal regression,

which is beyond the scope of this course.

In the example in this course, however,

diabetes is a yes-no variable

based on a threshold HbA1c value,

so it will be binary unless HbA1c is missing.
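To make that concrete, here is a minimal sketch of turning an HbA1c measurement into a yes-no outcome. The values and the cut-off of 6.5 are assumptions for illustration only; the threshold used in the course data isn't stated here.

```r
# Made-up HbA1c values in percent; NA marks a patient with no measurement
hba1c <- c(5.2, 6.8, NA, 7.4, 5.9, 6.5)

# Assumed cut-off for illustration only; use the definition from your own data
threshold <- 6.5

# Comparing NA with a number gives NA, so a missing HbA1c
# stays missing in the outcome rather than becoming "no"
diabetes <- ifelse(hba1c >= threshold, "yes", "no")

# Tabulate, showing the missing value as its own category
table(diabetes, useNA = "ifany")
```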

In what's called simple logistic regression,

which I'm going to explain now,

you have just one predictor.

An example of when this is useful is if you want to look at time trends

and test whether there is a significant linear trend in your outcome.

I'll come back to that keyword,

linear, in a minute.

For instance, has the rate of diabetes recently been getting bigger or smaller over time?

To run logistic regression in R,

you need to use the glm command.

As a minimum, you need to tell R what your outcome variable is,

what your predictor or predictors are,

what distribution you want to assume for

the outcome variable and which link function you want.

With glm, you can run other kinds of regression too,

so this is why you have to tell it that the distribution is

binomial, which you do with the family = binomial option.

The link function says how you want to transform the outcome variable,

in order to make the maths work.

So you get an equation whose right-hand side is just the sum of one or more predictors.

The link function that's generally used in logistic regression is the logit.

This means you take the probability of the outcome

happening and turn it into the log odds,

which you came across earlier in the course.
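Here is a small worked example of that transformation, using an arbitrary probability of 0.2. The base R functions `qlogis()` and `plogis()` are the logit and its inverse.

```r
p <- 0.2                  # probability of the outcome (arbitrary example value)
odds <- p / (1 - p)       # odds of the outcome
log_odds <- log(odds)     # the logit of p

# qlogis() computes the logit directly; plogis() turns log odds back
# into a probability
print(all.equal(log_odds, qlogis(p)))
print(all.equal(plogis(log_odds), p))
```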

There are other choices of link function

that are more appropriate if your outcome variable

really represents a continuous one or counts that you've just forced to be either 0 or 1,

but I won't go into them in this course.
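Putting those pieces together, a minimal call might look like the sketch below. The data are simulated purely so the example runs on its own; the variable names `diabetes` and `age` and the numbers used to generate them are assumptions, not the course data.

```r
set.seed(1)
n <- 500

# Simulated predictor: age in whole years
age <- round(runif(n, 20, 80))

# Simulated binary outcome whose log odds rise linearly with age
diabetes <- rbinom(n, size = 1, prob = plogis(-5 + 0.07 * age))

# Outcome on the left of ~, predictor on the right,
# binomial distribution with the logit link
fit <- glm(diabetes ~ age, family = binomial(link = "logit"))

summary(fit)
coef(fit)   # intercept and slope, both on the log odds scale
```

The logit is the default link for the binomial family, so `family = binomial` on its own gives the same model; spelling the link out just makes the choice visible.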

So I will now discuss the predictors.

Your predictors can be categorical or continuous.

If categorical, you do not need an equal number of observations in each category,

but categories with very small numbers can cause problems,

as we'll see later.

If continuous, they do not need to be normally distributed.

That's something a lot of students get wrong.

There is something you need to assume though.

For a continuous variable or one that you are essentially treating as continuous,

for example, year,

you are assuming that the relation between the variable and the outcome is linear.

For example, suppose you want to know how diabetes risk varies with age,

and you have age in whole years rather than in categories.

If you put age into the model as it is,

you are assuming, and R is assuming,

that the log odds of diabetes change by the same amount for every one-unit increase in age,

whether it goes up with age,

goes down with age or is flat.

No curve is allowed.

The relation is linear.

It's important to test that your data fits

this assumption by plotting the data first and then fitting a model.

Don't just hope for the best.

Wrong assumptions matter in statistics as well as in life.
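One way to do that plot is sketched below, again on simulated data with made-up names and numbers. It groups age into 10-year bands, which is an arbitrary choice, and plots the log odds of diabetes in each band, since with the logit link that is the scale on which the model is a straight line.

```r
set.seed(2)
n <- 2000
age <- round(runif(n, 20, 80))
diabetes <- rbinom(n, size = 1, prob = plogis(-5 + 0.07 * age))

# Group ages into 10-year bands so each point is based on enough people
age_band <- cut(age, breaks = seq(20, 80, by = 10), include.lowest = TRUE)

# Proportion with diabetes and mean age within each band
rate <- tapply(diabetes, age_band, mean)
mid_age <- tapply(age, age_band, mean)

# Convert each band's proportion to log odds
log_odds <- qlogis(rate)

# Points lying roughly on a straight line support the linearity assumption
plot(mid_age, log_odds,
     xlab = "Mean age in band (years)",
     ylab = "Log odds of diabetes")
```

A band with no cases, or with everyone a case, gives an infinite log odds, so with sparse data you may need wider bands.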

If the relation on your plot looks rather more curved than straight,

then maybe a line isn't a good approximation.

So in that case,

you will need to try some other shapes,

for instance, by adding a squared term to the model.

If your single predictor is age,

then this would mean including not just a term for age but also a term for age squared

and testing whether that squared term is statistically significant via its p-value.
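A minimal sketch of that model follows, on simulated data where the true relation is in fact linear. The names and numbers are assumptions for illustration.

```r
set.seed(3)
n <- 1000
age <- round(runif(n, 20, 80))
diabetes <- rbinom(n, size = 1, prob = plogis(-5 + 0.07 * age))

# I() tells R to treat age^2 as arithmetic rather than formula syntax
fit_sq <- glm(diabetes ~ age + I(age^2), family = binomial)

# Coefficient table: estimate, standard error, z value and p-value per term
coefs <- summary(fit_sq)$coefficients
print(coefs)

# p-value for the squared term
p_sq <- coefs["I(age^2)", "Pr(>|z|)"]
print(p_sq)
```

A small p-value for the squared term suggests the straight line is not enough; a large one gives no evidence of curvature of that particular shape.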

There are fancy things that you can do,

but those are the basics.

So those are the key elements to fitting

a simple logistic regression model, for instance,

with a binary outcome variable such as diabetes and a single predictor such as age in R,

using the glm command. Why don't you have a go?