A common occurrence in genomic and genetic data is that the outcome is

actually a binary variable rather than a continuous variable.

In that case one option is logistic regression.

So I'm going to illustrate this with a case-control study.

So suppose that you've collected a large number of cases, say a large number of

cases of people with cardiovascular disease and an equal number of controls.

So people that haven't had cardiovascular disease, or

at least not measurably.

And then you genotype them at a bunch of loci.

For one particular locus, imagine that you can either have a C or a T,

then you can build this two by two table that says if you're a C or

a T, and if you're a case or a control.

So then the next thing that you might want to ask is,

are those two variables related to each other?
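
As a rough sketch of that two by two table, here is one way you could tabulate it in Python. The genotype and status lists are toy data invented for illustration, not from any real study, and the function name `two_by_two` is just a name chosen for this example.

```python
from collections import Counter

def two_by_two(genotypes, statuses):
    """Count (genotype, status) pairs into a nested dict: table[genotype][status]."""
    counts = Counter(zip(genotypes, statuses))
    return {
        g: {s: counts[(g, s)] for s in ("case", "control")}
        for g in ("C", "T")
    }

# Toy data, invented purely for illustration (hypothetical people).
genotypes = ["C", "C", "T", "T", "T", "C", "T", "C"]
statuses = ["control", "control", "case", "case",
            "control", "case", "case", "control"]

table = two_by_two(genotypes, statuses)
for genotype, row in table.items():
    print(genotype, row)
```

Each row of the printed table is a genotype and each column is case or control, which is the layout the question about the two variables being related is asked of.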

So one option for

doing this is you could just fit the standard linear regression model.

So you could relate the case control status which is either a 0 or

a 1, to the genotype which is equal to 0 or 1 as well if it's a C or

a T based on this linear regression model.

And then you'd have an error term just like you'd have before.

The problem here is that you can imagine getting a model fit and an error term such

that the fit is outside of the range 0 to 1, even though the variable itself is 0 or 1.

Moreover, you could get any continuous number for

this regression over here on the right, and

you actually only have two possible values on the left-hand side.

So that doesn't make a lot of sense.

So the first step that you could make is to recognize that's not

a continuous variable and instead try to model the probability.

So you could model the probability that you're a case and

you could model that as a function of the genotype.

Here we've now eliminated this error term because we're not modeling this

continuous variable anymore, we're just modeling a probability.

So we have some model here for that probability.

The only problem is that a probability is always between zero and one.

And so, if you fit a linear regression model you might get values that are larger

than one or smaller than zero.

So another thing that you could do is you could take the log of the probability.

So you set the probability that the case status is equal to 1

equal to the variable p, and

you model the log of that probability as a linear function of the genotype.

This works a little bit better because this can have values between

negative infinity and zero which is now more like a continuous variable, and

you'll capture more of it with the regression model.

But you can go even farther, you can actually model the log odds.

So here again we're going to call the probability that you're a case p.

So the odds are p divided by 1 minus p.

So that variable can take on a larger number of values and

the log of that variable can take on any value between minus infinity and infinity.

This now makes sense for a continuous regression model like we have here that's

regressing on the basis of the genotypes.
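
Written out, the sequence of models described so far looks roughly like this. The symbols are assumed notation rather than anything taken from the slides: C_i is the 0/1 case status for person i, G_i is the 0/1 genotype, b_0 and b_1 are the coefficients, and e_i is the error term.

```latex
\begin{align*}
C_i &= b_0 + b_1 G_i + e_i
  && \text{linear regression on the 0/1 outcome} \\
\Pr(C_i = 1) &= b_0 + b_1 G_i
  && \text{left side must lie in } [0, 1] \\
\log \Pr(C_i = 1) &= b_0 + b_1 G_i
  && \text{left side ranges over } (-\infty, 0] \\
\log\!\left( \frac{\Pr(C_i = 1)}{1 - \Pr(C_i = 1)} \right) &= b_0 + b_1 G_i
  && \text{left side ranges over } (-\infty, \infty)
\end{align*}
```

Only the last line has a left-hand side that can take any real value, which is what matches the unrestricted right-hand side.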

But now we have a little bit of difficulty, because we're modelling

a relationship between the log odds and the genotype.

And so what are the coefficients that we're estimating now?

This coefficient is interpreted as the increase in

log odds of case status given a genotype.
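
To make that interpretation concrete, here is a small sketch in Python. It assumes a single 0/1 genotype (C coded 0, T coded 1) and uses made-up counts with equal numbers of cases and controls. With one binary predictor, the fitted logistic model reproduces the observed odds in each genotype group, so the coefficients can be read straight off the two by two table without running a fitting routine. The names `b0` and `b1` are illustrative.

```python
import math

# Hypothetical counts, invented for illustration: 100 cases and 100 controls.
cases_C, controls_C = 40, 60   # genotype C (coded 0)
cases_T, controls_T = 60, 40   # genotype T (coded 1)

# Odds of being a case within each genotype group.
odds_C = cases_C / controls_C
odds_T = cases_T / controls_T

# Intercept: log odds of case status for genotype C.
b0 = math.log(odds_C)
# Genotype coefficient: increase in log odds going from C to T.
b1 = math.log(odds_T) - math.log(odds_C)

print(f"log odds for C: {b0:.3f}")
print(f"increase in log odds for T: {b1:.3f}")
print(f"odds ratio exp(b1): {math.exp(b1):.3f}")
```

Exponentiating the genotype coefficient gives the odds ratio, which is usually the easier number to talk about.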

Let's talk a little about odds and log odds.

They're a little bit tricky to interpret.

So let's start off with a simple example.

Imagine that you can have one of three genotypes.

You can either have two copies of the major allele, one copy of the major allele

and one copy of the minor allele or two copies of the minor allele.

Suppose that in this case,

the phenotype that we're after is whether you died or not.

So suppose that there's a 33% chance you die if you have two copies

of the major allele, 50% chance if you're a heterozygote, and

90% if you're homozygous for the minor allele.

So then what we can do is we can calculate the probability of that phenotype for

each of these different genotypes.

The odds then is the ratio of the probability of death

to the probability of not death.

So for two copies of the major allele, the odds are one to two,

so one over two is the odds of death here.

For the heterozygote, since it's a 50,

50 chance, the odds is actually equal to one, it's just a ratio of one to one.

And in this case, the probability of death is 90%, and

the probability of not dying is 10%.

And so the odds is actually 9 over 1 or 9.

So the odds is a number that can range from zero to basically infinity.

It could be as big as you want depending on what this probability is.

So the log odds is then going to be the log of that number,

it's going to range between minus infinity and infinity.
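
Here is that same arithmetic as a short Python sketch. The genotype labels are just illustrative names, and the 33% is entered as exactly one third so that the odds come out to one to two as described.

```python
import math

def odds(p):
    """Odds of an event with probability p: p divided by (1 - p)."""
    return p / (1 - p)

# Probability of death for each genotype, from the example above.
probs = {
    "major/major": 1 / 3,   # the "33%" case, taken as exactly one third
    "major/minor": 0.5,
    "minor/minor": 0.9,
}

for genotype, p in probs.items():
    o = odds(p)
    print(f"{genotype}: p = {p:.2f}, odds = {o:.2f}, log odds = {math.log(o):+.2f}")
```

The log odds is negative when death is less likely than not, zero at a 50, 50 chance, and positive when death is more likely than not.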

So here's an example of what an odds ratio of two looks like for

a continuous variable.

So on the x axis here, we have the variable x and

it ranges in values from minus three to three.

And then we have the probability of surviving over ten years.

So again, this little value here is the probability of death,

if you have a covariate value of minus three.

And then say at this point,

this is the probability of death if you have a covariate value of zero.

And so what you're seeing here is that the change in the probability

of death associated with this covariate follows this logistic curve.

So you can see that this is the characteristic sort of logistic

curve that you get when you're modeling things on this scale.

And so for an odds ratio of two, you can see that there's a relatively linear decline

here in the middle and then at the ends it flattens out.
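
A curve of this kind can be sketched in a few lines of Python. This assumes an intercept of zero and an odds ratio of two per one-unit increase in x. Both are assumptions for illustration, and the curve on the slide may use a different intercept or run in the opposite direction. The function name `logistic_prob` is made up for this example.

```python
import math

def logistic_prob(x, intercept=0.0, odds_ratio=2.0):
    """Probability implied by a logistic model where each one-unit
    increase in x multiplies the odds by odds_ratio."""
    log_odds = intercept + math.log(odds_ratio) * x
    return 1.0 / (1.0 + math.exp(-log_odds))

# Evaluate over the same range as the x axis, minus three to three.
xs = [-3, -2, -1, 0, 1, 2, 3]
probs = [logistic_prob(x) for x in xs]

for x, p in zip(xs, probs):
    print(f"x = {x:+d}  probability = {p:.3f}")
```

The odds double at every step along x, but the probability changes fastest near the middle of the range and more slowly toward the ends, which is the flattening described above.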

The change in the in a sort of odds adept.