An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science



From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics, Bloomberg School of Public Health

One of the most common causes of false positives in genomics is batch effects, or other confounders. So I'm going to show you a little example of how you remove those in R. First I'm going to set up my plotting parameters the way I'd like, and then I'm going to load the libraries that we need. In this case, it's the sva package and the snpStats package that are the main drivers of most of the analysis we're going to be doing here.

Â So I load those packages, and then I'm going to load this bladder data set.

Â In this case, I have a expression set, a bladder expression set.

Â I'm going to extract the phenotype information and

Â the expression information from it.

Â And so now I have an expression object that has 57 samples in it,

Â and I have some information on the phenotypes for that.

Â And so in this case I have information on the outcome and I also have information

Â on a batch variable and then whether it's a cancer sample or not.

Â And so the first thing that you might want to do if you want to adjust for

Â you want to deal with the batch effect is if it's labeled like this, if you actually

Â have the batch variable you might be able to directly adjust for it.

Â So the way that you could do that is you could build a model matrix for

Â the cancer variable.

Â And then you could actually just add in another one, the batch variable,

Â say where you want the data to come from, that's from this phenotype data.

Â So now I've got this model matrix, which I can then fit with lm.fit.

Â We learned in this last lecture and so now

Â I have these coefficients that have been adjusted for batch effects, basically.

Â I've adjusted for them directly with the lm.fit function.

Â So I can now look at the coefficients for

Â the variable that I care about across all of the different samples.

Â And so that works out pretty well the one thing that you have to be a little bit

Â careful of is if there's a tight correlation between the batch variable and

Â the outcome variable that you care about.

Â So ideally you'll have balanced within each batch you'll have

Â observations from multiple different groups.

Â In this case it's a little bit tricky, because for

Â example say the fourth batch only has biopsy samples.

Â So fortunately they're not perfectly confounded in the sense that

Â in the fifth batch you actually had some biopsy and some cancer samples.

Â And similarly you have some comparison between cancer and

Â normal within the second batch so you have some information to gain there.

Â But if these are perfectly correlated, in other words if all the cancer samples

Â are run in batch 1 and all the normal samples are run in batch 2.

Â You can't use this,

Â any of the adjustment methods I'm going to be talking about here.

Â So you have to be very careful about that at the experimental design stage.

Â Another way that you can adjust for batch effects if you want to,

Â if you actually know them if you have them in advance is to use Combat.

Â Which is basically is the similar approach to direct linear adjustment but

Â it basically does some of the shrinkage we talked about.

Â So Ttat can have some advantages when you have especially small sample sizes.

Â And so, here I have to build a model matrix for the combat function.

Â And so, the model matrix in this case,

Â I'm going to tell it has nothing but an intercept term.

Â And then I have the model that I'm actually going to test.

Â Which is the model that has the cancer term in it.

Â So then what I can do is when I'm running combat,

Â I can basically run combat to get a cleaned expression data set.

Â So I do combat and then I tell it what data to look for

Â in this case it's the expression data and I tell it what the batch variable is and

Â then I tell it is there anything else that it should be adjusting for.

Â And then, it basically fits

Â a set of models and you can have it either plots or output, or not.

Â And so, what will end up happening, though, is that it's going to fit

Â this model, and it's going basically regress out the badge effect.

Â And so, then you can go through, and you can do lm.fit on the cleaned data.

Â So you basically get the combat fit on this data and

Â then you can fit the lm.fit model with our cancer model

Â matrix now with the data set that was cleaned with combat.

Â So now I have a combat fit coefficients which I can look at as well.

Â So I can look at the coefficients from this.

Â So I can see here that they seem to be a little bit smaller than the coefficients

Â that I got when I fit the direct linear adjustment, and

Â in fact, that tends to be the case.

Â So here I'm going to make a plot It has a few parameters so I'm not going to

Â type the whole thing here but I'm going to plop the coefficients from the original

Â linear model fit versus the combat model fit.

Â So if I do that I can actually add the 45 degree line.

Â Actually, here's what I'm going to do I'm going to make it a one by one plot and

Â then I'm going to make that plot and add the 45 degree line.

Â And so you can see for

Â example that the combat values tend to be a little bit smaller, so the y axis is

Â a little bit smaller than the x axis here because we've done that shrinkage.

Â So the other thing that can happen is imagine you either don't know that batch

Â variable, or imagine that batch variable doesn't encompass all the different

Â potential technical artifacts, then you might want to infer the batch variables,

Â and you can do that with the SVA package SVA function.

Â So here I'm going to build that model matrix with the cancered outcome that I

Â care about.

Â And then I'm going to build the model that I'm going to compare that to.

Â So, in this case, I'm going to compare it to a model with no covariants in it so

Â it's the null model.

Â And then what I can do is actually create some covariants by

Â running the SVA function on this data set.

Â So I pasted the expression data set, the model for including the variable I care

Â about, the model not including the variable I care about.

Â And then I tell it how many surrogate variables, or

Â how many surrogate batch effects it should estimate.

Â So then it's going to go through and

Â estimate those batch effects through an iterative procedure.

Â So now what I can do is I can actually see that the SVA object that's returned has

Â this SV component, which is basically new co-variants that have been created.

Â That are the potential batch effects, and so for example, I can plot,

Â or I can correlate those batch effects with the actual observed

Â batch variable, and I can see that our estimated surrogate variable is,

Â this second one is super highly correlated with the batch file.

Â The first one isn't necessarily as correlated with the batch variable

Â as the first one.

Â So then I can make a plot of that second surrogate variable

Â versus the batch variable.

Â And I can see that there's like a relationship between the inferred

Â surrogate variable, the inferred batch effect, and the observed batch effect.

Â And so then I can actually plot these

Â data points on top of that to see if that's true.

Â So what we've done here with the SVA is not necessarily actually cleaning

Â the data set.

Â We've just identified new covariates that we now need to include in our model fit.

Â So we can combine them with our original model that had the cancer term in it.

Â And so we end up with a model fit that includes both

Â our estimated surrogate variables as well as our cancer status variables.

Â And so then I can use lm.fit or any of the other methods we talked about in

Â the previous lecture to get the model fit after you adjust for those variables.

Â So now what I can do is I can see what the model fits look like

Â comparing SVA to combat for example.

Â Here, I'm plotting SVA versus combat.

Â So on the x axis is going to be the SVA adjusted values,

Â on the y axis is the combat adjusted values.

Â So you can see that they're actually pretty highly correlated with each other.

Â And then you can do the same thing, for

Â the SVA versus the just the linear model adjusted without the combat adjustment.

Â And you can see here, that these aren't necessarily shrunken and so

Â they are maybe a little bit more correlated without the shrinking.

Â So that's basically how you can run SVA on

Â the data set to infer the batch effects if you don't know what they are.

Â Coursera provides universal access to the worldâ€™s best education,
partnering with top universities and organizations to offer courses online.