So this lecture's about why you should care about statistics.

As Steven already told you,

genomic data science consists of three different components.

There's biology, computer science, and statistics.

When people talk about genomic data science,

they often think about biology and computer science,

and I think statistics often ends up being the third wheel.

And so this lecture's to hopefully motivate you as to why statistics is

a very important component of genomic data science.

This is a really exciting result that came out in the journal Nature Medicine.

And so, the results suggest that it's possible to take genomic measurements and

predict which chemotherapies are going to work for which people.

This is an incredibly exciting result in genomic data science,

because it was sort of the holy grail, using genomic measurements to personalize

therapy, and in particular to personalize therapy for cancer.

And so, everybody was very excited about this, and people at

institutions all over the world tried to go back and reproduce that result.

And so, one of those groups was at MD Anderson Cancer Center.

So, that group of people consisted of two statisticians, Keith Baggerly and

Kevin Coombes.

And those statisticians tried to chase down all of the details and

reperform the analysis.

They did this because their collaborators were really excited about it and

actually wanted to use it at MD Anderson in order to tailor therapy.

But it turned out that there were all sorts of problems with the analysis, and

they had trouble getting a hold of the data.

And so because of these problems,

they were actually unable to reproduce most of the analysis.

And this ended up being a huge scandal in the world of genomic data science, because

this very high profile result, this result that everybody was sort of chasing after,

turned out not to hold up once all the details were checked.

So this is actually an ongoing saga.

It actually started off as a sort of discussion between the statisticians

at MD Anderson and the group at Duke that actually performed the original analysis.

And over time, they had a large set of interactions where they were trying to

settle on the details of how the analysis was performed.

It turned out that due to some lack of transparency by the people who

did the original analysis,

clinical trials actually got started using this technology.

They were assigning chemotherapy to people using sort of an incorrect data analysis,

and it was because the statistics weren't actually really well worked out.

This is so

serious that now there are ongoing lawsuits between some of the people that

were involved in those clinical trials who had been assigned therapy and

Duke, the institution that was actually behind the creation of these signatures.

So overlooking why statistics should be part of the genomic data science pipeline

caused a major issue, so big that lawsuits actually resulted.

This actually spurred an Institute of Medicine report.

So this Institute of Medicine report laid out a whole

new set of standards by which people should develop genomic data technologies.

And much of this report focused on statistical issues, reproducibility,

how to build statistical models,

how to lock those statistical models down, and so forth.

And so, the first thing that

I hope to convince you of is that we should care about statistics.

And I've just got a couple of silly examples here.

This is actually from a published abstract of a paper.

And in the abstract,

you can see where I've highlighted, that it says, insert statistical method here.

So, the authors of this paper cared so little about the statistical analysis

that they left a generic statement about what statistical method they were using.

So this sort of suggests the relative ranking of where statistics

falls in people's minds when they're thinking about genomic data science.

And that sort of issue can cause major problems like we saw with

the Potti scandal.

So this is actually also not just in genomics,

it's actually a more general problem.

So this is actually from a flyer from Berkeley, and so they talk about all

the different areas that are sort of applying data science these days.

And if you notice, statistics is listed, but there's actually no application area.

And so this sort of, again, suggests that people think of statistics not necessarily

as something that's important for data science.

And that sort of lack of statistical thinking is a major contributor to

problems in genomic data analysis, both at the level of major projects,

but also at the level of individual investigators.

And so the question is, how do we sort of change this perspective and

how do we make sure that people care and know that caring about

statistics is just as important as caring about the biology or

the computer science when doing genomic data science?