Traditionally, biologists worked at a bench
and investigated individual biological entities of interest.
Today, we have something called high-throughput biology.
And the way this works is that,
on a single gene chip,
you can have thousands of genes,
all of which can be measured at the same time in parallel.
And so what you're going to do
is take a few samples of one type of tissue,
let's say healthy tissue, and a few samples of,
let's say, diseased tissue.
And you just measure the level of expression of every gene in each sample.
Comparing the levels of gene expression in the two tissues,
you can look for genes where there
is a substantial difference in the expression level
between the healthy tissue and the diseased tissue.
Those genes may then have something to do with that disease process,
and you could investigate them further.
This kind of high-throughput biology has been extremely effective in giving us clues
to many disease processes that we didn't understand before.
However, from a statistical perspective,
there are some issues we should think about.
When you want to compare samples from two populations,
the way that you do this usually is by means of a significance test, such as a t-test,
which gives you a p-value.
And let's say that you take a p-value threshold of 0.05.
What this says is
that if there were no real difference between the two populations,
a difference in expression levels as large as the one you're seeing
would show up purely by chance only 5% of the time.
Now, this holds for any one single gene
whose expression levels you compare
against the p-value threshold that we just talked about.
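As a rough illustration of what a p-value for one gene means, here is a minimal sketch in Python. It uses an exact permutation test, which is a stand-in chosen because it needs only the standard library, not necessarily the test a real analysis would use. The function name `permutation_p_value` and the expression values are made up for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    Answers: if the group labels carried no information, how often would a
    relabelling give a mean difference at least as large as the observed one?
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))

    extreme = 0
    total = 0
    # Enumerate every way of assigning n_a of the pooled samples to group A.
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance guards against floating-point round-off.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical expression levels for one gene (illustrative numbers only).
healthy = [1, 2, 3, 4, 5]
diseased = [6, 7, 8, 9, 10]
print(permutation_p_value(healthy, diseased))
```

A small p-value here says that a difference this large would rarely arise if the healthy and diseased labels were interchangeable. It does not say how likely it is that the gene is truly involved in the disease.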
But now we do this for 25,000 genes.
If, for any one single gene,
we have a 5% chance that the difference we're seeing is purely due to chance,
then if you look at 20 genes,
one of them on average is going to have a difference purely due to chance.
And if you look at 20,000 genes,
a thousand of them on average are going to
have a difference in expression purely due to chance.
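The arithmetic above can be checked with a short Python sketch. It assumes that every gene is tested independently at a 0.05 threshold and that none of the genes has a real difference, so each p-value is uniformly distributed between 0 and 1. The function names and the random seed are illustrative choices.

```python
import random


def expected_false_positives(n_genes, alpha=0.05):
    """Average number of genes that pass the threshold by chance alone."""
    return n_genes * alpha


def prob_at_least_one_false_positive(n_genes, alpha=0.05):
    """Chance that at least one of n_genes independent tests passes by chance."""
    return 1 - (1 - alpha) ** n_genes


def simulate_false_positives(n_genes, alpha=0.05, seed=0):
    """Draw one null p-value per gene and count how many fall below alpha.

    Assumption: with no real difference, a p-value is uniform on [0, 1].
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_genes))


for n in (20, 20_000):
    print(n, expected_false_positives(n), simulate_false_positives(n))
print(prob_at_least_one_false_positive(20))
```

The simulated count varies from seed to seed, but for 20,000 genes it stays close to the expected one thousand.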
The actual number of genes that are involved in
the disease process may be only a few dozen.
And so we have a few dozen genes of interest where the expression level is different
because there is a real biological difference,
and possibly a few hundred,
maybe a thousand, genes where the expression levels are different purely due to chance.
In other words, you've got some information there,
but you've also got a lot of noise.
And separating this information from this noise requires
careful downstream analysis, as opposed to
taking what one sees as the gospel truth
simply because it meets some p-value criterion.
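One common kind of downstream analysis is a multiple-testing correction. The lecture does not name a specific method, so the sketch below is only an assumption about what such a step could look like: it shows the Bonferroni correction and the Benjamini-Hochberg procedure in plain Python. The function names and the p-values in the example are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of tests that pass after dividing alpha by the number of tests."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests kept by the Benjamini-Hochberg step-up procedure.

    Sort the p-values, find the largest rank k with p_(k) <= (k / m) * fdr,
    and keep the k tests with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    return sorted(order[:max_k])


# Hypothetical p-values for ten genes (illustrative numbers only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print([i for i, p in enumerate(p_values) if p <= 0.05])  # uncorrected threshold
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni limits the chance of even one false positive and is therefore strict. Benjamini-Hochberg instead limits the expected fraction of false positives among the genes that are kept, which is often a better fit when thousands of genes are screened at once.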