Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods".

I am Ge Gao from the Center for Bioinformatics, Peking University.

Let’s continue our course.

In the last unit, we used the identification of non-coding RNAs as an example to introduce feature selection methods.

In this unit, we will continue focusing on functional annotation

of non-coding RNAs to introduce differential gene expression analysis and clustering.

This unit will involve more statistical knowledge than the previous ones.

Therefore, we have marked the extra materials as “Additional Information” at the upper right corner.

These materials are not required for the exam.

After identifying some ncRNAs, how can we infer their possible biological functions?

First, for miRNAs and other ncRNAs whose mechanisms are clear,

we can infer their biological functions by predicting their targets by base-pairing or other methods implied by the mechanisms.

However, this method will not work for long non-coding RNAs and other non-coding RNAs whose mechanisms are not clear.

Nevertheless, we can still infer their functions from expression correlation,

as genes with correlated expression profiles in expression regulatory networks tend to be functionally related.

Specifically, in practical work, we mainly focus on two types of gene expression correlation:

genes that are differentially expressed under different conditions, and (pairs of) genes that are co-expressed under different conditions.

We will discuss them in detail now.

In the ideal world where experimental error does not matter at all, it is trivial to detect differentially expressed genes.

In the real world, however, the case is much more complicated.

In fact, in real experiments

we always get a distribution, rather than a fixed value, for the measurement due to the existence of random errors.

Therefore, the comparison between gene expression levels

under different conditions is essentially a comparison between two distributions.

In other words, we need to consider not only the mean but also the variance.

For example, we can conclude with confidence from this figure that Gene g has its expression changed across different conditions,

because the smallest value of Gene g under Condition 2 is still larger than the largest value of Gene g under Condition 1.

However, with the mean values fixed, the conclusion might become totally different if we tune the variance a little.

As we can see now, the smallest value of Gene g under Condition 2 is smaller than the largest value under Condition 1.

In other words, the difference d we observed between the means might just be an artifact caused by random errors.

Therefore, we need to use statistical methods to make statistical inference based on probability models.

Specifically, we first need to construct a statistic that takes into account the variance.

We then calculate the p-value for each gene based on the null distribution of this statistic.

Finally, we choose the genes whose p-values are smaller than the given cut-off as genes that are statistically significantly differentially expressed.

For example, the t statistic constructed in the classical t-test can be regarded as

the ratio of the difference between the means of the two distributions

to the standard error of that difference, which is estimated from their standard deviations and sample sizes.

Assuming that the two distributions are both normal, the null distribution of the t statistic is the t distribution.

We can then easily compute the p-value from the corresponding t statistic.
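
As a minimal sketch of how such a statistic takes the variance into account, we can compute a t statistic by hand. The lecture does not say which variant of the t-test is meant, so the unequal-variance (Welch) form is assumed here, and all expression values are made up for illustration. Both pairs of samples have the same means (10 and 12) and differ only in their spread. Converting the statistic into a p-value needs the t distribution and is left out.

```python
import math
import statistics


def welch_t(sample1, sample2):
    """Difference between the two sample means divided by its standard error."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return (mean2 - mean1) / standard_error


# Hypothetical expression values of Gene g under Condition 1 and Condition 2.
tight_1 = [9.8, 10.0, 10.2]
tight_2 = [11.8, 12.0, 12.2]
noisy_1 = [7.0, 10.0, 13.0]
noisy_2 = [9.0, 12.0, 15.0]

t_tight = welch_t(tight_1, tight_2)  # small variance: large t
t_noisy = welch_t(noisy_1, noisy_2)  # same means, large variance: small t
print(t_tight, t_noisy)
```

With the same difference d = 2 between the means, the statistic drops from about 12.2 to about 0.8 once the variance is increased, which is why the mean alone is not enough.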

The classical t-test, however, requires not only normal distributions but also enough replicates under each condition.

This is often impractical for RNA-Seq data analysis.

Therefore, other statistics and differential expression methods have been developed to suit the properties of RNA-Seq data,

such as those based on Poisson or negative binomial distributions.

These methods make different assumptions, leading to different null distributions and thus different p-values and calling results.

To make it easier to choose appropriate methods,

Doron Betel et al. have systematically benchmarked the commonly used differential expression analysis tools on multiple datasets.

You can refer to this benchmark if you’re interested. The related papers are also discussed in this week’s student presentation.

Essentially, the p-value is a probabilistic measure of statistical error.

Specifically, we will meet two types of such errors in practice.

The type I errors, or the false positives,

occur when genes that are in fact not differentially expressed are treated as if they are differentially expressed.

The type II errors, or the false negatives,

occur when genes that are in fact differentially expressed are treated as if they are not differentially expressed.

For a given test and sample size, reducing one of the two types of error generally increases the other.

Generally, we use the p-value as the probability that a type I error, i.e. a false positive, occurs in ONE test.

In practice, we often need to run a statistical test for each of multiple genes.

We will then need to deal with the multiple testing issue.

For example, suppose we run a statistical test for each of 20 different genes, and the p-value of each gene is 0.05.

This means that the probability that we make an error in each test is 0.05.

In other words, the probability that we do not make an error in a given test is 1 - 0.05 = 0.95.

According to the multiplication rule, and assuming the tests are independent,

the probability that we do not make any error in 20 consecutive tests is 0.95 to the power of 20, which is about 0.358.

Then the probability that we make an error in at least one of the 20 tests is 1 - 0.358 = 0.642.

In other words, even if the probability that we make an error is 0.05 each time,

the probability that we make an error in at least one test is still larger than 0.5.

This is the multiple testing issue.
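
The arithmetic above can be checked with a few lines of code. This is only a sketch: the function name is ours, and it assumes, as the multiplication rule does, that the tests are independent.

```python
def prob_at_least_one_error(alpha, n_tests):
    """Probability of at least one false positive in n_tests independent tests."""
    prob_no_error = (1.0 - alpha) ** n_tests
    return 1.0 - prob_no_error


fwer_20 = prob_at_least_one_error(0.05, 20)
print(round(fwer_20, 3))  # 0.642, matching the value in the lecture

# The same calculation for larger numbers of tests.
for n_tests in (1, 20, 100, 1000):
    print(n_tests, prob_at_least_one_error(0.05, n_tests))
```

As the number of tests grows, this probability approaches 1, which is why a per-test cut-off of 0.05 is not enough when thousands of genes are tested.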

The simplest way to handle this problem is to set a more stringent cut-off for the p-value.

For example, in the Bonferroni correction the original p-value is multiplied by the number of tests that have been run, which is equivalent to dividing the cut-off by the number of tests.

Therefore, we can consider a gene from all 30000 human genes as being differentially expressed,

only when its original p-value is smaller than 0.05/30000 = 1.67 × 10^-6.

We can then ensure that, even in the worst case, the probability that any false positive occurs is no more than 0.05.
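
A minimal sketch of the Bonferroni correction as described above is given here. The function names and the three raw p-values are hypothetical; only the cut-off 0.05 and the 30000 tests come from the lecture.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test cut-off that keeps the family-wise error rate at most alpha."""
    return alpha / n_tests


def bonferroni_adjust(p_values, n_tests):
    """Multiply each raw p-value by the number of tests, capping at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


n_tests = 30000
cutoff = bonferroni_cutoff(0.05, n_tests)  # about 1.67e-6

# Hypothetical raw p-values for three of the 30000 genes.
raw_p_values = [1e-7, 2e-6, 0.03]
adjusted = bonferroni_adjust(raw_p_values, n_tests)

# Comparing adjusted p-values with 0.05 gives the same decisions
# as comparing raw p-values with the stricter cut-off.
significant = [p < 0.05 for p in adjusted]
print(cutoff, adjusted, significant)
```

Only the first gene passes: a raw p-value of 2e-6 looks very small, but it is still above the corrected cut-off.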

In practice, however, the Bonferroni correction is often too stringent.

Guaranteeing fewer false positives increases the probability that false negatives occur,

reducing the power of the statistical tests.

Also, in practice we care more about how many false positives there are among the genes

that have been marked as being differentially expressed, rather than among all genes under the statistical test.

In other words, we care more about FDR (false discovery rate) than FWER (familywise error rate).

We can handle this by transforming p-values into q-values.

Similar to the p-value, the q-value is also a measure of statistical error.

However, the q-value differs from the p-value in that it measures the false discovery rate.

In other words, given a specific gene g, the q-value measures the expected proportion of false positives

among all genes that are at least as extremely differentially expressed as g.
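
The lecture does not say how q-values are computed. As an illustration only, the sketch below uses the Benjamini-Hochberg procedure, one widely used way to control the FDR; its adjusted p-values are often used in the same way as q-values, although other q-value estimators differ in detail. The function name and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the genes, sorted from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that each adjusted value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
bh = benjamini_hochberg(p_values)
called = [q <= 0.05 for q in bh]
print(bh, called)
```

Here the first two genes are called at an FDR of 0.05. With a Bonferroni cut-off of 0.05/5 = 0.01 the same two genes would pass in this small example, but with thousands of genes the FDR-based approach typically keeps many more.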

Similar to differential expression, the co-expression relationship

under different conditions can also be used to infer gene functions.

Clustering gene expression profiles under different conditions will help quickly locate co-expressed genes.

The distance measure is the core of clustering methods.

Here the “distance measure” quantifies how dissimilar the expression patterns of two genes are: the smaller the distance, the more similar the two genes.

Two commonly used distance measures are the Euclidean distance and the Pearson distance (or the correlation distance).

The Euclidean distance focuses on the expression level, i.e. how similar the expression levels of the two genes are.

The correlation distance focuses on the expression pattern, i.e. how consistently the expression levels of the two genes fluctuate.

Different distance measures can lead to very different results.

For example, when we use the correlation distance (i.e. focusing on the expression pattern),

it is the blue point and the red point that are closest in the figure,

because they fluctuate in basically the same way, while the blue point and the gray point are relatively far apart.

When we turn to the Euclidean distance, however, it is the blue point and the gray point that are closer in the figure,

as their absolute values (i.e. expression levels) are closer to each other.
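
The contrast between the two measures can be reproduced with a small sketch. The three profiles are made-up numbers chosen to mimic the blue, red and gray points of the figure, and the Pearson distance is taken here as 1 minus the Pearson correlation coefficient, which is one common definition and an assumption on our part.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences in expression level."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pearson_distance(x, y):
    """Assumed definition: 1 - r, so 0 means perfectly correlated."""
    return 1.0 - pearson_correlation(x, y)


# Hypothetical expression profiles across four conditions.
blue = [1.0, 2.0, 1.0, 2.0]
red = [11.0, 12.0, 11.0, 12.0]  # same fluctuation as blue, much higher level
gray = [2.0, 1.0, 2.0, 1.0]     # similar level to blue, opposite fluctuation

print(pearson_distance(blue, red), pearson_distance(blue, gray))
print(euclidean_distance(blue, red), euclidean_distance(blue, gray))
```

Under the Pearson distance blue is closest to red, while under the Euclidean distance blue is closest to gray, so a clustering method would group the genes differently depending on the measure.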

In practice, the correlation distance is used more often, as co-expression

usually refers to the trend of variation in expression.

However, when using the Pearson distance, we need to take into account the effect of outliers.

The Pearson distance depends on the covariance computed over all samples.

Therefore, a few extreme outliers can dramatically affect the final result.
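
To see how strong this effect can be, here is a small sketch with made-up data: two genes that are perfectly anti-correlated over five samples, plus one hypothetical extra sample in which both genes show an extreme value.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values of two genes in five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [5.0, 4.0, 3.0, 2.0, 1.0]  # falls exactly as gene_a rises

r_clean = pearson_correlation(gene_a, gene_b)

# One extra sample with an extreme value for both genes.
r_outlier = pearson_correlation(gene_a + [100.0], gene_b + [100.0])

print(r_clean, r_outlier)
```

A single outlier turns a correlation of -1 into one close to +1, so the two genes would wrongly appear to be co-expressed. Inspecting the data for outliers before clustering is therefore worthwhile.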

Here are some summary questions.

You are encouraged to think about them and discuss them with other students and TAs in the online forum.