0:02

Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods".

I am Ge Gao from the Center for Bioinformatics, Peking University.

Let's continue this week's topic: Markov Model.

In the last two units, we introduced Markov Chain and the application

of Hidden Markov Model (HMM) in sequence alignment.

In this unit, we’ll illustrate the application of Hidden Markov Model (HMM)

in prediction and feature recognition.

As mentioned before, there is a separation of state and observation in Hidden Markov Model.

Thus an observed token sequence might arise from several different state paths.

Each state has its own emission probability distribution.

From each state we can get a set of observable tokens with different probabilities.

This makes it possible to infer the state paths from the observed token sequences.

Specifically, we can calculate the possibility of every possible state path

emitting the given observed token sequence.

The state path with the highest probability is the path that

is the most likely one to produce the token sequence.

We will illustrate this below.

Let’s start with a simple gene prediction.

We’ll predict the coding region of a segment of genome DNA sequence.

According to the Hidden Markov Model (HMM) introduced last time,

we’ll first distinguish the hidden states that are unobservable from the tokens that are observable.

In this case, we can identify clearly that the observable token sequence

is the genome DNA sequence.

The hidden state is whether the current region is coding or non-coding.

Thus we can draw a state diagram.

First, there are two states: coding and non-coding.

The genome contains both coding and non-coding regions.

Thus, the two states can transit to each other.

Of course, each state can also transit to itself,

implying successive coding or non-coding regions.

Now we have a 2x2 transition matrix, which is alpha_{k,l}.

Next, let’s write down the emission probability, which is e_k(b).

It is very straightforward that whether the region is coding or non-coding,

it could always emit four bases:A, C, G, and T.

So we get another two matrices.

Next, we need a training set to fill in the three matrices with specific values.

Specifically, we need an annotated DNA sequence

where the coding and non-coding regions are correctly delineated.

Generally, the sequence is long enough to provide enough training data.

3:30

Assume that we’ve filled the transition probability matrix

and the emission probability matrices by the analysis of the training set.

Taking advantage of these data,

we could obtain the most possible state path from an unknown genome.

which is just the state path with the highest probability.

Therefore, we need to write down the recursive formula

and the termination equation using the dynamic programming.

Now all the data and the formula are ready. Let's start computing!

Hold on.as the formula implies, we need to do a lot of multiplication.

It is very slow and, when applied in computer,

with more multiplications the value will become so small that it might underflow.

4:33

Therefore, we usually introduce logarithmic computation to replace multiplication with addition.

Specifically, we take the log (with a base of 10) value of transition probability

and emission probabilities in advance.

Okay, let’s start with a sequence!

First, let’s draw the recursive matrix of dynamic programming.

There are two states: coding state c and non-coding state n.

Next, we need to set the boundary condition

which is the default probabilistic distribution of the two states.

After taking log(log10) transformation,

the probabilities turn into -0.097 and -0.699, respectively.

Next, we will fill in the cells one by one.

First, we meet the first token: C.

From the emission probability matrix, we can see that

its emission probability under the non-coding state is -0.523.

Adding it to -0.097 will get -0.62.

Similarly, this number will become

-0.699 + (-0.699) = -1.40

under the coding state.

Next, we need to move forward by one residue, which requires one step of state transition.

From the transition matrix we can see that the transition probability here is -0.097.

Adding it to the emission probability of residue G under the non-coding state,

which is -0.523, we get -1.24.

Similarly, we can compute the [probability of] transiting from the coding state

to the non-coding state, which is -1.40 + (-0.398) + (-0.523) = -2.32.

Please note that this value is smaller than the -1.24

coming from the non-coding state, and will thus be discarded.

Similarly, we can finish the succeeding recursions and fill in all the rest cells one by one.

OK.Next we will do backtracking.

First, pick up the [cell at the end with the] largest probability.

Start from this cell and backtracking step by step...

OK，We can get the final result when I get the backtracking path.

In other words, we divided the input sequence CGAAAAAATCG

into non-coding regions N and coding regions C.

Due to time limitations, we have only created a very simple MSGP.

However it can be easily extended simply by adding more states.

The only limitation here is that the emission probabilities of different states

which are the content of residues here should differ very much.

Only in this case can we infer from the observed token sequences the state path

For example, Chris Burge developed a gene prediction algorithm, GenScan, in 1996.

This algorithm sets different states for exons, introns, and UTRs,

which improves its accuracy of prediction considerably,

making it one of the most successful gene prediction tools.

However, it does not differ much from the MSGP we have just discussed with respect to basic ideas;

it just introduces more states.

Similarly, you can also use similar methods to predict the 5' splicing sites and so on...

The HMM has provided for the data analysis of bioinformatics an effective probabilistic framework

by separating states and observable tokens.

This model is one of the most widely used algorithm models in current bioinformatics research.

We will provide you with supplementary materials

concerning more applications of Hidden Markov Model in bioinformatics.

Thank you, and that's all for this week.See you next week!