Assistant Professor, Principal Investigator, Center for Bioinformatics, School of Life Sciences

Liping Wei 魏丽萍, Ph.D.

Professor, Director, Center for Bioinformatics, School of Life Sciences

Hello everyone. This is the supplemental video.

The main purpose of this video is to clarify some concepts and to offer some advice on applying them.

First, I will clarify the concepts of homology and similarity.

Then, I will talk about the similarity matrix, which is often used as the scoring matrix in sequence alignment.

And in the end, I will introduce the dot matrix, which is a visual sketch of a sequence alignment.

In biology, homology means two or more things have a common ancestor.

At the gene or sequence level, especially in phylogenetic studies, homology is sometimes classified into two types: orthology and paralogy.

Orthology indicates that two sequences in different species were actually the same sequence in their common ancestor in the past.

The separation occurred due to a speciation event.

Paralogy usually indicates that two sequences in one species came from the same ancestral sequence,

although sometimes paralogy also involves two or more species. The two sequences are formed by a (sequence) duplication event.

This is a schematic diagram of the concepts of orthology and paralogy.

Orthology comes from speciation events,

such as the animal-fungi speciation and human-worm speciation in the diagram.

If the gene is not lost during evolution, these speciation events will leave each descendant with at least one copy of the ancestral gene. For example, HB and WB in the diagram are orthologous.

Paralogy comes from duplication events, especially the duplication in the genome within a species.

For instance, in the diagram, the A-B gene subfamily comes from a duplication event in its ancestral species.

In addition, this gene family underwent two further duplication events in human and one in worm,

which have produced paralogs such as HA1-HA3 and WA1-WA2.

Now let's talk about similarity and identity. They are two different concepts.

Let's take amino acids as an example.

They can be divided into several groups based on their chemical properties, such as alkalinity/acidity, hydrophobicity, or aromaticity.

The amino acids within the same groups can be regarded as being similar to each other.

In contrast, identity means an exactly identical relationship. For example, A is the same as A, and G is the same as G.
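To make the difference between identity and similarity concrete, here is a minimal sketch in Python. The grouping of amino acids below is a hypothetical one chosen only for illustration (it is not taken from the slides), and the function assumes the two sequences are already aligned, gap-free, and of equal length.

```python
# Hypothetical grouping by chemical property, for illustration only.
# Real alignment tools use a scoring matrix rather than fixed groups.
GROUPS = [
    set("AVLIMC"),  # hydrophobic
    set("FWY"),     # aromatic
    set("KRH"),     # basic
    set("DE"),      # acidic
    set("STNQ"),    # polar, uncharged
    set("GP"),      # special cases
]


def same_group(a, b):
    """True if both residues fall into the same illustrative group."""
    return any(a in group and b in group for group in GROUPS)


def identity_and_similarity(seq1, seq2):
    """Return (fraction identical, fraction similar) for two aligned,
    gap-free sequences of equal length. Identical residues also count
    as similar."""
    if not seq1 or len(seq1) != len(seq2):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(zip(seq1, seq2))
    identical = sum(a == b for a, b in pairs)
    similar = sum(a == b or same_group(a, b) for a, b in pairs)
    return identical / len(pairs), similar / len(pairs)


# Only the first position is identical, but every position is similar.
print(identity_and_similarity("AKDF", "AREW"))  # (0.25, 1.0)
```

In this toy case the two sequences are 25% identical but 100% similar, which is why the two terms should not be used interchangeably.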

Homology and similarity are closely related in biology.

We call two sequences homologs if they share a common ancestral sequence.

If the evolutionary time is short, the differences will be small, and the two homologous sequences often show high similarity.

On the other hand, if the evolutionary time is long, there will be more differences, and their similarity will be low, sometimes even very hard to detect.

Homology often brings similarity.

Also, it is easy to get the sequence and measure the similarity. Therefore, we usually try to infer homology based on sequence similarity.

This idea is usually effective and feasible,

but it does not ensure 100% accuracy, as convergent evolution may sometimes bring about similarity as well.

Nevertheless, convergent evolution seems to be rarely observed at the sequence level.

We have clarified the concepts about homology and similarity.

The next question for bioinformatics is how to make a computer search for homology or similarity.

First, the computer needs a quantitative measure of similarity,

which turns out to be the similarity matrix, or the so-called scoring matrix for alignment.

For nucleotides, DNA and RNA each have only 4 types.

Therefore, we often use as the scoring matrix a simple identity matrix,

which has the same positive score only on the diagonal [and 0 for all other elements].
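As a minimal sketch of such an identity matrix, the Python below builds one for DNA and uses it to score an ungapped alignment. The function names and the default match score of 1 are assumptions made for this illustration.

```python
BASES = "ACGT"


def identity_matrix(match=1, mismatch=0):
    """Build a nucleotide scoring matrix as a dict keyed by (base, base).
    Only the diagonal (a == b) gets the match score."""
    return {(a, b): (match if a == b else mismatch) for a in BASES for b in BASES}


def score_ungapped(seq1, seq2, matrix):
    """Sum the matrix scores column by column for two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


matrix = identity_matrix()
print(score_ungapped("ACGT", "ACCT", matrix))  # 3: three of the four columns match
```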

In phylogeny reconstruction, researchers usually use more complicated substitution models

to better describe how the bases change during evolution.

For instance, we generally think that changes within pyrimidines or within purines are more frequent than changes between the two classes.

In other words, transitions happen more readily than transversions.
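The distinction can be sketched in a few lines of Python. The penalty values below are hypothetical and only illustrate the idea that a transition is penalized less than a transversion; they are not taken from any published substitution model.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def classify_substitution(a, b):
    """Classify a pair of DNA bases as identity, transition, or transversion."""
    if a not in "ACGT" or b not in "ACGT":
        raise ValueError("expected DNA bases A, C, G, or T")
    if a == b:
        return "identity"
    within_purines = a in PURINES and b in PURINES
    within_pyrimidines = a in PYRIMIDINES and b in PYRIMIDINES
    return "transition" if within_purines or within_pyrimidines else "transversion"


def ts_tv_score(a, b, match=1, transition=-1, transversion=-2):
    """Score a base pair, penalizing transversions more (hypothetical values)."""
    kind = classify_substitution(a, b)
    return {"identity": match, "transition": transition, "transversion": transversion}[kind]


print(classify_substitution("A", "G"))  # transition (purine to purine)
print(classify_substitution("A", "C"))  # transversion (purine to pyrimidine)
```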

For amino acids, although their biochemical properties provide some qualitative similarity,

the widely accepted quantitative similarity matrices come from an evolutionary idea,

which uses the observed frequencies of amino acid substitutions in known multiple sequence alignments.

In 1978, Dr. Margaret Dayhoff published the PAM matrix.

She manually aligned several groups of sequences with less than 1% sequence difference,

counted the frequency of each type of amino acid change, and constructed a matrix called PAM 1.

Then, she applied the ideas and assumptions of the Markov chain (which will be introduced in later lectures),

regarded PAM 1 as one transition step in evolution, multiplied the PAM 1 matrix by itself repeatedly, and obtained several scoring matrices,

such as the commonly used PAM 30 and PAM 70, for alignment between sequences with more differences.

Later, in 1992,

Dr. Steven Henikoff and Dr. Jorja Henikoff did similar work based on many more multiple sequence alignments.

They focused on some conserved segments in the sequences,

and worked out several BLOSUM matrices.

The most commonly used scoring matrix for protein alignment nowadays may be the BLOSUM 62 matrix.

For the PAM matrix, why can we multiply PAM 1 by itself to get scoring matrices for aligning sequences with more differences?

This follows from the ideas and assumptions of the Markov chain.

We self-multiply the transition probability matrix, which will be briefly introduced here and may be introduced in detail in later lectures.

In this example on the slide, we only consider three types of amino acids, denoted as A, B, and C.

If the transition probabilities for one evolutionary step (1% sequence difference) are shown in the table,

how do we calculate the transition probabilities for two steps?

Think about the transition probability for two steps starting from A and ending at A.

It is actually equal to the sum, over the intermediate states A, B, and C, of the probability of taking one step from A to that state

multiplied by the probability of taking one step from that state back to A.

Similarly, the transition probability for two steps from A to B

is equal to the sum, over the intermediate states A, B, and C, of the probability of taking one step from A to that state

multiplied by the probability of taking one step from that state to B.
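Written as formulas, the two cases just described look like this. The notation is an assumption introduced here for illustration: P(X → Y) denotes the one-step transition probability from X to Y, and P^(2)(X → Y) the two-step probability.

```latex
\begin{align*}
P^{(2)}(A \to A) &= P(A \to A)\,P(A \to A) + P(A \to B)\,P(B \to A) + P(A \to C)\,P(C \to A) \\
P^{(2)}(A \to B) &= P(A \to A)\,P(A \to B) + P(A \to B)\,P(B \to B) + P(A \to C)\,P(C \to B)
\end{align*}
```

In each term, the first factor is the step out of A and the second factor is the step into the final state.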

Students who are familiar with linear algebra may have spotted the pattern:

this process of multiplying probabilities and summing the products is exactly the definition of matrix multiplication.

That's why we say the square of PAM 1 is PAM 2.

Also, please note that PAM 2 does not mean 2% sequence difference.

Because reverse mutations can occur, the corresponding sequence difference is less than 2%.

Similarly, we can calculate PAM 30, PAM 70, or PAM 250.
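The self-multiplication can be sketched in plain Python. The one-step probabilities below are hypothetical (the values in the slide's table are not reproduced in this transcript); they are chosen so that each amino acid changes with probability 1% per step. The sketch works with transition probabilities only, not with final alignment scores, and the difference calculation assumes the three amino acids are equally frequent.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(p, steps):
    """Multiply the matrix p by itself to get the `steps`-step matrix."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    result = p
    for _ in range(steps - 1):
        result = mat_mul(result, p)
    return result


def expected_difference(p):
    """Average probability of ending in a different state than the start,
    assuming all states are equally frequent (an assumption of this sketch)."""
    n = len(p)
    return 1 - sum(p[i][i] for i in range(n)) / n


# Hypothetical one-step matrix for amino acids A, B, C: 1% change per step.
P1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

P2 = mat_power(P1, 2)
print(P2[0][0])                 # two-step A to A, approximately 0.98015
print(expected_difference(P1))  # approximately 0.01
print(expected_difference(P2))  # approximately 0.01985, less than 0.02
```

With these toy numbers, two steps give about 1.985% difference rather than 2%, because a small fraction of positions change and then change back. Calling `mat_power(P1, 250)` gives the 250-step analogue in the same way.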