Hello everyone, this is the supplementary video for this week. I will explain and illustrate the basic ideas and usage of three NGS analysis softwares: BWA, samtools, and GATK.
I will first introduce the sequence compression and alignment algorithm employed by BWA, the read mapping software.
Then I will introduce the common softwares for variant calling and genotyping. Last but not least, I will illustrate how to use BWA, samtools, and GATK from the command line.
The BWT algorithm is the compression algorithm used by BWA. Its compression is lossless. By sorting and transforming the character matrix rotated from the initial string,
this algorithm could achieve a much higher compression ratio.
It will search for identical(matched) regions using reverse-char method. It cannot handle gaps.
Specifically, here is a demonstration of BWT algorithm.
Move up elements of the input sequence, i.e. the first column sequence by one position, and put the original first element at the end of the string. We will then get the second column sequence.
Repeating this procedure again and again, we will get a 7-by-7 matrix displayed at the left of the slide.
Then we treat the rows of this matrix as strings and sort them alphabetically to transform the matrix from the left one to the right one.
Then we take the last column of this matrix to get the sequence below.
Now we can see that the first row in the left matrix is the second row in the right matrix. So using BWT we get an I=2 and a transformed sequence for this sequence.
Next we need to reverse this sequence back to the original sequence.
As described earlier, the left matrix is constructed by repeating moving up elements in the current column by one position and appending the original first element at the end of the column.
This means that the transformed sequence and the original sequence are made up of the same characters
And the first column in the right matrix is generated from sorting the rows of the left matrix. So we can recover the first column by simply sorting the last column of the right matrix. We will call the first column Column F, and the last column Column L.
Why do we need to do this?
The reason is that the transformed (right) matrix has two useful properties. The first property is that, for each character X of the Column L sequence, the character that succeeds X in the input sequence is just the character in the Column F that is in the same row as X is.
The second property is that, for each character that occurs more than once in the input string, if we label them 1,2,3,… by the order they occur in the input string, we will find that these labels have the same order in Column L and Column F.
The first property is relatively easier to understand.
For the second property, please note that, for example, the first G in Column F is also the first G in Column L. In other words, the G at Row 5, Column L is the same G as that at Row 2, Column L.
As we know that the input sequence is the second row of the transformed (right) matrix, the G at Row 2 in Column L will thus be the last character of the input sequence. Now we can recover the input sequence using the two properties just mentioned.
The order by which we recover the input sequence is displayed on the right.
For example, the input sequence is the second row of the transformed matrix. Since the last character of the input sequence is G, which is also the first G in Column L, the position of this particular G in Column F will be Row 5. The preceding character will thus be the character at Row 5 in Column L, which is the second G.
Repeating this procedure again and again, we can recover the input sequence by the order displayed on the right. However this order is reversed.
We have just introduced the BWT algorithm. So how could we apply this algorithm to sequence alignment?
Assume we have another input sequence TACG. Now we need to find subsequences that are identical to this input sequence in the original sequence.
The last character of this input sequence is G. We can see that there are two Gs in Column L, so we will start searching from these two Gs.
First, the G at Row 2. After two steps of searching we will find that it is preceded in the original sequence by G, not C. So we stop here.
Now we try the G at Row 5. Following the characters as described above results in a subsequence that is identical to this input sequence. So we find it.
This is the basic idea of BWA algorithm.
If we get the Column F and Column L of the reference sequence to search, and build indices, we can map the reads very fast.
Let’s try again with another input sequence GGAC.
If you’re good at computing, you will still find a subsequence that matches.
This is because this method of recovering the sequence cannot tell whether the current character is at the start (or the end) of the original sequence. Therefore it cannot tell whether the match crosses the start and end of the original sequence.
Nevertheless this can be easily resolved by introducing a way of discriminating the “start” and “end” when building indices.
As shown above, we add a $ at the end of the original sequence. Let’s see what we will get.
The Column F and Column L become this. Searching for GGAC again will abort at the third step, leaving the problem resolved.
That’s all for the BWA/BWT algorithm.
Next we will introduce some common softwares for variant calling and genotyping.
samtools is one of the early common NGS tools for processing SAM/BAM files. The SAM/BAM files are the result of mapping reads to the reference genome.
This tool has also facilitated the development of variant calling and genotyping algorithms based on BAM files. The program mpileup and bcftools can accomplish these tasks.
Most genotyping algorithms are based on Bayesian statistics.
The diploid genome can have three different genotypes, two different homozygous ones and one heterozygous one.
Considering the base distribution and base quality profile at the locus deduced from the reads, we can infer the genotype by choosing the one that is most likely to exist (i.e. the one whose posterior probability is the largest among the three genotypes).
GATK is a recently developed tools from the 1000Genomes Project. It is often used to call variants from the genome sequencing data of one or more healthy people.
Currently the common variant callers listed on its website are the UnifiedGenotyper and the HaplotypeCaller. We will illustrate the usage of UnifiedGenotyper only, as it is more commonly used.
This is the analysis pipeline for DNA sequencing data recommended by the GATK website.
First, the reads are mapped to the reference genome by BWA or similar tools.
Then we can use the “indel realignment” model to re-map the reads around indels again and more carefully (as the mapping around indels is often poorly performed and raise false positive variants easily).
We can also use the “base recalibration” model to estimate and possible correct the base quality profile given by the sequencing machine.
After these preprocessing steps, we can use the “UnifiedGenotyper” model (or others) to call the variants and the genotypes.
After filtering the variants with respect to their qualities, we can start the downstream analyses such as the variant annotation which might be introduced next week.
That’s the slides for this supplementary video. Now let’s see how to use them from the command-line.