In this week's lab, we'll be BLASTing some more, and we'll also be touching on comparative genomics. So when we BLAST, we're identifying similar sequences, and why would we want to do that? Well, similarity is the primary predictor of homology, and homology is the primary computational predictor of gene function. Now there are other kinds of methods that are used to predict gene function such as gene expression data, but really, the cornerstone of bioinformatics in terms of function prediction is homology. So we can use sequence alignments to allow us to identify similar sequences. What this means is that when we take two sequences and align them, what we're doing is we're actually aligning the amino acids that are similar or identical with the other amino acids in the other sequence, such that they are stacked one on top of the other. So the plus signs here denote that the amino acid residues are identical between the sequences, and the dot would indicate that these amino acids share similar properties. You can see that gaps have been inserted to allow the amino acids that are identical or similar to stack on top of one another quite nicely. So in today's lecture, we're going to talk about substitution matrices, which allow us to do sequence alignments. We'll talk about a couple of different alignment methods, the dot matrix alignment method, which is a very simple method. In terms of local alignments, and there's a whole body of literature on sequence alignments, but we're focusing on local alignments today. We'll focus specifically on BLAST, which is a heuristic alignment method for aligning sequences. We'll also talk about the evaluation of significance of sequence alignments, and we'll talk again about comparative genomics. So substitution matrices are scoring systems that model sequence change over evolutionary time, and they are somewhat realistic models of sequence evolution that encompass substitution biases and mutational saturation. In terms of substitution biases, what this means is that when we look at the codon table for amino acids, there are 64 possible codons, and several of these codons encode for the same amino acids. So here leucine is encoded here by six codons, whereas methionine is actually only encoded by one codon, ATG. So this needs to be taken into account in our substitution matrix. The other thing that needs to be taken into account is the fact that some amino acids have similar physico- chemical properties. So these three amino acids here, isoleucine, leucine and valine are aliphatic amino acids, and when you see them at a similar position in protein, you can assume that they're serving a similar function because they share similar properties. The other thing we need to take into account is mutational saturation. So in this cartoon, what we're looking at here on the X-axis is the mutations per 100 residues, and on this axis is the differences per 100 residues. Naively, you would think that as the number of mutations per 100 residues increases, there would be a linear increase in the number of differences per 100 residues. However, a given point, a given nucleotide can change many times, especially at longer evolutionary distances. So the graph actually looks something like this, and especially at longer/ greater evolutionary distances, we sort of enter a twilight zone, where we don't quite know the number of times that a given site has been mutated. So what we need for substitution matrices is a scoring system that model sequence change over evolutionary time, we need to favour identical or related amino acids, we need to penalize poorly matched amino acids or gaps, and we also need to take into consideration the relative abundance of amino acids and proteins. So alanine occurs quite often, you need to take that into account. Two commonly used amino acids in substitution matrices are the PAM - the Point Accepted Mutation - matrix, and the BLOSUM - the Blocks Amino Acid Substitution - Matrix. Both of these are derived from trusted sets of alignments of closely related sequences. In the case of these matrices, what you're seeing here is on this axis here. We've got all the amino acids possible, and then on this axis, we've also got all the possible amino acids. Along the diagonal, we have positive scores that we use actually when we're aligning sequences. That's the same for both the BLOSUM matrix, as well as the PAM matrix. If we zoom in on part of the PAM matrix, we see what I was alluding to before, that certain amino acids such as alanine-alanine, or certain amino acid comparisons, get scored less highly than others. So the cysteine-cysteine comparison actually receives a very high score because cysteines can form disulfide bridges, which are really important for maintaining the structure of proteins. We also see here that, for instance, D-N receives a positive score, whereas others, D-A and a D-R, actually receive scores of zero or minus one. So they're scored negatively if we see them in a sequence alignment. There are different flavours of PAM and BLOSUM matrices, and we use the flavour that's appropriate for the divergence time. So if we're looking for distantly-related sequences, we would employ PAM matrices with a higher number, or BLOSUM matrices with a lower number. If we're searching for sequences which are less divergent, we would use a PAM matrix with a lower number, or a BLOSUM matrix with a higher number. So how can we generate an alignment? Conceptually, dot matrices are the simplest way to think about alignments. What we're doing with a dot matrix alignment is we're taking the one sequence, in our example here, perhaps this sequence, and we're breaking it into small chunks of, say, 10 nucleotides. Then we're asking for each 10 nucleotide chunk in the other sequence, how well does our first chunk match to all of the chunks in the other sequence? If it's above a certain threshold, then we put a dot at that position in our matrix, and we can see in this particular example that each dot follows the other along the diagonal. And that means that these two sequences are pretty much identical. However, the one sequence has a gap in it relative to the other sequence at that position, and conversely, the second sequence has a gap in it relative to the first sequence at this position. So these patterns of diagonals, and gaps in the diagonal can tell us how these sequences, how well related they are, one to another. So if we look at the second example here, and we would see a pattern like this, what we would say is that, "Oh, these two sequences are highly related or are, in fact, identical." But this interesting pattern here comes from the fact that there are actually repeats within the sequence. Here, we have our initial sequence and the match sequence. So this particular comparison, we generate the long diagonal line, whereas these smaller comparisons here, where we're shifting the match relative to the comparison sequence, generate these smaller diagonals here. So these patterns of diagonals can really help us to gain insight into how the one sequence is related to the other. Although we can use this dot matrix method to align sequences, it's not really optimal because no kind of statistical evaluation is possible. It's also exhaustive, and therefore, time consuming. So heuristic methods can allow us to explore a problem such that feedback from each round can be used to guide future analytical direction. We only search a small fraction of the cells in the possible search space, but we still end up looking at all the high-scoring alignments. While they're not guaranteed to find an optimal solution, they are much, much faster. This is really important if we want to search against, let's say, the GenBank non-redundant database containing hundreds of billions of nucleotides. So there are two best known approaches, BLAST, which we'll be talking about today, and FASTA. BLAST stands for Basic Local Alignment Search Tool, and that's what we're focusing on, these local alignments. What it does is it heuristically finds high scoring segment pairs between a query sequence and a target database. The concept here is that true matches very likely contains short stretches of identities, and these short stretches can act as seeds for extending the alignment, and the short seed sequences permit preprocessing of query. So basically, BLAST creates an index that computationally, we can use this index to very quickly identify seeds from which to extend. The trade off here is sensitivity for speed. So the concept in pictographic terms looks like this. We take our query sequence, and we look for these short matches within a database sequence. Then from those short matches, we extend outward from them. At some point, we don't extend any further after a threshold is reached, and then we can join up the HSPs to get our overall alignment. So the first step in BLAST is the list step in which we extract words from a query sequence and make an expanded list of related words. So for amino acids, it might look like something like this. So we would, let's focus on this, HEA here. We can break our query sequence up into three-letter words, for instance. In this case, as I said, we're focusing on HEA, the score for HEA based on the BLOSUM62 matrix for an exact match to HEA, would be 17, and those come from the cells in the matrix. We also have synonyms which can score above the threshold, the threshold word score being 13 in this example. So HDA or HQA, HKA, and so on, all score above the threshold, and all the other three-letter synonyms are unacceptable. So YEA, for instance, would not be considered a synonym because it scores below the threshold word scores. So that's one of the parameters that we can adjust in BLAST, is threshold word score. So we would take our expanded word list, synonyms from the previous slide: HEA, HDA, HQA, and so on. We would search through the sequence database using this index that BLAST has created, for exact matches. So HQA appears here in this particular sequence, HEC appears in this particular sequence, HEV appears in this particular sequence, and so on, and then we can now identify sequences that contain the seeds. So for one particular match sequence, we then align at that particular seeds. So HEA in the case of the query sequence, and then we align it to the HQA, the seed, and then we extend out- ward from that particular seed to identify an HSP. So once we go beyond a certain threshold, if we score here, we would see that, in fact, this falls below our threshold, so we can't really extend the alignment any further. That's what this says here. So this information for an HSP is actually returned in the BLAST scoring. You'll see that in the BLAST output, we've got a score of 42 bits, We've got a certain number of identities, and so on. So BLAST also actually will insert gaps between seeds to allow chunks to be connected together. Again, it's a heuristic, what we do is we start with the seed, and from there, we only search a small part of this search space that seems to be appropriate for inserting gaps. But just keep in mind BLAST does allow gaps when searching. There are different flavours of BLAST programs. There's blastn, blastx, tblastx, tblastn, and blastp. So blastn is used when you're querying a DNA sequence against a DNA database, and what's returned is a DNA alignment, there's one search performed when you use blastn to find homologous DNA sequences. There are a couple of translated BLAST flavours, tblastx will take a query DNA sequence and a DNA database, and it will return a protein alignment. So what has to happen here is that the DNA sequence here needs to be translated in all six possible reading frames, so starting at the first nucleotide, the second nucleotide, and the third nucleotide, and then also going in the other direction from the three prime end of the DNA backwards for translations four, five, and six, and it does the same thing with all of the DNA sequences in the database, and then it aligns the protein sequences, these translations. So in fact, it has to do 36 different searches, and what you can use that for is to find homologous proteins from unannotated DNA query and database sequences. We can use blastx which takes DNA query, translates it into all six possible reading frames, searches against a protein database, and returns a protein alignment, and here we're performing six searches. One for each of the translations, each of the reading frames. The converse of that is tblastn, where we take our DNA database translate it into all possible translations, use all possible reading frames, and compare it against a protein sequence query to get a protein alignment back. Again, here, we're doing six searches, and here, we can find homologous proteins in an unannotated DNA database which you might use if you've got an EST database or a shotgun database from a new genome project. The final flavour of BLAST that's used very often is blastp. Here, what we're doing is we're comparing a protein sequence against a protein database, and what we get back is a protein alignment. Here, only one search is done because the translation has already been done for us, and we can use blastp to find homologous proteins. So there are various databases that are available at NCBI. There are protein databases. So we've got the nr database which is the non-redundant GenBank Database, Swiss-Prot is a higher quality manually curated database, there's a patent database, monthly updates, Protein Data Bank database, which is the collection of all proteins that have structural information from the PDB. There are a bunch of different nucleotide databases, nr, est, the patent...collection of patent nucleotide sequences, there's chromosomes, whole genome shotgun sequencing. It's also important to know what database you want to search in. So typically, when you're just doing an initial search, you probably use the nr database, which is the collection of all sequences available in GenBank (with some exceptions!). So another flavour of BLAST that's quite useful, especially for identifying distantly-related sequences, is PSI-BLAST which stands for position-specific iterated-BLAST. The reason for using PSI-BLAST is that motif or profile search methods are often more sensitive than pairwise comparisons at detecting distant relationships. So the way it works is to create multiple sequence alignments from the BLAST output. So PSI-BLAST does this automatically for us. It then uses that multiple sequence alignment, which we'll be talking about next week, to create a position-specific scoring matrix (PSSM). This PSSM is then used to score the BLAST search instead of the PAM or the BLOSUM matrix, and then we use this new PSSM to iterate through and each time we update the PSSM to include new members that have been identified with the previous PSSM. The reason this is powerful is that keep in mind how the PAM and BLOSUM matrices were initially generated... they're generated from a curated set of fairly closely related sequences, so they might not capture the features of the particular sequence that we're interested in. So this just shows why we might use PSI-BLAST. So look at this alignment here at the top. We see at this position here that R, I, and K are fairly well conserved. But if we were to look at their scores in the BLOSUM matrix, in fact, they are negatively scored in some cases. So for some sequences at some positions, certain substitutions are allowed, and PSI-BLAST allows us to capture that, and this matrix here just shows that some of the changes that we see after position-specific alignment aren't reflected in the typical PAM or BLOSUM matrix. So how do we evaluate the BLAST results? The main question is, is a database sequence homologous to the query? We can use significant expect values, we can also do reciprocal best hits, and ask the question whether or not the sequences are similar sizes, do they share common motifs, is there reasonable multiple sequence alignment, do they have similar 3D structures, and how can we ask whether one database hit is better than another? We could use sequence identity, was just counting up the number of amino acids that are identical between the two sequences. But it's really not advised because the distribution is not well understood as the alignment length increases here and the percent identity within the alignments decreases. There's not really a linear behaviour. There's a false positive rate, it ignores gaps, and conservative versus radical substitutions. So it's not really a good metric. We could use the bit-score and BLAST does report two bit scores, the S and the R. The raw bit score for DNA sequences looks like this where a is the reward for each identity, I is the number of identities in the alignment. b is the "reward" for each mismatch, X is the number of mismatched residues, and then we subtract the gap opening penalty (c) times the number of gaps (O), and then we subtract the penalty for each gap (d) times the length of the gaps (G). In the case of DNA sequences, the a and b defaults are one and minus two for Blastn, and then a slightly different formula is used for protein sequences where we use the scores from the substitution matrices for generating our raw bit score. We can also adjust the gap opening penalty and the gap extension penalty in BLAST. Normalized bit scores are better to use because there are a couple of different parameters that are introduced, scaling factors which convert pairwise match scores to probabilities, the lambda. K is a proportionality constant to correct for the number of sequence comparisons. So for large databases, this means we can compare these scores between databases better. They make the bit scores and the E-values independent of the scoring system. These parameters are available from the BLAST search summary. The E-values are the best measure of significance. It converts a bit score into a probability. The E-value depends on the bit score, as you can see from the equation. It also depends on the effective length of the query which is m and also on the effective length of the database, n, just the number of nucleotides or amino acid residues in the database. Basically, the E-value is the probability of finding a database match to your query as good or better by chance. So how good is my hit? Can we use identity? No. We could use the bit score, it's better. But using the E-value is the best thing. This is the probability value that is based on a number of different alignments with scores at least as good as that observed which are expected to occur simply by chance. The lower the E-value, the more significant the score. Just keep in mind that the E-value is dependent on the size of the database and on the length of the query sequence, so that if you search the same sequence in different databases and this database is contained identical hits sequences, those would result in different E-values being reported for those sequences. So what is a good E-value? So a good E-value is typically considered to be 10 to the minus 20. So if we look at this particular example from Mount's book, Bioinformatics, we have our query sequence here, these green and black boxes. If we had an alignment that looked like this, that also had a green and a black box in terms of, say, motifs, our E-value would typically be less than 10 to the minus 20. If one or the other of those regions were present, our E-value would be in the range of 10 to the minus eight to the 10 to the minus 20. And if we only had fragments of the one match present, we might fall into the E-value of range of 10 to the minus six to 10 to the minus eight. So you can use 10 to the minus 20 as an empirical cutoff for the sequence being a match in the database. I'm moving to orthology and paralogy. We talked about this a little bit last week. If we have an early globin gene ancestor of those three species indicated there, frog, chicken, and mouse, and then there's a gene duplication event in that ancestor and subsequent speciation events, keep in mind that the frog alpha, chick alpha, and mouse alpha chains are orthologs of one another. The beta chains are also orthologs of one another, whereby the alpha and the beta, the mouse alpha and the mouse beta are paralogs of one another. Altogether, all of those are known as homologs. So orthology can be used to identify conserved residues within genes and proteins. We can also use comparative genomic methods on intron sequences and promoters to identify parts of these that are important for function. So how can we determine orthology and genomic sequences? Well, we can use TBLASTX, (translated BLAST) or BLASTP. We've got the translations, take the reference genome BLAST against the other genomes and take the gene with the best E-value as the orthologous region. The problem is what happens if Blasting in the other direction identifies a match in the reference genome that is better, what is the ortholog? We can also use the reciprocal best hit method which addresses the above issue. This can be confounded by rampant domain swapping. This happens quite commonly in eukaryotic genomes. We get lots of false negatives. We might want to use phylogenetic-based methods such as RIO or Orthostrapper or RSD or BLASTP based methods such as InParanoid, OrthoMCL or KOG: these used BLASTP followed by Markov or other clustering methods. These methods are constantly evolving and each has its advantage or disadvantage. If we're interested in having a low false negative rate, for instance, when a false negative rate is on this axis versus the false positive rate on this axis, we might want to use a method like OrthoMCL or KOG or InParanoid. However, if we don't want a lot of false positives. and we're willing to accept a few more false negatives, we might use something like RIO in that case. So there are several ortholog databases that are available. We could use Clusters of Orthologous Groups, (COG) and EuKaryotic Orthologous Groups, KOG, from NCBI. This contains orthologs for several species, it's slightly older. HieranoiDB has orthologs across 66 species. Again, slightly older. OrthoMCL DB contains orthologs for many species. Again, slightly out of date. The InParanoid database has orthologs for 273 species. That's from 2013. CoGe seems to be the most up-to-date database for orthologs, it encompasses 50,000 genomes. It's very up-to-date. It's got some nice tools for synteny. You may find other orthologous databases but the question that you might want to ask yourself is, how up to date these are, and what genome versions do these databases use, and those kinds of questions. So there are a couple tools that we'll be exploring for comparative genomics and genome browsing using GBrowse/JBrowse, which is a standard tool for many model organisms. Often in GBrowse instances, we'll have tracks that show us the homology between our given genome of interest to other species that have been sequenced. We can see the degree of homology in these tracks and we'll be exploring that in the lab today. We can also use something called ACT, which is the Artemis Comparison Tool. This allows for cross genome comparisons. It's very useful for identifying syntenic regions. What is synteny? Synteny is the preservation of gene order on chromosomes of related species. So during evolution, genomic rearrangements can separate two loci. The result is a loss of synteny between them. Conversely, translocations can also join two previously separate pieces of chromosomes, which is a relatively rare event. In that case, we would have a gain of synteny between loci. A synteny can be useful in the case of many-to-many or one-to-many ortholog mappings for determining the "true" ortholog. It allows also for the identification of translocations and inversions, which might be important for understanding the biology of the organism. These would show up as X shaped figures in ACT. So here we see it against small X here, and here's also seems to be an inversion here in these two regions. So we've talked about substitution matrices, the theory behind them, what they try to capture in terms of evolution. We've also talked about alignment methods, and specifically the dot matrix method for comparing two sequences, as well as BLAST. Then we've talked about evaluating significance...that we should use E-value to determine whether or not a certain sequence is a significant match. Then we touched on comparative genomics. Hope you enjoy the lab, and will see you next time.