Hi. Today we will talk about the arms race between the researchers trying to develop an HIV vaccine and the rapidly evolving human immunodeficiency virus. This discussion will bring us to the challenge of comparing highly diverse viruses from different patients, or comparing highly diverse proteins from species separated by millions of years of evolution. We will learn that the traditional sequence alignment algorithms that we studied before often fail to detect subtle similarities between highly diverged sequences. To address this limitation, we will develop a new computational framework called Hidden Markov Models, whose applications in bioinformatics extend well beyond sequence comparison.
We will start by classifying HIV phenotypes and asking what HIV phenotype Gaëtan Dugas had. Gaëtan Dugas was a flight attendant who traveled the world and engaged in sexual liaisons with hundreds of men. In 1984, he became the most studied HIV patient in the world when biologists constructed a network of 40 HIV patients whom he infected with HIV. He died in the same year and was named AIDS patient zero, although, of course, he was not the first HIV patient. In the same year, the U.S. Health Secretary, Margaret Heckler, announced that an HIV vaccine was about to be developed. Thirteen years later, Bill Clinton said: "It's no longer a question of whether we can develop an AIDS vaccine; it is simply a question of when."
Well, we still don't have an HIV vaccine today, and to understand why, it's important to figure out how HIV evades the human immune system. Vaccines are often made from virus surface proteins to train the human immune system to recognize viral envelope proteins and destroy them. However, this strategy didn't work for HIV, because HIV evolves so fast that there is actually a multitude of HIV subtypes.
This slide shows ten sequences from the HIV envelope glycoprotein gp120 that show a large number of substitutions, insertions, and deletions within a single patient. HIV viruses in a single patient evolve at a very fast rate of 2% per nucleotide per year. HIV strains in different patients are so diverged that they require different drug cocktails. Today we will talk about one specific HIV phenotype called the syncytium-inducing phenotype. HIV has the ability to trick human cells into fusing together to form one giant cell. Why would HIV do such a strange thing? Because it's easier to kill all the cells once they are fused into a single cell.
The question we will be interested in today is: "Given an HIV sequence, can we figure out whether it has a syncytium-inducing phenotype or not?" In this slide, you see 20 sequences from protein gp120, actually from a spectacularly conserved region of this protein called the V3 loop. Despite the fact that it's conserved, the sequences have many mutations and even have different lengths, so we definitely need to align them. We know that six of these sequences have the syncytium-inducing phenotype, but what sequence features define this phenotype? Biologists noticed that in HIV viruses with the syncytium-inducing phenotype, the amino acids at positions 11 and 25 of the V3 region are arginine or lysine, and this became the first computational rule for classifying the syncytium-inducing phenotype. It later turned out that the classifying rules are much more complex, but we will not go into the details of the more complex rules.
What we will notice, however, is that when we construct this alignment and build a profile and sequence logo from it, the conservation of the sequences is extremely non-uniform across different positions. For example, positions 11 and 25 are particularly poorly conserved in this case.
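As a rough illustration, the rule about positions 11 and 25 can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name `rule_11_25`, the residue set `RK` (arginine or lysine), the choice to require both positions by default, and the two toy strings are all made up for illustration and are not taken from the slide. The positions are treated as 1-based columns of an already aligned V3 sequence.

```python
# Hypothetical sketch of the positions-11-and-25 rule; all names are illustrative.
POSITIVE = frozenset("RK")  # assumed residue set: arginine (R) or lysine (K)


def rule_11_25(aligned_v3, positions=(11, 25), residues=POSITIVE, require_both=True):
    """Return True if the aligned V3 sequence passes the rule.

    positions are 1-based alignment columns. A column that falls beyond the
    end of the sequence counts as a miss.
    """
    hits = [
        p <= len(aligned_v3) and aligned_v3[p - 1] in residues
        for p in positions
    ]
    # Assumption: by default both positions must match; set require_both=False
    # to accept a match at either position.
    return all(hits) if require_both else any(hits)


# Toy 35-residue strings, invented for illustration only.
toy_without_rule = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"  # S at 11, E at 25
toy_with_rule = "CTRPNNNTRKRIHIGPGRAFYTTGKIIGDIRQAHC"     # R at 11, K at 25

for name, seq in [("without", toy_without_rule), ("with", toy_with_rule)]:
    print(name, rule_11_25(seq))
```

Because the sequences have different lengths, such a rule is only meaningful once every sequence has been placed into the same alignment columns, which is why the alignment step matters so much.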
This brings one challenge and one question. The challenge is that, to predict an HIV phenotype, it is very important to align the sequences correctly. In our case, aligning the V3 region is not difficult in the sense that we can figure out that all V3 regions belong to the same type of protein. However, aligning them correctly, so that all amino acids belonging to positions 11 and 25 appear in the same columns, may become a challenge. And it brings the following question: was it a good idea to use the same scoring matrix across different columns of an alignment, something we have done in our previous studies of multiple alignment? A brief look at this sequence logo tells us that we should develop a new, statistically solid problem formulation for alignment that uses a different scoring approach at different columns.
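To make the idea of non-uniform conservation concrete, here is a minimal Python sketch that builds a profile (per-column symbol frequencies) from a multiple alignment and measures how variable each column is using Shannon entropy. The function names `profile` and `column_entropy`, the decision to count the gap symbol `-` as an ordinary symbol, and the four-row toy alignment are all assumptions made for illustration; they are not the sequences from the slide.

```python
import math
from collections import Counter


def profile(alignment):
    """Return one {symbol: frequency} dict per column of the alignment.

    Assumes all rows have the same length; '-' is counted like any symbol.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    n = len(alignment)
    return [
        {sym: count / n for sym, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


def column_entropy(freqs):
    """Shannon entropy in bits: 0 for a fully conserved column."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


# Toy alignment, invented for illustration only.
toy_alignment = ["ACDE", "ACDF", "ACEG", "A-EH"]

for i, freqs in enumerate(profile(toy_alignment), start=1):
    print(i, round(column_entropy(freqs), 3), freqs)
```

Columns with low entropy are highly conserved and columns with high entropy are variable, which is one way to see why a single scoring matrix shared by all columns may be a poor fit.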