We just discussed edit distance, which we can compute for two strings using a powerful dynamic programming algorithm, so now we want to apply the same idea to the approximate matching problem: the problem of finding approximate occurrences of a pattern P in a text T. Shown here is the sort of matrix we talked about in a previous lecture for finding the edit distance between x and y, where the characters of x label the rows of the matrix and the characters of y label the columns. Now we're going to imagine a new kind of matrix where the rows are labeled with characters of our pattern P, and the columns are labeled with characters from our text T. One change we'll make to our edit distance algorithm is that we'll initialize this matrix a little differently. Before, we initialized the first row and the first column with ascending integers: zero, one, two, three, and so on. This made sense because if we're talking about the edit distance between, on the one hand, the empty string and, on the other hand, a string of length three, for example, then that edit distance has to be three. For approximate matching, we're instead going to initialize the first row with all zeros. The first column we'll initialize as we did before, with ascending integers. So why do we initialize the first row with all zeros? It's because we don't know ahead of time where P is going to occur within T; every offset is equally likely, so by filling the first row with all zeros we're not biasing ourselves toward any particular offset where P might occur within T. If this point isn't obvious right now, it'll probably become clearer after the description of the algorithm.
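To make the initialization concrete, here is a minimal sketch in Python. The function name `init_matrix` is my own, not from the lecture:

```python
def init_matrix(p, t):
    # (len(p)+1) x (len(t)+1) matrix: rows labeled by P, columns by T.
    D = [[0] * (len(t) + 1) for _ in range(len(p) + 1)]
    # First column: ascending integers, just as in plain edit distance,
    # since aligning a length-i prefix of P to the empty string costs i.
    for i in range(len(p) + 1):
        D[i][0] = i
    # First row is left as all zeros: P may begin at any offset of T,
    # so no starting column is penalized.
    return D
```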
So once the matrix is initialized, we simply fill in the rest of it in exactly the same way as we did for the edit distance problem. Now that we've filled in the matrix, we'd like to identify where the approximate matches of P within T are. In this case, by looking at the matrix, we learn that there must be a two-edit occurrence of P within T, and that this is the occurrence with the fewest edits. We see this by looking in the final row of the matrix, where we find a 2. That tells us that there must be some substring of T that matches our pattern P with two edits. 2 is the minimal value in the final row, so it corresponds to the closest match between P and some substring of T. So that tells us that P matches somewhere with two edits. What we don't know yet is exactly where that matching substring of T is. To answer that, we first have to answer the question: how did we get here? How did we get down to that value of 2 in the bottom row? In the first step, we can ask something very specific: how did this 2 end up in this particular element when we calculated the minimum over those three terms? We ended up with a 2, and in this case it must have been that we came diagonally from above and to the left. Had we come either horizontally or vertically, the value would have been 4, because the values above and to the left are both 3, and we have to add 1 in both cases to account for that final gap. So if we had come horizontally or vertically, the answer would be 4; because it's not, we can conclude that we must have come diagonally. That means we got to the blue cell from this red cell. Now we can ask the same question of that red cell: how did I get here? Again, we can look at the vertical, horizontal, and diagonal contributions. The vertical and horizontal contributions would both be 3, because both the cell above and the cell to the left have a 2 in them.
We would have to add 1 to account for that final gap, so they would both contribute a 3. But the diagonal move from above and to the left would contribute only 2, so we must have come diagonally. We can repeat this kind of procedure to go cell by cell backwards, recreating the way we got the value in each cell. Once we get to this cell here, we'll ask the same question, and instead of concluding that we came diagonally, this time we'll conclude that we came vertically. Look again at those three contributions. This time the diagonal contribution actually is greater than the vertical contribution. The diagonal contribution would be 2, because there's a 1 in the cell diagonally above and to the left, and the delta function would evaluate to 1 in this case, because the C in the text mismatches the G in the pattern if you look at the character labeling that column and the character labeling that row. But the cell vertically above has a 0 in it, and adding 1 to account for the final gap gives 1, which is the minimal value. So in this case we came from above, and we can then repeat this procedure to get all the way back up into the very top row. Now we know the full answer to the question of how we got to this 2 in the bottom row: we followed the path shown here with the purple arrow. This path is called the traceback, and the procedure we followed to find it is called tracing back. It's as though we're retracing our steps to figure out how we got the value that was placed in that cell in the bottom row. Once we have the traceback, we know exactly which substring of T matches P with two edits: it's the substring that begins at offset 5 with respect to T. We know this because this element here is the first one where we've aligned a character from T to a character from P.
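Putting the initialization and the fill step together with the scan of the final row, a sketch in Python might look like this. The function name is illustrative; it returns the fewest edits with which P occurs anywhere in T:

```python
def min_edits(p, t):
    # Initialize: first column ascends, first row stays all zeros.
    D = [[0] * (len(t) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        D[i][0] = i
    # Fill, using the same recurrence as for edit distance.
    for i in range(1, len(p) + 1):
        for j in range(1, len(t) + 1):
            delta = 0 if p[i - 1] == t[j - 1] else 1  # substitution penalty
            D[i][j] = min(D[i - 1][j] + 1,          # vertical: gap in T
                          D[i][j - 1] + 1,          # horizontal: gap in P
                          D[i - 1][j - 1] + delta)  # diagonal
    # The minimal value in the final row is the fewest edits with
    # which P matches some substring of T.
    return min(D[len(p)])
```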
So in this case we've aligned this G, the G at offset 5 in the text T, with the very first G of P. The other thing the traceback tells us is the shape of the alignment; specifically, it tells us where the differences are between P and T. For example, the vertical part of this traceback corresponds to the only gap in the alignment, and so we can write down the alignment that corresponds to the traceback, like you see up here. Here I've highlighted the matches in green and the differences in red, and here is the gap that corresponds to the vertical step in the traceback. The overall shape of this alignment follows the overall shape of the traceback. So this is how we use edit distance to solve the approximate matching problem: we fill in a matrix just like this, we look in the final row to detect occurrences of P in T with some number of edits, and then we use the traceback to identify the exact location where that match occurs and what the shape of the alignment is. A big problem with methods like this is that they can be quite slow. The amount of work we had to do to solve this problem was proportional to the number of elements in the matrix, which in turn is proportional to the number of characters in P times the number of characters in T. This can add up to quite a lot of work. Compare that to the amount of work we were doing with Boyer-Moore, which was roughly proportional to just the length of the text T. In fact, for some variations of Boyer-Moore, the amount of work is, at worst, linearly proportional to the length of the text, and furthermore, the longer the pattern with Boyer-Moore, the more skipping we could do.
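The traceback procedure described above can also be sketched in code. This version (names are mine) refills the matrix, starts from the leftmost minimal cell in the final row, and walks back, preferring the diagonal when contributions tie, until it reaches the top row; the column where it lands is the offset in T where the closest match begins:

```python
def trace_back_offset(p, t):
    # Fill the approximate-matching matrix as before.
    D = [[0] * (len(t) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        D[i][0] = i
    for i in range(1, len(p) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                          D[i - 1][j - 1] + (0 if p[i - 1] == t[j - 1] else 1))
    # Start from the (leftmost) minimal cell in the final row.
    i, j = len(p), min(range(len(t) + 1), key=lambda c: D[len(p)][c])
    # Ask "how did we get here?" cell by cell, back up to the top row.
    while i > 0:
        diag = (D[i - 1][j - 1] + (0 if p[i - 1] == t[j - 1] else 1)
                if j > 0 else float('inf'))
        if diag == D[i][j]:
            i, j = i - 1, j - 1   # came diagonally (match or mismatch)
        elif D[i - 1][j] + 1 == D[i][j]:
            i -= 1                # came vertically (gap in T)
        else:
            j -= 1                # came horizontally (gap in P)
    return j  # column where we reach the top row = match offset in T
```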
So, this is not really a fair comparison, because Boyer-Moore is only capable of exact matching, whereas the algorithm we just learned is capable of approximate matching. But it does make this algorithm potentially impractical to use on its own for solving problems like, for example, the read alignment problem. In practice, edit distance-based approximate matching tends to be used in combination with other techniques we've learned about, like using an index or using the pigeonhole principle. By combining those two kinds of techniques, we get the flexibility of this edit distance algorithm, which naturally handles both mismatches and gaps, and we combine that with the speed that comes from using an index or otherwise using the pigeonhole principle to quickly filter out a large number of alignments that won't end up with a good edit distance.
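As a rough illustration of this combination, here is a sketch (all names are mine, and plain `str.find` stands in for a real index): P is partitioned into max_edits + 1 pieces, and by the pigeonhole principle, if P occurs with at most max_edits edits, at least one piece must occur exactly. Exact hits for the pieces nominate candidate windows of T, which are then verified with the edit-distance matrix:

```python
def edits_in_window(p, window):
    # Approximate-matching fill: fewest edits for P anywhere in window.
    D = [[0] * (len(window) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        D[i][0] = i
    for i in range(1, len(p) + 1):
        for j in range(1, len(window) + 1):
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                          D[i - 1][j - 1] + (0 if p[i - 1] == window[j - 1] else 1))
    return min(D[len(p)])

def pigeonhole_match(p, t, max_edits):
    # Partition P into max_edits + 1 non-overlapping pieces.
    n = max_edits + 1
    piece_len = len(p) // n
    for k in range(n):
        start = k * piece_len
        end = start + piece_len if k < n - 1 else len(p)
        piece = p[start:end]
        # Exact matching of the piece; a real tool would use an index here.
        j = t.find(piece)
        while j != -1:
            # Candidate window around the hit, padded by max_edits.
            lo = max(0, j - start - max_edits)
            hi = min(len(t), j - start + len(p) + max_edits)
            # Verify the candidate with the edit-distance matrix.
            if edits_in_window(p, t[lo:hi]) <= max_edits:
                return True
            j = t.find(piece, j + 1)
    return False
```

The expensive dynamic programming runs only over short candidate windows rather than all of T, which is the source of the speedup.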