
We're going to finish by looking at an application, solving a

more difficult algorithmic problem, to illustrate

how sorting can play a role in so many important applications.

So, the idea is,

you've got a string,

it might be a very, very long string,

millions or billions of characters.

And what you want to do is find

the longest substring in that string that appears at least twice,

the longest repeated substring problem.

And one motivation for this is from genomics,

where we have a long string taken from the ACGT alphabet.

So, in that case,

there's a substring of length five that appears twice in that string,

and that's the longest one.

So, maybe you saw that one. Here's another one.

Now, you can see that this can be a challenging problem,

even for humans to solve,

and even for a computer, it's a challenge.

In this case, the two occurrences overlap, and maybe we would not allow overlaps,

but we'll allow them.

So, there's two examples of finding the longest substring that appears at least twice.

Here's the first hundred digits of pi.

Are there long repeats in that?

In this case, not too many.

The longest repeat in the first hundred digits of pi is just three digits long.

So, those are some examples to illustrate this problem.

Here's another example: people study music to understand

the repetitive structure and perhaps compare different musical compositions.

So, in this one, there's an arc connecting parts that repeat.

This is Mary Had a Little Lamb.

If you hum that to yourself,

you can recognize this repetitive structure.

Here's another one that's much more intricate,

and if you study it for a while,

it's pretty interesting the way that it builds.

There's the big arc across,

but the way the long ones in the middle build up is quite interesting.

That one is Fur Elise.

And so, you can understand how this idea

of things repeating is important in lots of contexts.

But really, the reason that people look for

repeated sequences in the real world is that repetition has something to do with causality.

It means there's maybe something there.

So, for example, is there a difference between the digits of pi and random digits?

That's kind of an interesting philosophical question.

We know it's not rational, it's transcendental,

but is that the same as random?

The answer is no,

but we can't tell the difference:

the digits of pi pass

any test of randomness that people have performed.

So, for example, if you build

a mathematical model for the length of the longest repeated substring,

say, it should be about 14 for 10 million digits,

then that's what it is for pi.

In cryptography, the longest repeated substring

is important if you have data that's supposed to be encrypted.

One thing that cryptanalysts do is find the longest repeated substring.

And why do they do that?

Well, sometimes, the people doing

the encryption make a mistake and repeat certain header information.

And if you read about cryptography,

you can learn about famous examples of this.

So you find that repeat and,

assuming something about the header information,

that's one way in to break a code.

And the other example is DNA.

In that case, if you've got a really long repeated stretch,

then people assume it's kind of junk and really we

want to look somewhere else for what matters.

And these strings can be really long.

So, just one chromosome might have millions of nucleotides.

So, we want to have an algorithm that scales for solving this problem,

for all three of these applications.

So, let's take a look at it.

So, here's a simple warm up called the longest common prefix problem.

So, we have two strings,

and we want to find the longest substring

that appears at the beginning of both of those strings,

and we'll call that LCP.

And this one is really easy.

You just compare character by character until you find the thing that's not equal.

So, here's the code for that.

Whichever is the shorter string,

you use that as the termination of your for loop,

and you just go through.

The first time you find a position where the characters are not equal,

the substring up to that point is what you return.

And if you get through the whole loop,

the shorter string is a prefix of the longer one,

so you just return the shorter string.

So that's the LCP,

takes two strings as arguments and returns

a string which is the longest common prefix of the two strings.

And we'll use that as a helper method in our solution to longest repeated substring.
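
The LCP helper just described can be sketched like this; the class wrapper and the main method are mine for illustration, not necessarily the course's exact code:

```java
public class LCP {
    // Return the longest string that is a prefix of both s and t.
    public static String lcp(String s, String t) {
        // The shorter string's length terminates the loop.
        int n = Math.min(s.length(), t.length());
        for (int i = 0; i < n; i++) {
            // First mismatch: the common prefix is everything before it.
            if (s.charAt(i) != t.charAt(i)) return s.substring(0, i);
        }
        // One string is a prefix of the other; return the shorter one.
        return s.substring(0, n);
    }

    public static void main(String[] args) {
        System.out.println(lcp("acaagtttac", "acaagc"));  // prints acaag
    }
}
```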

So, just building on that,

our first approach to solving

the longest repeated substring problem is a brute-force method.

And again, often, we solve problems with

simple methods to make sure that we understand them, and also to

check our more sophisticated algorithms on small cases,

or maybe we have only small cases that we need to solve in the first place.

So, what do we want to do to find the longest repeated substring?

Well, we take our string as input,

and we're supposed to return a string which is the longest repeated substring,

and we pick off the length of the string into a variable n.

And what we're going to do is,

for every position i in the string,

and every later position j in the string,

we're going to test whether the suffix starting at i and the suffix

starting at j have a longer common prefix than what we've seen before.

So we compute the longest common prefix of the substring

from i to the end and the substring from j to the end.

So that's X, and again,

that's easy to compute,

and then we just check if the length of X is

bigger than the length of the longest that we've seen so far.

And if it is, we save X as our longest repeated substring.

So, we're trying all pairs of positions in the string to see

the length of the longest common prefix starting at those positions,

and those are suffixes of the original string.

So, a straightforward method,

try all possibilities, and that certainly works for our simple example.

I call it LRS brute just to distinguish it from later methods.
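
A sketch of that brute-force method, with the LCP helper inlined so the class is self-contained; the names here are mine, not necessarily the course's:

```java
public class LRSBrute {
    // Longest common prefix of two strings, as described earlier.
    private static String lcp(String s, String t) {
        int n = Math.min(s.length(), t.length());
        for (int i = 0; i < n; i++)
            if (s.charAt(i) != t.charAt(i)) return s.substring(0, i);
        return s.substring(0, n);
    }

    // Try every pair of starting positions i < j and keep the longest
    // common prefix of the two suffixes seen so far.
    public static String lrs(String s) {
        int n = s.length();
        String lrs = "";
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                String x = lcp(s.substring(i), s.substring(j));
                if (x.length() > lrs.length()) lrs = x;
            }
        return lrs;
    }

    public static void main(String[] args) {
        System.out.println(lrs("aacaagtttacaagc"));  // prints acaag
    }
}
```

Note that overlapping occurrences are allowed, as the lecture says: on "aaa" this returns "aa".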

So, the analysis of this is that it's going to take about N squared over two;

you've got nested for loops for i and j.

There's going to be about N squared over two calls on the longest common prefix method.

And in that method, it takes time proportional to the length of the prefix, so

it obviously doesn't scale.

So, we can't consider using this method for something that's

got millions of characters in it, as in the genomics application;

we need a method that scales.

Well, it turns out that there is a very efficient solution that was

discovered decades ago that uses sorting and here's the idea.

It's called suffix sorting.

So, the idea is that we're going to form what are called suffix strings:

the substrings that you get by starting at each position.

So, at position zero,

that's the whole string.

At position one, it's the string that you get starting

at position one in the original string, and so forth.

So, for a string of length N there are N suffix strings.

Starting at position 10, for example,

the suffix is the five characters at the end of the string, CAAGC.

So, there are N of those suffix strings.

So, that's the first step in this solution.

And then what we're going to do is sort them.

They're just strings, so we'll go ahead and sort them.

And we'll keep track of the indices that go with them,

but the idea is to just sort the strings.

So now, we don't have to check all pairs because

the ones that have long common prefixes are going to be together in the sort.

So, what we need to do is just look at adjacent entries,

and then find the longest common prefix among adjacent entries in the sort.

If they're not together in the sort,

they're not going to have as long a prefix as the ones that are together in the sort.

And that's easy to see when you think about it.

So, in this case,

the suffixes starting with ACAAG come

together in the sort, and that's the longest one.

Even if we tried all pairs instead of just adjacent entries,

we would not see a longer one than that.

So, the longest common prefix

applied to the sorted suffixes gives us a solution to the longest repeated substring problem.

So, let's look at the implementation in Java.

So, we're going to create an array of strings called suffixes,

and then for every position in our string,

we're going to take the substring from i to N,

and put that in suffixes of I,

then use merge sort to sort it.

And then same as before,

except now we just look at adjacent positions,

we pull out the longest common prefix,

put that in the string X,

check if it's longer than the longest we've seen so far,

and save it as before.

We find the longest common prefix among adjacent entries, and

that's the implementation of longest repeated substring.
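
Here's a sketch of that suffix-sorting implementation, using the system sort (java.util.Arrays.sort) in place of a hand-written merge sort; the names are mine for illustration:

```java
import java.util.Arrays;

public class LRS {
    // Longest common prefix of two strings, as before.
    private static String lcp(String s, String t) {
        int n = Math.min(s.length(), t.length());
        for (int i = 0; i < n; i++)
            if (s.charAt(i) != t.charAt(i)) return s.substring(0, i);
        return s.substring(0, n);
    }

    public static String lrs(String s) {
        int n = s.length();
        // Form the n suffixes: the substring starting at each position.
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++) suffixes[i] = s.substring(i);
        // Sort them; suffixes with long common prefixes end up adjacent.
        Arrays.sort(suffixes);
        // Only adjacent entries in sorted order need to be compared.
        String lrs = "";
        for (int i = 0; i < n - 1; i++) {
            String x = lcp(suffixes[i], suffixes[i + 1]);
            if (x.length() > lrs.length()) lrs = x;
        }
        return lrs;
    }

    public static void main(String[] args) {
        System.out.println(lrs("aacaagtttacaagc"));  // prints acaag
    }
}
```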

The first thing about this method is, again,

it works on our small case.

The other thing about this method is,

it's just got N calls on substring to set up the suffix array,

and it's got N calls on longest common prefix,

so at least it's got the potential to scale, and we can test that empirically.

So, here's the empirical analysis that we used in our course for many years,

a couple of decades.

So, we're talking about genomics,

so we'll use the ACGT alphabet,

and we'll take random strings,

so we can use our generator.

This is a generator that generates

one random string with a million characters in it,

drawn from that alphabet,

which is the type of thing that biologists need to find longest repeated substrings in,

and it took just two seconds.

And if you've got something that fast, you might as well multiply by ten.

So, for 10 million,

it takes about 20 seconds.

So, when the problem size increases by a factor of 10,

the ratio of the running times is about 10;

that's a method that scales.
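
An experiment along these lines can be sketched as follows; the generator class and the fixed seed are my assumptions, added to make the run reproducible:

```java
import java.util.Random;

public class RandomGenome {
    // Generate one random string of n characters drawn from ACGT.
    public static String random(int n, long seed) {
        char[] alphabet = {'A', 'C', 'G', 'T'};
        Random rng = new Random(seed);
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++)
            sb.append(alphabet[rng.nextInt(4)]);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Time the longest-repeated-substring computation on the input,
        // then repeat with 10x the size to estimate the growth ratio.
        String genome = random(1000000, 42);
        System.out.println(genome.length() + " characters generated");
    }
}
```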

Or if you prefer, do the doubling test and see that the ratio is about two, and so

that's an empirical analysis that confirms

the hypothesis that the order of growth

of the running time of this method is about N log N,

and that time is for the sort.

And you can use the highly optimized sort,

like the system sort,

if you don't want to use our merge sort.

And this is the way that we taught it for many years.

So, why am I bringing that up now?

Well, that was a really important discovery: that you could have methods

that allowed scientists to

scale their studies with the size of the input.

Then in the 90s and early 2000s,

the amount of data was growing as the technology for bioinformatics was improving,

and this type of algorithm definitely enabled new research and development.

But a few years ago, something else happened: in testing this code for this lecture,

it crashed; it ran out of memory.

So what's going on?

We had a change in the system that broke a working program.

And for all we know,

broke the working programs of scientists and researchers out there trying to

understand more about DNA and the genetic code.

What happened? Well, as it turns out,

there are two alternatives for implementing substrings in Java.

There's actually several, but I'll talk about these two.

So, let's say we have this string and there are two substrings.

So, the way that it was done in Java up to 2012,

was that substrings were created by reference to the original string.

So, if genome is our original string,

then it would be a reference:

the actual string would be at some address X,

and that memory address would

be part of the data associated with genome along with the length.

And then the substrings could just be different addresses within that original string.

So, there's no need when you take a substring to copy any characters,

it's just a matter of saving away those indices, which can be done in constant time.

And it also takes constant space: just that pointer and a length.

That's the way that Java strings were designed for the first couple of decades.

But since 2012, they decided to change the representation to one where,

when you take a substring,

it would copy the characters to make a new string.

Now, the reason they did this is people would have huge strings like

the contents of a web page and they would take

only a small substring out of that huge string and hold onto it.

Even though they only needed a small number of characters,

Java couldn't reclaim the memory for

the original huge string because of this organization.

So, it allowed the potential to free up a lot of memory,

when the original string is no longer needed.

But unfortunately, what we saw is that this

requires linear time and space to implement substring.

And that's why our programs broke.

So, this is not too hard to fix,

what we can do is implement

our own constant-time substring operation, which imitates the old implementation.

We have to implement a compareTo method to enable the sort.

But that's not too big a deal.

And the details of this are covered in our algorithms course.
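
A sketch of that fix: represent a suffix as a reference to the full string plus an offset, so creating one takes constant time and space, with a compareTo method to enable the sort. The class design here is an illustration, not the course's exact code:

```java
public class Suffix implements Comparable<Suffix> {
    private final String text;   // shared reference to the full string
    private final int offset;    // where this suffix starts

    // Constant time and space: no characters are copied.
    public Suffix(String text, int offset) {
        this.text = text;
        this.offset = offset;
    }

    public int length()      { return text.length() - offset; }
    public char charAt(int i) { return text.charAt(offset + i); }

    // Character-by-character comparison, so sorting works without
    // ever materializing the suffix as a separate string.
    public int compareTo(Suffix that) {
        int n = Math.min(this.length(), that.length());
        for (int i = 0; i < n; i++) {
            if (this.charAt(i) != that.charAt(i))
                return this.charAt(i) - that.charAt(i);
        }
        return this.length() - that.length();
    }

    public static void main(String[] args) {
        Suffix a = new Suffix("banana", 1);  // represents "anana"
        Suffix b = new Suffix("banana", 3);  // represents "ana"
        System.out.println(a.compareTo(b) > 0);  // prints true
    }
}
```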

So Alice, for example, is in good shape.

And if we do that, even today,

we can find longest repeated substrings in

huge strings in a very small amount of time.

So, the lesson is: trust the algorithm.

This algorithm was around before Java,

it'll be around after Java,

you want to trust the algorithm not the system.

And people are continuing research and development in these kinds of areas.

Just one final note,

when we're running this longest common prefix computation,

it turns out that the running time is going to be

quadratic in the length of that longest repeat.

And in our model,

we don't have any long repeats, but real data actually has long repeats.

So, people have developed more sophisticated algorithms than this one that

are guaranteed to run in linear time, even faster than sorting.

That's just to indicate the care people are taking, and the

importance of scientifically analyzing

performance of algorithms in the way that we're doing it.

So, the summary is that we have an efficient algorithm to search a sorted array.

We have an efficient algorithm to sort an array.

And we have many, many things that are enabled by fast sort and search.

And the important idea is the scientific approach

to understand the performance of our algorithms by building a mathematical model,

and validating it with empirical tests.

Very natural in all kinds of areas of science

and it's particularly important in Computer Science.

We're not just hacking to get a problem solved,

we're understanding performance and performance matters.

So, Alice has got her IPO going,

and Bob is rethinking the idea of a hackathon; maybe

he'll take a few CS courses.

So, that's our introduction to the second part of the course.