0:06

Suffix arrays are useful data structure that you already used

in the previous modules but now you will learn how to build it really fast and

first we'll recall what is a suffix array.

So the problem of construction of a suffix array is very simple you're given a string

and you need to sort all of it's suffixes in lexicographic order.

However as we will soon see you won't need to actually compute all the suffixes and

then solve them and

output all of them because that will use too much both time and memory.

You will just need to know in which order are those suffixes.

And the suffixes themselves sorted in lexicographical order

are only in our head.

They're not stored anywhere in the problem.

So we assume that the alphabet from which our strings are built are ordered, so

that any two characters we can say which one of them is smaller.

For example, in English we can order all the characters from a to z in a binary

alphabet we just have zero and one and zero is less than one.

By definition a string S is smaller than a different string T if either

S is a prefix of T or S and T coincide from beginning up to some character and

then the next character in S is smaller than the corresponding character in T.

For example, if s is ab, and t is bc.

Then they don't coincide.

But the first character is already different.

And the character in s is a.

And the character in t is b.

A is less than b, so s is less than t.

And in the second example, s and t coincide for the first two characters.

And then the third character c is less than character d.

So s is smaller than t.

And in the third case, s is a prefix of t, but

it is different from t, so s is smaller than t.

And here is an example of suffix array.

We have a string, s, and

all suffixes ordered in lexicographic order are a, aa, and so on.

So, here are exactly six suffixes because the length of string S is six,

and so we have six different suffixes.

We want to avoid this case when S is a prefix of T and that is why S is less than

T because this case is different from all others and usually you just compare S and

T from the first character and go the right until they differ.

And then see which character is smaller.

And this is a corner case when you go up to the end of the S, and

then you see that there is nothing there and so that is why S is smaller.

So to avoid using that rule at all, we will append a special character

called dollar to the end of the string for which we'll build suffix array.

So all the suffixes will have this dollar on the end.

And now if initially some suffix was a prefix of another suffix.

Now it is just smaller by the usual rule, because as soon as it ends,

and it is still coinciding with the prefix of the bigger suffix, the next character

in the smaller one is taller, which is smaller than all other characters.

And so we can determine by the usual rule

that the smaller suffix is actually smaller.

3:27

So how the suffix array changes in this case.

We have initial string S, ababaa and

we append dollar to the end and we get S prime.

And now all the suffixes in the lexicographic

order of string S prime are $, a$, aa$ and so on.

And if we just remove dollar from the end of each of these strings we

will have the suffix array of the initial string s, preceded by an empty string.

4:09

What about storing the suffix array,

suppose we have some algorithm to computed.

How are we going to store?

We want our suffix array to be stored in a linear memory, but

the total length of all suffixes is some of arithmetic progression from one

to the length of the string, which is quadratic.

And so, it will take too much memory to store the suffix rate,

even if we can compute it fast.

So, we need to store only the order of the suffixes, not the suffixes themselves.

And the order is just a permutation of numbers,

and the number of those numbers is the length of the string.

So that will take [INAUDIBLE] of your time.

And so that is what we mean by suffix array.

This order.

So it will be just an array of positions.

And all the positions are from 0 to length of the string minus 1.

And the array has length equal to the length of the string.

Now let's look at an example of such order.

So we have initially string S which is ababaa$.

And we number all the suffixes by their starting positions.

For example ababaa$ is 0 and abaa$ is 2.

And we will start the order of the suffixes in an array order.

So the smallest suffix is just $ because $ is smaller than any other characters.

So the first suffix and the order is suffix number 6.

And the next one is a$ which is number 5 and then aa$ which is

number 4 then ababaa$ which is 2 then 0 then 3 and then 1.

So this is the kind of array which we call suffix array.

Which is the order of all the suffixes of the initial string.

And, we don't store the suffixes themselves.

However, if we need to look at, for example, the third character of

the second suffix in order we can first go into the area order,

find out which suffix is number two, and

that will be the first position of that suffix in the stream.

And if we then add two to that

we'll get character with position two in that suffix.

So in theory we can look at any character of any suffix

really efficiently although we don't store those suffixes directly.

Okay, now you know how to store the suffix array and

how to manipulate it efficiently.

But you probably wonder how to actually construct it.

And you will learn that the next few videos.