0:00

Hi, in this video you'll learn the algorithms used in the initialization

phase of the suffix array construction.

And those are sorting of the single characters of the initial string and

also competing equivalence classes of those characters.

0:15

First we know that we assume the alphabet is finite.

And we can use thus counting sort to compute order of the characters.

You probably remember a counting sort from the first course, algorithm tool box.

If you don't, I encourage you to go through those lectures once again,

because we will use counting sort twice in the construction of the suffix array.

In the initialization phase and in the transfer phase.

Here is the pseudo codes for the procedure SortCharacters, which takes string as

an input and returns the order of the characters of that string as the output.

We start with initializing that order as an array,

of size equal to the length of the string.

And we'll also need another array count, used in the counting sort.

Which will initialize with zeroes, and which has size equal to the size of

the alphabet, not the size of the string, but the size of the alphabet.

1:11

And then what you have in the following two for

loops is just the familiar code for the recomputation phase of the counting sort.

We count the number of occurrences of each of the characters in the string and

then we also compute the partial sums of that array.

And then in the end, we go from the right to left in our string S.

We look at the character and we know that the partial sums array contains

the position after the position where this character should be in the order.

So we decrease the counter by one and we save our character

position in the corresponding cell of the array order and

in the end we just return the order of the character.

So this is just an implementation of the counting sort

as applied to characters of string has.

And it works in time proportional to length of the stream

plus size of the alphabet.

because we know that this is the running time of the counting sort for length of

S items, each of which can take only size of the alphabet, different values.

And I need to note here that typically the size of the alphabet is small like for

example, four letters, four streams in a genome, or 26 characters.

If we are only working with the English words, or maybe alphanumeric characters.

Then there will be 26 small letters, 26 big letters, and 10 digits.

But sometimes the alphabet can be very very big, such as Unicode.

And in this case counting sort might not be appropriate.

If your string for example has only 1000 characters but

those are all unique code, and the alphabet size is a few million character,

then maybe you could sort the characters of this string in a more efficient way.

3:13

Apart from sorting the characters, we will also need additional

information to make the following steps of the algorithm more efficient.

And to do that, we introduce equivalence classes of the partial cyclic shift.

So we denote by c with index i, partial cyclic shift of length L, where

L is the current length of the cyclic shifts, which we already have sorted.

And initially, we have sorted single characters.

So L is equal to 1.

And then on the further phases of the algorithm,

L will increase from one to two, to four, and so on, twice in each iteration.

So, some of the cyclic shifts can be equal to different cyclic shifts starting

in different positions.

Ci can be equal to Cj and then they should be in the same equivalence class.

So to assign equivalence classes, we define the area class.

And class of i is equal to the number of different cyclic shifts

of length L that are strictly smaller that the cyclic shift starting at position i.

4:25

So for different cyclic shifts which are equal,

the value of class[i] and class[j] will be the same.

Because the same other cyclic shifts are smaller

than these two equal cyclic shifts.

4:40

And we'll need to compute this array class to increase the speed of the next phase.

And before computing this array class, we assume that

we have already sorted all the cyclic shifts of the current length L.

So, how to actually compute the classes of the cyclic shifts

when we already know their order.

Let's look at the example of sorted characters of the string.

So we know already that the characters are sorted, and their order is 6, 0, 2,

5, 1, and 3.

Now let's assign classes.

We want to assign class 0 to the smallest of the cyclic

shifts of the current length.

Which is dollar, which is in position six.

So, we write 0 in position six of the class.

And we initially set up a class to be of length equal to

the length of the string of course.

The next, smallest cyclic shift is letter a and

it is different from the previous smallest one which is dollar.

So we need a new equivalence class for a.

And so, we assign 1 to the equivalent class of

a which is in position 0 in the initial string.

So we assigned 1 to class of 0.

And the next one is also a which is already in position two.

But it is equal to the previous one.

So we are saying the same equivalence class to it.

So we'll write down 1 as the value of class of 2.

The next one is also a.

It is also equal to the previous one.

So we assign 1 to class of 4.

And the same one we do with class of 5.

6:27

And the next one is b which is different again from the previous one and so

we assign new class which is bigger by one which is two so we assign 2 to the value.

Value of class of one because b we find in position 1.

So class of 1, we assign to value 2.

And then the last one is also b it is equal to the previous one so

again we assign 2 to class of 3.

And now we know the classes of all the single character cyclic shifts.

We know that the smallest one is dollar.

And it is the only one that's equals 0.

We know that 4 a's are in the equivalence class 1.

And we know that 2 b's are in the equivalentce class 2.

7:11

Here is the pseudo code for the algorithm ComputeCharClasses which

takes its inputs string S, and the order of the characters.

And computes the equivalence classes just for

single character cyclic shifts of the string S, given their order.

So we initialize the array class with just an area of size equal to the length of

the string S and that will be our return value.

Also initialize the first value of this class array.

But we don't initialize class of 0.

We initialize class of order of 0, because order of 0 is the position

in which the smallest character in the string occurs.

And we initialize this character with class 0.

So we assign class of order of 0 to 0 saying

that the character in position order of 0 has equivalent class of 0.

And then starting from second character in order up to the end,

we go through the characters of string in order and we assign classes.

To assign a class to a new character,

we compare it with the previous one in the order.

If it's different from the previous one means it's bigger

because we go through them in the order.

And so we need a new class.

And so we just take the value of the previous class,

increase it by one and assign to the class of this character.

Otherwise, if this character is the same as the previous one,

we don't need to create a new class.

We just assign the same class as the class of the previous character.

8:44

And in that, we return the array with the classes.

And we state that the running time of this algorithm is linear.

Which is obvious, because we only have initialization of the array.

And then for loop, which runs for linear number of iterations

with constant number actions performed in each iteration of the for loop.

And that's all for the initialization phase of the suffix array construction.

And in the next video, we'll learn the transition phase from the current length

to twice the length of the cyclic shifts.