Now we're going to take a look at what happens when we have significant numbers
of duplicate keys which is not at all unusual in practical applications. So, in,
in fact, often, the purpose of a sort is to bring items with equal keys together
for like the example that I gave where we had cities and time. There's a lot of
detailed data and the time and maybe the whole goal of the sort is to group them by
cities so we can ship out the data for each city, to each city and there's plenty
of other examples like that in data processing where we find maybe remove
duplicates from a mailing list or all the job applicants that we get, we might want
to sort them by the college attendant. So the sort does huge files with huge numbers
of duplicate keys. So, a sort, it's worthwhile to take a careful look at what
the implication of that is. So again, typical characteristics we have a huge
file but small number of different key values. So let's look at our algorithms in
that situation. So Mergesort, it doesn't matter that much what the key values are
like and it's actually, we can show that Mergesort always uses between one-half, N
log N and N log N compares. Quicksort actually, they're up until the 1990s the
most widely used implementation took quadratic time. For files with large
numbers of equal keys and that was actually found by applications user and,
and that's the standard Quicksort that was in all the textbooks almost all the
textbooks if you did not stop the partitioning on equal keys it would run in
quadratic time. So we want to do better than this. So the mistake happens if we
put all the items equal to the partitioning item on one side which is a
natural way to implement it and the consequence is if you have all the keys
equal, then partitioning doesn't really do anything. You just peels off one key to do
file size n then you get a sub file size n - one and then n - two and so forth and
the result is a quadratic tim e algorithm. Our implementation, we stopped the
partitioning scans on items equal to the partitioning item and then in that case,
when all the keys are equal, it's going to divide it exactly in the middle. And then
in that case, when all the keys are equal, it's going to divide at exactly in the
middle. And then in that case you get an N log N compares. But actually when you
think about it, why don't we just put all the items equal to the partitioning item
in place. That's, that's really a desirable way to look at it and let's take
a look at that option. So the goal is so called three way partitioning. So what we
want to do is get the array into three parts so then now we have two pointers
into the middle. One that is the boundary between the keys that are less than the
partitioning element and those that are equal of the partitioning element. Another
one that's the boundary between the keys that are equal of partitioning elements
and the one that is greater. And then in the middle are all the equal keys and
that's what we'd like to arrange. Now until the 1990s, conventional wisdom among
people implementing system was, wasn't worth doing this. But, but actually it's a
problem that Edsger Dijkstra had proposed in the 70s as an example of, of
programming problem. He was really interested in analyzing correctness of
programs and showing that this how you could convince yourself that this program
was operating as expected. But in the 1990s we figured out that really this was
going to be an effective way to sort. And this was taking a look at the Qsort that a
user found was broken and, and now, this method is incorporated into some plenty of
system sorts. So let's take a look at how it works with the demo its more
complicated than standard Quicksort partitioning. A bit more complicated
because there's more to do. So now we have our i pointer which is right to the left
of stuff we haven't seen ye t and then, we have two other pointers that maintain,
maintain these boundaries everything to the right of gt is known to be greater
than partitioning element. Everything to the left of lt is known to be less and
between lt and i is known to be equal. And all the method does is maintain this in
variants so let's do an example or two and see how that works. So we increment i and
then figure out what to do. So, now it's less than the partitioning element. So,