0:03

So, Huffman coding is an optimal prefix-free code. Can we have a better data compression algorithm? As you might suspect, the answer is yes, and that's what we're going to look at next: the LZW algorithm, named after Abraham Lempel and Jacob Ziv; the W is for Terry Welch, who got involved later. The way we improve on Huffman coding is by changing the rules under which it operates, and we've already changed the rules once, so let's take a high-level view.

One idea for codes is that, for all text we ever encounter, we use the same code, the same model of the text. Everybody has that model, so we don't have to transmit it. The problem is that one text might have a lot of As and another might have a lot of Zs; they have different statistical properties, so the code is not going to be optimal for any given text. Still, over the preponderance of texts we see, this can be effective. That's really what ASCII is.

For Huffman, we changed that rule: we use different encodings depending on what the text is. We look at the text, figure out the frequencies, and then create a code that works for that text. Now we have to transmit the code; that's how Huffman's code works. For every text, we've got a code suited to that text.

What we're going to look at now is another change in the rules, called an adaptive model, where we read the text one character at a time, but we use the text to update the model and learn about it as we read, and we assume the decoder does the same thing. This gives more accurate modeling and better compression throughout the text. It's a very effective method, so let's look at how it works, through an example, using hex values. In LZW encoding we start with our input, which is characters, say ASCII characters, and we keep a dictionary of code words. The code words are going to be fixed length.

We'll use eight bits, or two hex characters, as an output, and we'll start out with just the ASCII value for each of the characters we're going to use in the message. That's the starting point.

So this is going to be compression for this string. At the beginning we just pick off characters and output their value from the table, but every character that we read gives us a little new information. For example, at this point we've seen AB, and what we're going to do is say: okay, we've seen AB; maybe it occurs again in the text, and if it does, then since we've seen it, we know what it is. We'll give it the value 81 and maintain this table, and if we see AB again, we'll use those eight bits. So we put out the B, but we remember the fact that we saw AB. Similarly, when we see BR, that becomes an 8-bit code word that we can put out later. The idea is that the decoder, the expander, can do the same thing: we don't actually have to transmit this table, because we know the expander will build the same table.

So now we see the R; that's BR, which gets encoded as 82, but the R itself is 52, so that's what we put out. Now we see the A, so RA is going to be 83 if we see it again, but the A itself is just 41. Then AC is going to be 84 and the C is 43; CA is going to be 85 and then the A is 41; AD is going to be 86 and the D is 44. What we're doing is building up a table that is a model of what the string contains, and it will allow us to get better compression as we get later into the message. So now we're reading the A after the D, and DA is going to be 87. But now we see that the next letter is B; we look in our table for AB, and it's there. So we can put out 81 instead of just putting that A out there: at this point, when we look at the B, we know that we had AB, we look that up in the table, and we just put out 81.

Similarly, when we look at the R, our previous characters were AB, so now we code ABR with 88. Then RA we can look up in the table and put out 83, and now we remember that RAB is going to be 89. What happens next? We look at our next characters for the longest prefix that we can match in the table, and that's the code word we put out. In this case we have BR, which is in there as 82, and the next character is A, so we put BRA in the table. Now, for what remains to be encoded, we look for the longest prefix that we know a code for, and in this case it's ABR, which is 88. So the previous characters in the text built a model that allows us to encode the text more effectively: that's 88 for ABR, and then all that's left is the last A.

For this small example the compression is not great, but the idea is still there: there's definitely some savings, because the input was 8-bit ASCII code words and these output code words are also eight bits, and the key thing is that we don't have to transmit the table, because we can assume the expander rebuilds the table the same way we built it. That's the basic idea behind LZW compression. We also have a stop character at the end that says it's the end of the message.

So, this is the summary of LZW compression. We create a symbol table that associates fixed-length code words with string keys, and we initialize it with code words for the single-character keys. When it's time to encode part of the input, we look for the longest string in the symbol table that's a prefix of the part of the input we haven't yet encoded; that string, the longest one, is the best we can do, and we have a fixed-length code word for it.

Then we look ahead to the next character in the input and add a new entry to the symbol table with that one extra character. That's LZW compression.
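The steps in that summary can be sketched in Python (a hypothetical sketch, not the book's implementation; it follows the lecture's conventions that single ASCII characters are their own code words, 80 hex is the stop code, and new entries start at 81 hex — a real implementation would widen or reset the codes once 8 bits run out):

```python
def lzw_compress(text):
    # initialize the table with a code word for each single ASCII character;
    # 0x80 is reserved as the stop code, so new entries start at 0x81
    table = {chr(i): i for i in range(0x80)}
    next_code = 0x81
    out, i = [], 0
    while i < len(text):
        # find the longest prefix of the remaining input that is in the table
        s = text[i]
        while i + len(s) < len(text) and text[i:i + len(s) + 1] in table:
            s = text[i:i + len(s) + 1]
        out.append(table[s])
        # look ahead one character and add that longer string as a new entry
        if i + len(s) < len(text):
            table[text[i:i + len(s) + 1]] = next_code
            next_code += 1
        i += len(s)  # scan past the matched prefix
    out.append(0x80)  # stop code
    return out
```

Running it on the string from the example, ABRACADABRABRABRA, produces exactly the code words traced above: 41 42 52 41 43 41 44 81 83 82 88 41 80.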

One question that comes up is how we represent that code table in code, and actually the answer is really simple: we use a trie, because what a trie can do for us, if you remember from when we looked at tries, is support the longest-prefix-match operation. The trie represents all the prefixes that we have seen so far in the message, and it makes the lookup easy: if the next four characters in the text are ABRA, then we have the code word for it. For anything that we've seen in the text, we've got a fixed-length code word. As the trie gets bigger, and we can move further down into it, we get better compression, because we're using the same-length code word to compress more letters. So here's the implementation of LZW

compression, and it's a very simple implementation for such a sophisticated algorithm, really. We're actually going to use a TST so that we don't have to worry about the extra space. The first thing we do is initialize the TST with a code word for each of the single characters: the alphabet is radix R, so there are R different letters, and we put an entry into the trie for each letter along with its encoding. We only showed a few letters in the example, but in general we assign the code word i to the ith letter. Then the rest of the trie is built from the input string.

The very first thing we do is find the longest prefix of the input string in the symbol table; that longestPrefixOf method just marches down the trie, eating the characters of the input string one at a time, until it gets to the bottom, where it has a code word. The symbol table method actually gives us back the string, and then we can use that string to get its associated value, the code word, out of the symbol table in a second call. Then we take the length of that longest prefix match, add one more character to it, the next character in the input, and add that new entry to the symbol table. Then we scan past the matched prefix in the input. So that's the entire compression implementation.
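The longestPrefixOf operation that this loop relies on can be sketched with a plain dictionary-based trie (a hypothetical sketch; the lecture's actual implementation uses a TST to keep the space down):

```python
class TrieST:
    """Bare-bones trie sketch (not the space-efficient TST the lecture uses)."""

    def __init__(self):
        self.root = {"val": None, "next": {}}

    def put(self, key, val):
        # walk down the trie, creating nodes as needed, and store the value
        node = self.root
        for c in key:
            node = node["next"].setdefault(c, {"val": None, "next": {}})
        node["val"] = val

    def longest_prefix_of(self, query):
        # march down the trie one character at a time, remembering the
        # last prefix along the way that has a code word attached
        node, best = self.root, ""
        for i, c in enumerate(query):
            if c not in node["next"]:
                break
            node = node["next"][c]
            if node["val"] is not None:
                best = query[:i + 1]
        return best
```

For instance, after putting the single letters plus AB and ABR into the trie, the longest prefix of ABRACADABRA that has a code word is ABR.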

Â 14:32

Expansion builds the same table, one step behind. We see 41 and we get an A, and we put RA in the table; 43, we put out a C and put AC in the table; 41, we put out an A and put CA in the table; 44, the D, and AD goes in the table. And now 81: we look it up in our table and we have AB, and once we have seen the AB, then we know the compression would have put DA in the table. That's a little bit tricky, because the expansion is kind of one step behind the compression: it's got to put out the characters before it knows the next table entry, and that does lead to a slightly tricky situation that we'll talk about. So now 83 is going to be RA, and once we put out the RA, then we know that the compression would have put ABR in the table. Now 82 is BR, and again we know compression would have put RAB in there; 88 is ABR, and compression would have put BRA in there; 41 is an A, and once it put out the A, compression would have put ABRA in there. And then 80 is the stop character. So that's expansion: just building the table the same way the compression would have, and then using the table to figure out the string associated with each fixed-length code word.

So this is a summary of expansion, which is pretty much the same as compression, except in reverse. We create a symbol table that associates string values with fixed-length keys, and we put in the single-character values. We read a key, find its string value, and write it out. Then we update the symbol table: the new entry is the string we wrote out before, plus the first character of the string we just wrote out. To represent the code table this time, we can just use an array, because our keys are fixed length and we assign them one at a time. We store the strings in an array and use the fixed-length key values to index into the array and get the right string. So that part's easier.
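That array-indexed expander can be sketched like this (a hypothetical sketch following the lecture's conventions — codes below 80 hex are single ASCII characters, 80 hex is the stop code — and ignoring the tricky case the lecture turns to next):

```python
def lzw_expand(codes):
    # the code table is just an array indexed by code value;
    # index 0x80 is a placeholder for the stop code
    table = [chr(i) for i in range(0x81)]
    prev, out = None, []
    for code in codes:
        if code == 0x80:  # stop code
            break
        s = table[code]
        out.append(s)
        if prev is not None:
            # next entry: previous string plus first char of current string
            table.append(prev + s[0])
        prev = s
    return "".join(out)
```

Feeding it the code words from the compression trace, 41 42 52 41 43 41 44 81 83 82 88 41 80, recovers ABRACADABRABRABRA without the table ever being transmitted.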

So now there's a tricky case, and really everyone needs to work through this case a few times to understand what's going on; it's worth doing once even in lecture. Let's look at the case where we have the input ABABABA and just follow the algorithm through. We get an A, we look it up, and it's 41, so that's what we put out. Then we look at the next character; it's a B, so we remember AB as 81 in our code word table. Then we read the B, we look it up, and it's 42; the next character is an A, so BA is going to be 82. The next characters are AB; we look that up and we have 81, so we can put out the 81. Now we look ahead, and the next character is an A, so we make a code for ABA: that's going to be 83. Now the rest of the string to be encoded is ABA, and our longest prefix match is ABA, so we put out 83, and that's the end of the string, so we put out the 80 that marks the end. That's compression for this string, working the same way as in the other example.

Now let's look at expansion for this case. It starts out the same way. We see a 41 and we look it up in our table (again, this can just be an indexed array): it's an A, and that's our starting point. Then 42 is going to be a B, and looking back, we just put out an A and a B, so our compression algorithm would have encoded AB as 81; we know that. So when we see 81, we've got AB, and we put that out, and the next entry in our table is going to be BA, because that's what our compression would have put there. But now we've got a problem: the next code word that we need to expand is 83, but the key that has value 83 is not in the symbol table yet.

So that's definitely a tricky case. Let's go through it again: at the time that we read the 83, we need to know which string has value 83, before it gets into the symbol table. But we do know something about it: the entry for 83 is the previous string, AB, followed by one look-ahead character, and that look-ahead character is the first character of 83's own string, which is the first character of AB, namely A. So 83 must be ABA. The book has code you can examine that tests for this case, and it's definitely a tricky case that has to be handled in LZW expansion: we expand 83 at the same time that we put it in the symbol table. Okay.
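Handling that case amounts to one extra branch in an expander sketch (hypothetical code, mirroring the reasoning just described: a not-yet-defined code must stand for the previous string plus that string's own first character):

```python
def lzw_expand_tricky(codes):
    # array code table: codes 0x00-0x7F are single ASCII characters,
    # index 0x80 is a placeholder for the stop code
    table = [chr(i) for i in range(0x81)]
    prev, out = None, []
    for code in codes:
        if code == 0x80:  # stop code
            break
        if code < len(table):
            s = table[code]
        else:
            # tricky case: this code isn't in the table yet, so its string
            # must be the previous string plus its own first character
            s = prev + prev[0]
        out.append(s)
        if prev is not None:
            table.append(prev + s[0])
        prev = s
    return "".join(out)
```

On the troublesome input above, the code words 41 42 81 83 80 correctly expand back to ABABABA: when 83 arrives it is not yet defined, so the expander builds ABA on the fly and enters it in the table at the same time.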

There are all different kinds of details for LZW; we've just given one version. Sometimes people find it effective to clear out the symbol table after a while: how big do we make the symbol table, and how big do we let it get? Maybe it's not worthwhile to keep older parts of the message; the table should adapt to the piece of the message at hand. Many variations like that have been developed. For example, what GIF does is, when the symbol table is full, just throw it away and start over. Unix compress keeps a measure of how well it's doing and throws the table away when it's not being effective. And there are many other variations: why not put in even longer substrings? Why just one character at a time, why not two, and so forth? Many variations like that have been developed.

Actually, in the real world, the development of this technology was influenced by patent law. There's a version called LZ77 and another version called LZW, because the people involved worked for a company, and the algorithm was patented: for a while you couldn't use LZW unless you paid the patent holder. But that patent expired in 2003, and then use started to go up again. So there are lots and lots of effective methods that are variants of LZW, and really, to do a good job, you need to

Â 23:01

include Huffman coding for the characters, or run-length encoding; really, it's combinations of these basic methods we've talked about that are most effective. So you see this technology all over the computational infrastructure that you use every day. At the time we were talking about tries, they seemed a bit abstract, but tries are definitely part of the algorithmic infrastructure, and they're a really fine example of a clear, clean, elegant data structure and algorithmic idea being used to great effect in the real world.

And there are plenty of people still doing this research; even in lossless data compression, there are still ideas being developed. There's a famous benchmark set of texts: if you think you have a good new compression algorithm, you can run it on this benchmark. ASCII is really seven bits per character — it's eight bits, but one of the bits is redundant, so you immediately get down to seven. These are the kinds of compression ratios that people have achieved with various methods, mostly variants of LZW, down to less than half of what you get with ASCII. And it continues to drive down: there was a completely new method, called the Burrows-Wheeler method, developed in the 90s, that took a big jump down, and a few more methods have continued to improve on it. So there's still room for improvement in lossless data compression, and it's a really fine example of the power of good algorithmic technology. So here's our quick summary on

data compression. We considered two classic, fundamental algorithms. The first one, Huffman encoding, represents fixed-length symbols with variable-length codes: the prefix-free code uses a smaller number of bits for the most frequently used symbols. The other way around, the Lempel-Ziv method takes variable-length symbols and uses fixed-length codes. It's interesting to think about those two ends of the spectrum. There are plenty of compression algorithms out there that don't try to do exact compression, but just get an approximation, using FFTs and wavelets and many other mathematical techniques; that's appropriate for things like pictures and movies, where maybe you can afford to miss a few bits. But lossless compression is still very important: when you download an application, which is machine code, onto your computer, you can't afford to have one of the bits be wrong. That's why lossless compression will always be with us.

There's a theory of compression, too. There's a theoretical limit, a measure called the entropy, the Shannon entropy of a text, which is a very fundamental way to indicate how much information there is in a text as a function of its character frequencies: it's minus the sum, over all characters, of the frequency times the log of the frequency. With these methods we can get very close to the entropy in some theoretical settings, but actually, in practice, the most effective methods use tricks and techniques with extra knowledge about the data being compressed to really get the most effective results. As I explained, if you're compressing a page like this one, you'd better use a method that does really well on huge amounts of white space, for example. So that's LZW compression, and our last compression algorithm.
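As a footnote, that entropy formula can be computed directly for a text under the character-frequency model (a small illustrative sketch; the base-2 log gives bits per character):

```python
from collections import Counter
from math import log2

def shannon_entropy(text):
    # H = -sum over characters c of p(c) * log2 p(c), a lower bound on
    # bits per character for any lossless code under this model
    n = len(text)
    return -sum((k / n) * log2(k / n) for k in Counter(text).values())
```

For the string ABRACADABRA this comes out to about 2.04 bits per character, versus the 8 bits per character of its ASCII encoding; a text of one repeated character has entropy zero.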
