0:03

Next, we'll look at Huffman encoding. This is a classic algorithm that is still widely and effectively used today. It's definitely on the list of algorithms that everyone should know, because it's quite ingenious.

We start with the idea of variable-length codes. So far, we've talked about codes where every character is represented with the same number of bits, but there's no reason to do that.

For example, look at Morse code. The way dots and dashes are assigned to letters is that frequently used letters get short codes: an E, for example, is just a single dot. Letters that are used less frequently, like Q, get longer codes, with more dashes and more characters. That's a compression idea: use fewer code characters, or fewer bits, for frequently used characters.

Now, there's an issue with this, and that is that it can be ambiguous. This message, which everybody knows, looks like SOS: three dots, followed by three dashes, followed by three dots. But actually it could be any one of several other things. V is dot, dot, dot, dash, and seven is dash, dash, dot, dot, dot, so it could be V7. Or it could be either of the other readings; there are lots of different things this sequence could mean. So Morse code isn't actually just dots and dashes: it has a little space between the code words to avoid this ambiguity, which is caused by the fact that one code word can be a prefix of another, and there's no way to tell where one ends without separating the code words. That's the way the problem is solved in Morse code, but it's not necessarily the most efficient way to solve it, and that's what Huffman codes address. Another problem with Morse code: don't try to transmit a lot of numbers with it, because the codes for the digits are pretty long, so messages that involve a lot of numbers get much longer representations.

But there's a really elegant way to avoid ambiguity, and that is to adopt a rule that no code word is going to be a prefix of another one. Fixed-length codes do that: if every code word is the same number of bits and every character gets different bits, then no code word is a prefix of another; they're just all different. Another thing you can do is have a special stop character, like the space in Morse code, to end each code word. That's really what the space is doing: the code words are all different, and they all end with a special character, so none can be a prefix of another. But there's also an easy way to just make sure that you use a code with this prefix-free property.

So, for example, in this code no code word is a prefix of another, and that means that when you have the bits, there's a unique way to decode them. Here, the compressed bitstream starts with a zero, and the only way that can happen is if the first letter is an A. When you're done with the A, you've got four ones, and that means it's got to be a B. Then you get rid of the B, and so forth: three ones and a zero has to be an R, and so on. Since no code word is a prefix of another, you can just read off the code words from the bit string without any special stop character or any inefficiency caused by one. All we have is the bits, and we're able to decode them uniquely.

And there are numerous prefix-free codes. For example, here's another prefix-free code for the same five characters, and it actually uses fewer bits. So here's the idea of Huffman encoding: within these rules, where we must have a prefix-free code, and given a particular message, what's amazing about Huffman encoding is that it finds the code that uses the fewest bits for that message. That's why it's so effective.

One interesting thing about prefix-free codes is that they're really easy to represent with a trie, and the idea is very simple. We put the characters in the leaves. This is a binary trie, so there are only two links per node, and we imagine that every left link is labeled zero and every right link is labeled one. Then the code word for a character is just the path from the root to its leaf: say we go right, left, left, that's one, zero, zero, which is D. Those bits are not stored in the trie; we just keep track of them implicitly. If we go left, it's a zero; if we go right, it's a one. So a trie for a prefix-free code uniquely determines the code, and that means that, just working with the trie, we can decompress a bit string by starting at the root, reading the bits, going where they take us, and when we get to a leaf, we've got a character.

So that's the trie corresponding to the first code, and that's the trie corresponding to the second one. This bitstream starts out with one, one, so starting at the root we go one, one, and the first letter is A. The next bits are zero, zero, so we go zero, zero, and we come to B, and so forth. It's a simple representation of a prefix-free code that makes for easy decoding: given the bit string and the trie, it's pretty easy to decode. But sorry, that was expansion; how do we do compression?
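As a concrete aside, both directions can be done with just a code table. Here's a minimal Java sketch using an assumed six-symbol prefix-free code (the codewords are illustrative, not necessarily the slide's): encoding looks each character up, and decoding accumulates bits until they match a codeword, which is unambiguous precisely because the code is prefix-free.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixFreeCode {
    // A hypothetical prefix-free code for illustration: no codeword below
    // is a prefix of any other, so a bit string decodes in exactly one way.
    static final Map<String, Character> DECODE = new LinkedHashMap<>();
    static final Map<Character, String> ENCODE = new LinkedHashMap<>();
    static {
        String[][] table = {{"0","A"},{"1111","B"},{"110","C"},
                            {"100","D"},{"1110","R"},{"101","!"}};
        for (String[] row : table) {
            DECODE.put(row[0], row[1].charAt(0));
            ENCODE.put(row[1].charAt(0), row[0]);
        }
    }

    // Compression: look each character up in the code table.
    static String encode(String msg) {
        StringBuilder bits = new StringBuilder();
        for (char c : msg.toCharArray()) bits.append(ENCODE.get(c));
        return bits.toString();
    }

    // Expansion: accumulate bits until they match a codeword; since the
    // code is prefix-free, the first match is the only possible one.
    static String decode(String bits) {
        StringBuilder out = new StringBuilder(), cur = new StringBuilder();
        for (char b : bits.toCharArray()) {
            cur.append(b);
            Character c = DECODE.get(cur.toString());
            if (c != null) { out.append(c); cur.setLength(0); }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bits = encode("ABRACADABRA!");
        System.out.println(bits.length() + " bits");   // prints 30 bits
        System.out.println(decode(bits));              // prints ABRACADABRA!
    }
}
```

The greedy match in `decode` is safe only because no codeword is a prefix of another; with ambiguous Morse-style codes, the same loop could stop too early.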

So, for compression, we can just keep the table and use it. You could also use the trie to figure out the code word for a character by following the path from its leaf up to the root. And for expansion, as I just said, you start at the root, go left on a zero, go right on a one, and it's very easy. We'll look at the code for both of these; but the first question is how to construct the trie.

So let's start with the data type. This is similar to other tree structures we've built before. We have a character, which we don't use for internal nodes, just for leaves; we've got a frequency, which we use while we're building the trie, not while we're expanding; and every node has a left and a right link. The constructor just fills in all the fields. We need a method that tells us whether the current node is a leaf, which is the case when both its links are null. And we need to be able to compare nodes by frequency; we'll see why that's needed later on. That's the data type for the nodes we're going to use to build the tries.

And now let's look at expansion. This is, in code, what I just talked about. We're going to have a method that reads in some of the bit string and converts it to a trie; that's the one clever part of the algorithm. The next thing in the input is the number of characters in the message, so we get the trie, and then we get the number of characters. Then we simply do a for loop over the number of characters. We start at the root; while we're not at a leaf, we read a bit, and if it's zero we go left, and if it's one we go right. As soon as we get to a leaf, we write out the character we find there, go back to the root, and keep going. It's extremely compact code for expanding: given a bit string, the first thing is the trie (we still have to talk about how to get the trie from a bit string), then the number of characters, and then the simple loop that expands according to the trie representation. And that works for any prefix-free code, not just the Huffman code. The running time is obviously linear in the number of bits, because it's just a loop that chews up a bit each time around.

Okay, so again, for any prefix-free code, how are we actually going to write the trie? Well, it's a little clever, but by this time you should be ready for this kind of algorithm. What you can do is traverse the trie in pre-order, and when you come to an internal node, write a zero; when you come to a leaf, write a one followed by the 8-bit character at that leaf. It's a simple pre-order traversal. This is the recursive program to write out the trie: if you're at a leaf, you write a one followed by the 8-bit character at the leaf, and you're done; if you're not at a leaf, you simply write a zero and then recursively do the two sub-tries. That gives a unique representation of the trie that can be used to reconstruct the trie from the bit string.

The idea is that, typically, we're talking about a very long message, while the trie itself is basically just encodings of all the possible characters in the message, and that's going to tend to be small compared to the length of the message. For, say, a genomic sequence, there would be only four characters, and the ones used most frequently would hopefully be encoded with not that many bits. In any case, the size of the trie itself is going to be much smaller than the length of the message.

Reading in the trie requires a little bit of thought, but as we'll see, the code is extremely simple: we just reconstruct the trie from its pre-order traversal. This is the readTrie method that we needed, which reads bits from binary standard input and produces the trie, and it's also recursive. If the bit you read is a one, that means you have to create a leaf, so you read the next eight bits as the character. Otherwise, you recursively read the left sub-trie and read the right sub-trie, and make a new node with those two as its left and right children; the frequency is just initialized to zero.

In internal nodes, we don't use the character. So look at what this code does for this bit string. The first bit is a zero, so it's going to recursively call itself to read the trie on the left. The next bit is a one, so it's going to read the next eight bits to get the A, and it's going to return a new node with that character in it that's a leaf. That's going to be the left subtree of the initial trie, from the first recursive call. Then it'll read the right subtree, which is an internal node, then another internal node, and so on. It takes a little thinking to convince yourself that this works, but it does, and it's very compact and easy code. And again, it works for any prefix-free code: you encode the trie with a pre-order traversal, and with a very small amount of code you can reconstruct the trie from the transmitted bitstream. So this is a compressed representation of a trie structure.
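The expand loop and the trie serialization just described can be sketched together. This is an in-memory version: the lecture's code reads and writes binary standard input/output, so here a String of '0' and '1' characters stands in for the bitstream, and the node type is pared down to just what's needed.

```java
public class TrieCodec {
    // Pared-down trie node: the character is meaningful only at leaves.
    static class Node {
        final char ch; final Node left, right;
        Node(char ch, Node left, Node right) { this.ch = ch; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Pre-order serialization: '1' plus the 8-bit character for a leaf,
    // '0' followed by both sub-tries for an internal node.
    static void writeTrie(Node x, StringBuilder bits) {
        if (x.isLeaf()) {
            bits.append('1');
            String b = Integer.toBinaryString(x.ch);
            bits.append("00000000".substring(b.length())).append(b);  // pad to 8 bits
            return;
        }
        bits.append('0');
        writeTrie(x.left, bits);
        writeTrie(x.right, bits);
    }

    // Reconstruct the trie from the same pre-order bit string.
    // pos is a one-element cursor so recursive calls share the position.
    static Node readTrie(String bits, int[] pos) {
        if (bits.charAt(pos[0]++) == '1') {                           // leaf
            char c = (char) Integer.parseInt(bits.substring(pos[0], pos[0] + 8), 2);
            pos[0] += 8;
            return new Node(c, null, null);
        }
        Node left = readTrie(bits, pos);                              // internal node:
        Node right = readTrie(bits, pos);                             // read both sub-tries
        return new Node('\0', left, right);
    }

    // The expand loop: walk from the root, left on 0 and right on 1,
    // emitting a character at each leaf, n times.
    static String expand(Node root, String bits, int n) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        for (int k = 0; k < n; k++) {
            Node x = root;
            while (!x.isLeaf())
                x = (bits.charAt(i++) == '0') ? x.left : x.right;
            out.append(x.ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An assumed example trie: A on the left; D and B under the right link.
        Node trie = new Node('\0', new Node('A', null, null),
                new Node('\0', new Node('D', null, null), new Node('B', null, null)));
        StringBuilder bits = new StringBuilder();
        writeTrie(trie, bits);
        Node back = readTrie(bits.toString(), new int[]{0});
        System.out.println(expand(back, "010110", 4));   // prints ADBA
    }
}
```

Writing the trie, reading it back, and writing it again produces the identical bit string, which is a handy sanity check for the pre-order encoding.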

So now the question is, there are a lot of prefix-free codes; how do we find the best one? Well, there's an idea called the Shannon-Fano algorithm. The key point is that you've got characters that appear with different frequencies, and you want the ones that appear very frequently to use fewer bits. So the Shannon-Fano algorithm says: divide all the symbols into two sets of roughly equal total frequency, start the code words for the ones in the first set with a zero and the ones in the second set with a one (it's kind of a balanced code), and then recur. So if you have A appearing five times and C appearing one time, that's six for those, and you get six for all of the others. Then you try to encode each half the same way, using zeroes for roughly half the frequency and ones for the other half. To do that, you have to figure out how to divide up the symbols, and you can imagine an algorithm for doing that. But the real problem with this method is that it's not optimal. So that's the kind of situation that Huffman was faced with way back when, coming up with the algorithm.
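To make that concrete, here's one hypothetical rendering of Shannon-Fano in Java. The split rule is an assumption (the lecture doesn't fix one): with symbols sorted by decreasing frequency, the first group takes symbols until adding another would pass half the total, so it won't necessarily reproduce the exact A-and-C grouping from the slide.

```java
import java.util.*;

public class ShannonFano {
    // Assumed split rule: syms is sorted by decreasing frequency; move symbols
    // into the first group until adding another would exceed half the total
    // frequency, then recur, prepending '0' on the left and '1' on the right.
    static void assign(List<Character> syms, Map<Character, Integer> freq,
                       String prefix, Map<Character, String> code) {
        if (syms.size() == 1) {
            code.put(syms.get(0), prefix.isEmpty() ? "0" : prefix);
            return;
        }
        int total = 0;
        for (char c : syms) total += freq.get(c);
        int run = 0, split = 0;
        while (split < syms.size() - 1 && run + freq.get(syms.get(split)) <= total / 2)
            run += freq.get(syms.get(split++));
        if (split == 0) split = 1;                  // both groups must be nonempty
        assign(syms.subList(0, split), freq, prefix + "0", code);
        assign(syms.subList(split, syms.size()), freq, prefix + "1", code);
    }

    // Total bits used to encode the message with a given code.
    static int totalBits(Map<Character, String> code, Map<Character, Integer> freq) {
        int bits = 0;
        for (Map.Entry<Character, String> e : code.entrySet())
            bits += e.getValue().length() * freq.get(e.getKey());
        return bits;
    }

    public static void main(String[] args) {
        // Frequencies for ABRACADABRA!, listed in decreasing order.
        Map<Character, Integer> freq = new LinkedHashMap<>();
        freq.put('A', 5); freq.put('B', 2); freq.put('R', 2);
        freq.put('C', 1); freq.put('D', 1); freq.put('!', 1);
        Map<Character, String> code = new LinkedHashMap<>();
        assign(new ArrayList<>(freq.keySet()), freq, "", code);
        System.out.println(code + " -> " + totalBits(code, freq) + " bits");  // 29 bits
    }
}
```

On the ABRACADABRA! frequencies, this split produces a 29-bit encoding, while an optimal prefix-free code for the same frequencies uses 28 bits: one small illustration of why Shannon-Fano isn't optimal.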

And the best way to explain the Huffman algorithm is through a demo, so that's what we'll do next. Here's a demo of the Huffman algorithm. The first thing we do is count the frequency of each character in the input; that's what the algorithm is based on. In this case, there are five As, two Bs, one C, one D, and so forth. Our goal is to use fewer bits to encode A, and more bits to encode C, D, and the exclamation point (!), for example, and we want a prefix-free code.

It's kind of a bottom-up method. We build one node for each character, and in the node corresponding to each character we keep a weight that's equal to its frequency, and we imagine keeping the nodes in sorted order. That's how the method starts out. Remember, our node representation had an instance variable that allows for storing these frequencies. Then the idea is, given these sub-tries of size one, to combine two of them and merge them into a single trie with the cumulative weight. We find the two that have the smallest weight and create a new internal node for them, whose weight is the sum of the two weights; that's the frequency of all the code characters below it. Then we put that trie back into the mix. Now we have five sub-tries to deal with, and that's the algorithm: select the two tries with minimum weight (in this case, since we're keeping them sorted, the ones on the left) and combine them. Notice that the letter that appears with the highest frequency gets added in late, which means it's going to be nearer the root. Eventually there are only two tries left, so we combine them and finish the algorithm, merging into a single trie. That last merge corresponds to prepending a one to the codes for all the other letters and having a one-bit code for A. We now wind up with a single trie, and that's a demo of Huffman's algorithm in operation.

So the Huffman code is an answer to the question of how to find the best prefix-free code, and it's really very simple to describe. Count the frequency of each character in the input. Start with one node corresponding to each character, with weight given by its frequency. Then, until you have a single trie, select the two tries with minimum weight and merge them into a single trie by creating a new node whose weight is equal to the sum of the frequencies. This algorithm is used in many different and familiar compression applications, from PDF to gzip to MP3 to JPEG.

And it's pretty simple to code up in Java, given the algorithmic tools we've built up. What we're going to do, once the frequencies have been computed, is build the trie from those frequencies. We're going to use a priority queue, our generic priority queue, so we might as well put nodes on it; we implemented a compareTo method, and that's all we need to be able to build a priority queue from data of that type. So we have a priority queue of nodes, and for all the possible character values, if a character appears in the message, that is, if it has a nonzero frequency, we put a new leaf node with the given frequency onto the priority queue. Then the Huffman algorithm is just this: as long as there's more than one node on the priority queue, we pull off the two smallest nodes, create a new node that's their parent (its character value is not used), compute the total frequency as the sum of the two children's, make those two the children of that node, and put it back onto the priority queue. That's it. Take two off, put one on: the queue decreases in size by one each time, so this while loop eventually terminates. When it's done, there's only one node on the priority queue, and that's the root of the trie. It's an extremely simple implementation of Huffman's algorithm using our generic priority queue.
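Here's a runnable sketch of that loop, with java.util.PriorityQueue standing in for the course's MinPQ and the Node type as described earlier; the totalBits helper is an addition for checking the result, computing how many bits the resulting code uses for the message.

```java
import java.util.PriorityQueue;

public class Huffman {
    static class Node implements Comparable<Node> {
        final char ch;       // used only in leaves
        final int freq;      // used only while building the trie
        final Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        public int compareTo(Node that) { return this.freq - that.freq; }
    }

    // Huffman's algorithm: start with one leaf per character that occurs,
    // then repeatedly merge the two tries of smallest weight until one remains.
    static Node buildTrie(int[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (char c = 0; c < freq.length; c++)
            if (freq[c] > 0) pq.add(new Node(c, freq[c], null, null));
        while (pq.size() > 1) {
            Node x = pq.poll();                              // two smallest tries...
            Node y = pq.poll();
            pq.add(new Node('\0', x.freq + y.freq, x, y));   // ...become one parent
        }
        return pq.poll();                                    // root of the single trie
    }

    // Weighted path length: total bits the resulting code uses for the message.
    static int totalBits(Node x, int depth) {
        if (x.isLeaf()) return x.freq * depth;
        return totalBits(x.left, depth + 1) + totalBits(x.right, depth + 1);
    }

    public static void main(String[] args) {
        int[] freq = new int[256];
        for (char c : "ABRACADABRA!".toCharArray()) freq[c]++;
        Node root = buildTrie(freq);
        System.out.println(root.freq + " characters, " + totalBits(root, 0) + " bits");
        // prints 12 characters, 28 bits
    }
}
```

For ABRACADABRA!, the root's frequency is 12 (the message length), and the code uses 28 bits, one fewer than the Shannon-Fano split above.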

You can also implement it by presorting the nodes, but this is a particularly nice way to do it. Now, you do have to prove that it produces an optimal code, in the sense that no prefix-free code uses fewer bits. That proof is in the book; it's not terribly difficult, but we're not going to do it in lecture. Huffman proved it in the 1950s, so there's no point looking for a better prefix-free code, and prefix-free codes are so convenient because you don't need to worry about ambiguity. It's a fine method that's pretty easy to implement.

One disadvantage of Huffman that some other methods don't have is that we need two passes through the data: we have to go through and read all the character frequencies and build the trie, and then go back through the data and encode the file, either by traversing the trie from a leaf up to the root or by using a symbol table to look up the code for each symbol in the message. The running time is pretty clear from the use of the trie: the number of nodes in the trie is proportional to the alphabet size R, so building it takes on the order of R log R, because each priority-queue operation can involve depth log R with a heap implementation; then, for each character, you do some kind of lookup to encode it. An obvious question is, can we do better in terms of running time? Well, stay tuned; we'll look at a different method.

That's Huffman compression.
