Welcome back. One way to acknowledge that there is correlation in the data is to perform a transformation which decorrelates them. Then the transform coefficients are uncorrelated, and each of them conveys unique information about the image. Some of them convey a significant amount of information, as measured, for example, by their variance or their contribution to the total energy of the image, while others have contributions close to zero. The latter ones can therefore be ignored or heavily quantized when compression is performed. This is the topic of this segment. We first discuss the decomposition of an image using an orthonormal basis. This is a straightforward extension of such a decomposition when we deal with vectors. Now, if we place the requirement that the resulting coefficients of this decomposition are uncorrelated, then we end up with the Karhunen–Loève transform. This is the optimal transformation for our purposes, but it has the major disadvantage that its bases are not fixed; they are functions of the autocorrelation of the image, that is, they are different for each image we are encoding. This makes the K-L transform impractical, and instead the transform of choice is the Discrete Cosine Transform, or DCT. We show with a simple experiment the usefulness of decorrelating transforms such as the DCT in deciding which part of the data to keep and which to discard. So, let us proceed with this interesting material.

We cover in this segment transform-based coding. This is a useful and popular approach, which has been adopted by most if not all image and video compression standards. We show here the three building blocks of any encoder. As mentioned multiple times, data are compressible because they are correlated. With predictive approaches such as DPCM, this correlation is taken into account directly; for example, we solve for the linear prediction coefficients by utilizing values of the autocorrelation function of the image. With transform-based coding, we take the transform of the data of an image or an image block. The objective is to obtain transform coefficients which ideally are statistically independent, so that the contribution of each coefficient towards the original image is independent of the contribution of the remaining coefficients. We then look at the relative importance of the contribution of these coefficients towards the original image. Those coefficients that contribute more we treat gently: we quantize them with a fine quantizer. Those coefficients that contribute considerably less are quantized coarsely, or they are discarded. So this is the purpose of the quantizer here. We say that the quantizer has a spectral shape: it is uniform, but the step varies depending on the frequency of each coefficient. The quantized transform coefficients are then entropy encoded using any of the techniques we covered last week, such as Huffman and arithmetic coding. It is this bit stream that is sent to the decoder.

Now, the decoder undoes the steps of the encoder. The binary bit stream is entropy decoded, and the resulting symbols, which are the quantized transform coefficients, are inverse quantized. "Inverse quantization" is actually a misnomer, because quantization is an irreversible process. What it refers to is the fact that when quantization is performed, we divide the value of the coefficient by the step of the uniform quantizer, and what is sent to the decoder is the quotient of this division. Therefore, inverse quantization or dequantization refers to the fact that this quotient is multiplied back by the step of the uniform quantizer to obtain an approximation of the original value of the coefficient. And then finally, we perform the inverse transformation to obtain the reconstructed image block.
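To make this quantization and dequantization step concrete, here is a minimal sketch in Python; the function names and the step value are my own illustrative choices, not taken from any standard.

```python
import numpy as np

def quantize(coeff, step):
    """Divide by the quantizer step and round: this is the lossy,
    irreversible part -- the fractional remainder is thrown away."""
    return int(np.round(coeff / step))

def dequantize(level, step):
    """So-called 'inverse quantization' at the decoder: multiply the
    received quotient back by the step to approximate the coefficient."""
    return level * step

coeff = 137.6                     # a hypothetical transform coefficient
step = 16.0                       # quantizer step for this frequency
level = quantize(coeff, step)     # 9 -- this integer is entropy encoded
approx = dequantize(level, step)  # 144.0 -- close to, but not, 137.6
```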
So, let us look first at the characteristics of such transformations. What are some of the important characteristics? They should decorrelate the data, so that in principle the transform coefficients are statistically independent. At the same time, they should perform energy compaction, which means that a small number of coefficients should represent the majority of the energy in the image. The basis that forms such a transformation should be independent of the image we try to encode; otherwise, this basis would need to be transmitted to the decoder in order to perform the decoding. And finally, as always, we would like to have fast implementations of such transforms, since they are clearly part of both the encoder and the decoder.

We show here a comparison of such transformations. Plotted is the log variance of each coefficient on the vertical axis against the coefficient number on the horizontal axis. The image is broken into sub-images, and for each sub-image the transform is computed, so as a result we have another image with the same number of sub-images. We then take, say, the (0,0) coefficient from each block and find its variance; next we take the coefficient at the next location across all blocks and find its variance, and so on. Then we order the coefficients from largest variance to smallest, as is clear from the plot. The idea is that coefficients with larger variance contribute more towards the total energy of the image. So, a transformation whose variances decay faster is a more desirable one, in the sense that fewer coefficients represent a larger percentage of the energy.

We compare here the so-called K-L, or Karhunen–Loève, transform, which results in uncorrelated coefficients. However, the bases in this case are functions of the image, more specifically of the autocorrelation function of the image. We also show the performance of the DCT, the Discrete Cosine Transform, the Fourier transform, and the Hadamard transform. In comparing the Fourier transform and the Discrete Cosine Transform, a clear distinguishing factor is that the Fourier transform coefficients are complex numbers, while the DCT coefficients are real numbers, and we much prefer working with real numbers over complex ones. So the DCT has this distinct advantage over the Fourier transform. But more than that, the DCT is the transformation of choice; it is used in many of the image and video compression standards.

Summarizing what I already said on the previous slide: the KL transform results in uncorrelated transform coefficients, but its main disadvantage is that the basis functions of the KLT are functions of the image that needs to be encoded. The DCT is the transformation of choice. It is close to the KLT for typical images, and there are in fact theoretical results showing that, for a specific model of the image, the DCT and the KLT coincide. Its basis functions are cosines, independent of the image, and therefore do not need to be transmitted to the decoder.
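As a rough sketch of how the variance-decay comparison described above can be produced for the DCT, consider the following Python fragment; it assumes a grayscale image stored as a NumPy array, and the helper name is mine.

```python
import numpy as np
from scipy.fft import dctn

def sorted_coefficient_variances(image, block=8):
    """Collect each (u, v) DCT coefficient across all blocks, compute its
    variance over the blocks, and return the variances sorted largest first."""
    h = image.shape[0] - image.shape[0] % block   # crop to whole blocks
    w = image.shape[1] - image.shape[1] % block
    blocks = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = image[i:i + block, j:j + block].astype(float)
            blocks.append(dctn(patch, norm='ortho'))
    variances = np.stack(blocks).var(axis=0).ravel()
    return np.sort(variances)[::-1]

# Plotting np.log(sorted_coefficient_variances(img)) gives one of the decay
# curves in the comparison; the faster it decays, the better the compaction.
```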
Since the DCT is widely used in standards, such as JPEG, the family of H.26x video compression standards, and the family of MPEG standards, there are also efficient implementations of it: embedded implementations, or even dedicated chips that perform the discrete cosine transform.

Before we proceed, let us establish some notation. Assume we have two N-by-N matrices; for this segment they are denoted like this, the matrix phi and the matrix gamma. Their inner product is formed by multiplying the corresponding elements one by one and adding the products up; the dot or inner product is clearly a scalar. Since in general we deal with complex matrices, we take the complex conjugate of each and every element of the second matrix. Then, if we have a family of matrices parametrized by u and v, these matrices are orthonormal if their inner product equals zero whenever (u,v) is different from (r,s), and equals one when u = r and v = s. So the inner product of a matrix from this family with itself is one, which is where the "normal" part of "orthonormal" comes from, and the inner product of one matrix with any other matrix in the family is equal to zero.

Given a family of orthonormal matrices phi(u,v), as just defined on the previous slide, any image f can be decomposed in terms of these orthonormal matrices. So I can write f as shown here: the image f equals a coefficient F(0,0) times phi(0,0), which is a matrix, plus F(0,1) times phi(0,1), and so on, all the way to F(N-1,N-1) times phi(N-1,N-1). It is a weighted sum of the basis matrices, where the F's are the weights. How does one find the coefficients F(u,v)? We perform the projection of the image f onto each and every one of the bases phi; this projection, the inner product we defined, gives rise to the corresponding coefficient.

We have seen this decomposition before with vectors, rather than matrices. We are all familiar with the fact that if I have an orthonormal basis, vectors e1 and e2, and I am given a vector x, I project x onto each of them; call the projections alpha1 and alpha2. Then simply, vector x equals alpha1 times e1 plus alpha2 times e2. So what we have here is the extension of that decomposition, which I assume you are familiar with, to the case where the bases are matrices rather than vectors.

Here is a toy example. We have this image of a smiley face, and as bases we use Hadamard bases; I will say a few more things about them later on. By projecting this image of the smiley face onto each one of these bases, we obtain the transform coefficients; only four of them are shown here. Again, I can decompose this image as the weighted sum of these basis functions with their corresponding weights. So the main idea is that if these bases, the Hadamard bases, are known to both the encoder and the decoder, all I need to send to the decoder are the projection coefficients, and the decoder, utilizing these coefficients, can reconstruct the image using the expression shown here.
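This projection-and-reconstruction idea can be verified numerically in a few lines. Here is a minimal sketch using a separable 4-by-4 Hadamard basis; the variable and function names are my own.

```python
import numpy as np
from scipy.linalg import hadamard

def inner(A, B):
    """Inner product of two matrices: sum of elementwise products,
    conjugating the second argument to cover the complex case."""
    return np.sum(A * np.conj(B))

n = 4
H = hadamard(n) / np.sqrt(n)   # rows form an orthonormal 1-D Hadamard basis
# Separable 2-D basis matrices phi(u, v), one per coefficient location.
phi = {(u, v): np.outer(H[u], H[v]) for u in range(n) for v in range(n)}

f = np.arange(n * n, dtype=float).reshape(n, n)   # a toy 4x4 "image"

# Analysis: project f onto every basis matrix to get the coefficients F(u, v).
F = {uv: inner(f, phi[uv]) for uv in phi}

# Synthesis: the weighted sum of the basis matrices recovers f exactly.
f_rec = sum(F[uv] * phi[uv] for uv in phi)
assert np.allclose(f, f_rec)
```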
I want to mention the KL transform here for completeness. Our objective is to find a family of orthonormal bases, call them phi(u,v), so that the resulting coefficients F(u,v), obtained by projecting the image onto these orthonormal bases, are uncorrelated. According to the KL theorem, if I have an image f(m,n) and R(m,n,p,q) is its autocorrelation function, then I can find the family of orthonormal matrices that results in uncorrelated coefficients by solving this equation. What the equation tells us is that these orthonormal bases are the eigenfunctions of the autocorrelation function, and the eigenvalue lambda(u,v) is simply the expected value of the squared magnitude of the corresponding coefficient. So we have a specific procedure for obtaining the orthonormal matrices we are after. However, they depend on the autocorrelation of the image, and although I can compute these phis at the encoder, the decoder has no way of performing the same computation; therefore the phi(u,v) would have to be sent to the decoder. So the KLT is the mathematically optimal transform for our purposes, but it is not a practical one, because having to send the bases defeats the purpose of compression. In the classroom version of this lesson I prove all of these relationships; however, they are clearly beyond the scope of this particular class.

We see here the 64 eight-by-eight basis functions of the discrete cosine transform, along with the interpretation of the colors. In the first row, we have a one-dimensional cosine that increases in frequency horizontally while staying constant vertically. In the first column, we have a one-dimensional cosine that increases in frequency vertically. For the rest of the bases, we have two-dimensional cosines whose frequency increases both horizontally and vertically; down here is the basis with the highest frequency in both dimensions. I showed exactly this same plot back in week two, when we talked about complex exponential signals, and I made the comment at that point that we would encounter these basis functions when we talked about compression. Now is the time to do so.

As I have explained already, given an eight-by-eight image block, or image patch, we find its projection onto all 64 of these bases. If, for example, I find the projection, the inner product, between this basis and the eight-by-eight patch, a number results, which we denote by F(u,v); this is the coefficient. Clearly, the projection of the patch onto this first basis, shown in black, with all of its values equal to one, gives us the sum of the intensities of the patch; dividing by 64 gives the average value of the patch. So instead of 64 intensity values, I have 64 such coefficients; these, by the way, are real numbers, since we are talking about the DCT.

The advantage of working with the F(u,v)'s instead of the f(m,n)'s, as we denoted the pixels earlier, is the following. If somebody asks us to keep, say, the 15% of the pixel values that are most important, we have no way of knowing which 15% of the 64 pixel values are more important than the others. However, when we look at the DCT coefficients, it is rather straightforward to keep the 15% of the 64 that have the largest values. These coefficients, again, are in principle uncorrelated, and therefore, by keeping the ones with the largest values, we do a very good job both in representing the original patch and in preserving most of its energy.
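A minimal sketch of this keep-the-largest idea on a single 8-by-8 patch is shown below; the smooth test patch and the function name are my own, and SciPy's dctn/idctn are used for the 2-D DCT.

```python
import numpy as np
from scipy.fft import dctn, idctn

def keep_largest(patch, k):
    """2-D DCT an 8x8 patch, zero all but the k largest-magnitude
    coefficients, and reconstruct -- illustrating energy compaction."""
    F = dctn(patch.astype(float), norm='ortho')
    kth = np.sort(np.abs(F).ravel())[-k]          # k-th largest magnitude
    F_kept = np.where(np.abs(F) >= kth, F, 0.0)
    return idctn(F_kept, norm='ortho')

patch = 10.0 * np.add.outer(np.arange(8), np.arange(8))  # smooth ramp patch
approx = keep_largest(patch, k=10)   # keep roughly 15% of the 64 coefficients
print(np.abs(patch - approx).max())  # small error for such a smooth patch
```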
Here we show the 64 eight-by-eight Hadamard basis functions. The interpretation of the colors involves only two values, +1 and -1; otherwise the description is similar to that of the DCT. I should mention that in both cases these bases are orthonormal: in other words, the inner product of one basis with any other is 0, but with itself it is equal to 1, if it is normalized appropriately. So the story is the same. Given an eight-by-eight image patch, we can find its projections onto this basis, the Hadamard transform coefficients, which again are less correlated than the original pixel values and therefore can serve our purposes for compressing the image.

In order to demonstrate experimentally the effectiveness, the power, of this decorrelating property of the transforms, we conduct here a very simple experiment. Given this image, we break it into eight-by-eight blocks, and out of each of these blocks we are allowed to keep only 8 pixels or 8 coefficients; the rest we must throw away. Trying to decide in the spatial domain which 8 out of the 64 pixels are most important is very difficult; we have no means of doing it. However, if we go to the frequency domain, we can keep per block the eight largest coefficients. We do so by defining a threshold and keeping only the coefficients in each block that are above it, adjusting the threshold so that exactly eight out of the 64 exceed it. Here is the image: we keep 8 coefficients, set the rest equal to 0, and we see that the result is very close to the original.

We repeat the experiment when the locations of the 8 coefficients are instead based on a zonal description: if this is an eight-by-eight block in the cosine domain, we keep the 8 coefficients closest to the DC coefficient, that is, the 8 low-frequency coefficients. Clearly, thresholding by and large should give better results than the zonal approximation, simply because there might be a large-valued coefficient outside the zone, larger than any of the ones kept inside it, which the zonal approach neglects. However, when thresholding is used, I have to transmit, in addition to the coefficients, their locations, so bits are spent on the locations, while the locations are fixed and known when the zonal approximation is used. If I look at the difference between the original and the encoded versions, here is the difference, scaled appropriately. What you can see is that the difference is largest at the edges, because edges represent high-frequency information, which was discarded in this particular case.

We conduct the same experiment when only 4 coefficients out of 64 are kept; here are the results of thresholding and of the zonal approximation. The same comments can be made as before, but I believe it is clearer here that the zonal approximation results in a larger error, judging from the intensity of the error image, than the method using thresholding. And again, the same experiment where now only 2 out of 64 coefficients are kept. I believe it is quite impressive that, even with only 2 out of 64 coefficients kept, we still obtain reconstructions that convey quite a lot of information about the original image. You clearly start seeing these blocking artifacts, but nevertheless, these images convey quite a bit of information about the original image. Here, I believe, the error is much more pronounced with the zonal approximation than with thresholding.
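The thresholding-versus-zonal comparison can be sketched on a single block as follows; this is a toy version under my own assumptions (a real experiment would loop over all 8-by-8 blocks of the image, and the mask helpers are hypothetical names).

```python
import numpy as np
from scipy.fft import dctn, idctn

def threshold_mask(F, k):
    """Keep the k largest-magnitude coefficients; their locations vary
    per block, so they would have to be sent as side information."""
    mask = np.zeros(F.size, dtype=bool)
    mask[np.argsort(np.abs(F).ravel())[-k:]] = True
    return mask.reshape(F.shape)

def zonal_mask(shape, k):
    """Keep the k coefficients closest to DC -- a fixed, known zone,
    so no location bits are needed."""
    u, v = np.indices(shape)
    mask = np.zeros(u.size, dtype=bool)
    mask[np.argsort((u + v).ravel())[:k]] = True
    return mask.reshape(shape)

patch = np.add.outer(np.arange(8.0), np.arange(8.0)) ** 1.5  # smooth test patch
F = dctn(patch, norm='ortho')
for name, mask in [('threshold', threshold_mask(F, 8)),
                   ('zonal', zonal_mask(F.shape, 8))]:
    recon = idctn(np.where(mask, F, 0.0), norm='ortho')
    print(name, np.abs(patch - recon).max())  # thresholding errs less, typically
```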
So this is the basic idea behind transform-based encoding, and as we will see next, when we discuss JPEG, these are exactly the steps taken by JPEG, or by any transform-based encoding scheme. By and large, the differences lie in how quantization is performed: how to decide which coefficients to keep and which to throw away, and more specifically, how to encode each and every coefficient, with a fine or a coarse quantizer. The ideas shown here are simple ones, meant only to demonstrate the effectiveness of decorrelation; where the innovation and the novelty come in, as you will see, is in the way of quantizing, as well as entropy encoding, the resulting coefficients.
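As a final illustration of a quantizer with a "spectral shape", the step can simply grow with frequency, so that low frequencies are quantized finely and high frequencies coarsely. The step matrix below is made up for illustration; JPEG uses fixed, perceptually tuned tables of this general flavor, not these numbers.

```python
import numpy as np
from scipy.fft import dctn, idctn

# Hypothetical spectrally shaped step matrix: small steps (fine quantization)
# near DC, large steps (coarse quantization) at high frequencies.
u, v = np.indices((8, 8))
steps = 8.0 + 4.0 * (u + v)

def encode_block(patch):
    """Transform, then divide each coefficient by its own step and round."""
    return np.round(dctn(patch.astype(float), norm='ortho') / steps)

def decode_block(levels):
    """Multiply back by the steps ('dequantize'), then inverse transform."""
    return idctn(levels * steps, norm='ortho')
```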