Now we're back to the layering architecture diagram that we saw back in lecture thirteen, when we talked about the three underlying design principles of the Internet. Here we wrote down a bunch of acronyms. Network engineers are not shy about inventing new acronyms: there are maybe 1,000 commonly used ones and probably 100,000 in all, and each standardization body can easily generate thousands more, so you could fill an encyclopedia with nothing but network protocol acronyms. But we are going to highlight just a few important ones. You see the physical layer, which could be a wireless, fiber, copper, or cable medium. Then there is the link layer; next lecture, for example, we will talk about WiFi's CSMA for controlling access to the shared communication medium. Then there is the network layer, running the IP protocol for connectivity. Then comes something very important for video: the transport layer. We will soon see that there are protocols called RTP and UDP that are alternatives to TCP and are often viewed as more suitable for video traffic. Then there is the application layer, with a lot of protocols, including important and familiar ones like HTTP, which you have seen in web browsing; SIP, IGMP, and RTSP are all multimedia-related, and we will come back to some of these in about ten minutes.

And then I separate out one more layer, which is a slightly non-uniform notation. Some people treat the compression standards as part of the application layer, as part of the media playback applications, but I separate them out to highlight that things like MPEG and a bunch of other video compression standards define the way video is encoded and delivered, and they sit on top of the application layer. So our focus in the first module of this lecture will be on compression, and the next module will be on the application and transport layers, with a few more details in the advanced material part of the lecture. We will also see why certain transport-layer protocols are chosen, in part because of the kind of compression technology used above them. So we see layering at work in defining functionality allocation: who does what in delivering video.

Before we can talk about compression, let's first define what a video is. A video is nothing but a sequence of frames displayed at a particular speed. This speed could be, for example, 25 or 29.97 frames per second for standard-definition TV, depending on the standard and the country you are talking about, or 50 or 60 frames per second for HDTV. That is roughly the number of frames per second needed to sustain the perception of smooth motion in the human brain. Each of these frames is nothing but a still picture, and a still picture is basically a bunch of pixels; even on a high-resolution display, the pixels are simply too small for the human eye to detect their edges. So what is a pixel, then? Each pixel is really a color and luminance value encoded digitally in bits. The color can be based on a combination of the three primary colors, or some other set of three color coordinates. The bit rate, therefore, is basically the number of frames per second times the number of bits per frame. So what kind of numbers are we talking about here? The number of frames per second would be 50 or 60. The number of bits per frame depends on the resolution: one HDTV format is 1280 by 720 pixels, which is 921,600 pixels altogether, and another is 1920 by 1080, which gives a much bigger number, 2,073,600 pixels. And that is just the pixel count; I still have to say how many bits per pixel. That depends on how many colors you are trying to encode and how finely you differentiate luminance, but let's say you need 32 bits for each pixel. Multiply that by the number of pixels per frame, then by the number of frames per second, and you will see that the raw bit rate of a video, at least an HD video, is a very big number.
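To make the arithmetic concrete, here is a minimal sketch of that raw bit rate calculation, using the lecture's illustrative numbers of 32 bits per pixel and 60 frames per second (actual formats vary; these figures are only for the back-of-the-envelope estimate above).

```python
# Raw (uncompressed) video bit rate = pixels/frame * bits/pixel * frames/second.
# 32 bits/pixel and 60 frames/second are the lecture's illustrative numbers,
# not a claim about any particular video format.

BITS_PER_PIXEL = 32
FRAMES_PER_SECOND = 60

for width, height in [(1280, 720), (1920, 1080)]:
    pixels_per_frame = width * height
    raw_bps = pixels_per_frame * BITS_PER_PIXEL * FRAMES_PER_SECOND
    print(f"{width}x{height}: {pixels_per_frame:,} pixels/frame, "
          f"raw rate ~ {raw_bps / 1e9:.2f} Gbit/s")

# 1280x720:  921,600 pixels/frame,   raw rate ~ 1.77 Gbit/s
# 1920x1080: 2,073,600 pixels/frame, raw rate ~ 3.98 Gbit/s
```

Several gigabits per second for a single uncompressed HD stream is clearly far more than typical access links can carry, which is exactly the point the lecture is making.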
So here comes the role of compression. Compression is basically an exercise in removing the redundancy in signals, whether that is a data or text signal, a speech signal, an image, or a moving picture. There are two kinds of compression. One is lossless: you compress the signal, send it through the pipe or the Internet, and then decompress it, and the output can be made exactly identical to the input -- not very similar, but exactly identical. A famous lossless compression technique is Lempel-Ziv, which is the underlying compression method for the zip format you use for compressing and uncompressing files on your computer. And then there is lossy compression, where what comes out of the decompressor is close to what went into the compressor, but not quite the same. So the input signal is X and the output is X hat, and I can look at the difference between X and X hat, for example with the L1 norm, the sum of absolute differences (we will use that in a minute), or the L2 norm. Here I am viewing each signal not as a single scalar number but as a long vector of the bits representing it.

In lossy compression, therefore, there is no free lunch: there is a tradeoff between the compression ratio -- how much you reduce the bit rate -- and the resulting fidelity. For example, I can give you a compression ratio of 100, taking 100 bits per second down to one bit per second, but the resulting fidelity might suffer; if I make the compression ratio less aggressive, say ten, the resulting fidelity might be better. In general we look at this tradeoff through a picture called the rate-distortion curve: I plot the rate R on the x-axis, that is, the playback or encoding rate of the video, and the distortion D on the y-axis, which again can be measured by the L1 norm, the L2 norm, or whatever you like. In general, the tradeoff between the two is a convex-shaped curve. And if you can invent a better compression scheme, you are, geometrically speaking, pushing this convex curve toward the origin: for the same playback bit rate, I can give you a much lower distortion than before, or, alternatively, for the same distortion requirement, I can get by with a much lower bit rate. So that would represent an enhancement of the compression technology.
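As a small illustration of the lossless-versus-lossy distinction (a sketch, not any particular codec): Python's zlib module implements DEFLATE, which builds on Lempel-Ziv-style dictionary coding and reproduces the input exactly, while a toy "lossy compressor" that simply quantizes sample values does not, so we can measure the gap between X and X hat with the L1 and L2 norms just described.

```python
import zlib

# Lossless: compress, decompress, and the output is bit-for-bit identical to
# the input. zlib's DEFLATE builds on Lempel-Ziv-style dictionary coding.
x = b"the same phrase repeats, the same phrase repeats, the same phrase repeats"
compressed = zlib.compress(x)
assert zlib.decompress(compressed) == x          # exactly identical
print(len(x), "bytes ->", len(compressed), "bytes (lossless)")

# Lossy (toy example): quantize each sample to a multiple of `step`.
# The reconstruction x_hat is only close to x; we measure the distortion
# with the L1 and L2 norms of (x - x_hat).
samples = [3, 14, 15, 92, 65, 35, 89, 79]
step = 8
x_hat = [round(s / step) * step for s in samples]

l1 = sum(abs(a - b) for a, b in zip(samples, x_hat))
l2 = sum((a - b) ** 2 for a, b in zip(samples, x_hat)) ** 0.5
print("x    :", samples)
print("x_hat:", x_hat)
print("L1 distortion:", l1, " L2 distortion:", round(l2, 2))
```

A coarser quantization step would shrink the compressed representation further but increase both distortion numbers, which is the rate-distortion tradeoff in miniature.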
Now, which bit rate should I pick? Part of that depends on the tolerable distortion, including factors like the kind of screen you are using: is it a Retina-class display or not? Part of it depends on the channel condition: if the channel is in a bad state, for congestion reasons or air-interface interference reasons, then you may say, well, let's take a smaller bit rate. And part of it, in today's economic environment, can also depend on the usage quota: quota-aware video adaptation looks at how much quota you still have, or how much we project you will have toward the end of the billing cycle, and adjusts the bit rate accordingly. In general it can be a combination of these factors.

You may wonder what there is to compress. How can I actually compress by a factor of 100, or a factor of 1,000? Are you sure that only 1%, or 0.1%, of the signal is actually needed? Well, first of all, there is distortion, so don't forget about that. Second, you would be surprised by the amount of redundancy in signals. For motion pictures, for example, frame-to-frame similarity can be striking -- certainly for talk shows, but even for motion-rich movies -- because human perception, our neurons and brain, relies on the similarity between one frame and the next to register motion. So precisely because of this redundancy, when you transmit you can say, "Gee, I will just transmit this picture: it's two guys fighting. The next picture is still two guys fighting, except this guy's arm moves upward, so just encode that part; the rest remains the same." That is one way to compress: take advantage of redundancy. Another way is to take advantage of human visual limitations: even when there are differences, we may not be able to process them in the brain. For example, there are certain ranges of frequencies in a signal where people tend not to detect differences very well. Transform coding is an example: you put the signal into the right coordinate representation, say the frequency domain, look at the components of the signal, and start ignoring the higher-order ones. The third way is statistical structure: certain things just happen a lot more often. For example, in text or speech encoding people use Huffman coding, where you give a shorter description to more frequently occurring symbols or phrases; this way, the expected length you need to encode a paragraph or a textbook becomes smaller. Indeed, in Morse code you will see that the more frequently used letters of the alphabet are given shorter representations. So whether it is redundancy, human visual limitations, or statistical structure in the signal, there are many places where you can compress.
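To make the statistical-structure idea concrete, here is a minimal Huffman coding sketch; the symbol frequencies below are made up purely for illustration. The more frequent a symbol, the shorter its codeword, and the expected description length comes out below that of a fixed-length code.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman code: more frequent symbols get shorter codewords."""
    tiebreak = count()  # avoids comparing dicts when two frequencies are equal
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # the two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}   # left branch gets '0'
        merged.update({s: "1" + w for s, w in c2.items()})  # right gets '1'
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical symbol frequencies, chosen only for illustration.
freqs = {"e": 0.40, "t": 0.25, "a": 0.15, "o": 0.12, "z": 0.08}
code = huffman_code(freqs)
for sym in sorted(freqs, key=freqs.get, reverse=True):
    print(sym, freqs[sym], code[sym])

expected_len = sum(freqs[s] * len(code[s]) for s in freqs)
print("expected bits/symbol:", round(expected_len, 2),
      "vs 3 bits/symbol for a fixed-length code over 5 symbols")
```

With these frequencies the expected length works out to about 2.15 bits per symbol instead of 3, which is exactly the Morse-code intuition: spend the short codewords on the common symbols.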
And indeed, people have worked very hard over the past twenty years and more to compress motion pictures. MPEG is the key family of compression standards. MPEG-1, back in 1992, was used for VCD; we are talking about something like one megabit per second for encoding and playback. Then came MPEG-2, which is also called H.262, because the ITU -- the International Telecommunication Union, a United Nations standardization body -- names its standards "H.something". Whatever you call it, it was finished in 1996 and used for DVD, which lasted ten-plus years as pretty much the dominant medium for storing movies and TV, at about ten megabits per second -- much better quality than VCD. It is actually hard to find VCDs these days. And of course you must have heard of MP3; you must have listened to MP3 music. These are audio tracks recorded and compressed using that standard. MP3 is actually not a standalone standard: it is Layer 3 of MPEG-2. "Layer 3" here has nothing to do with the protocol layers in the networking community; it is the module of the MPEG-2 motion picture standard that compresses the audio track. Now you may wonder: there are MPEG-1, -2, and -4, so what about MPEG-3? Well, MP3 is not MPEG-3; it is part of MPEG-2. There was actually no MPEG-3: it was started but then absorbed back into MPEG-2. MP3 can achieve a compression ratio of about twelve to one for music, and for people listening in a crowded spot or on the subway, that is probably good enough; but if you listen in a quiet space, you can tell the difference between MP3 music and, for example, DVD quality.

Then, in 2000, came MPEG-4, and this is the current family of video compression standards that people use. In 2004, Part 10 of MPEG-4 was finished -- people had realized it was better to name the different parts rather than keep creating MPEG-5, -6, -7, -8, and so on. This Part 10 is also called H.264, and at this point it is the major video compression standard. It has sixteen so-called profiles, which give you quite a bit of flexibility for different types of video, and it is used for HDTV and Blu-ray: real HDTV is more like twenty megabits per second, and Blu-ray something like forty megabits per second. And it can easily achieve a compression factor of 100 -- these standards tend to be on the order of 100-to-200 compression ratios -- and that is what makes it possible to squeeze that many bits through the Internet.

Now, these are not the only standards; there are quite a few others. For example, H.261 was once upon a time quite popular for IP video. Apple has QuickTime, which is merging into MPEG-4 now; Windows has Windows Media Player; Adobe has Flash; and RealNetworks has RealPlayer. These are the main playback formats and standards, and as you can see there are still proprietary ones, which makes things a little difficult -- for example, many Apple devices do not like Flash, as Apple believes it consumes more energy than needed, and so on. But whoever is on the right side of the argument between Apple and Adobe, you can see that there are many different compression standards out there.
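Connecting this back to the raw bit rate computed earlier, here is a small sketch of where the "100-to-200" compression ratio comes from; the numbers are the lecture's rough figures and are illustrative rather than exact.

```python
# Rough numbers from the lecture: raw 1080p video is roughly 4 Gbit/s
# uncompressed, while H.264-encoded HDTV is around 20 Mbit/s and
# Blu-ray around 40 Mbit/s. The implied compression ratios:
raw_bps = 1920 * 1080 * 32 * 60   # ~3.98e9 bits/s, as computed earlier

for label, encoded_bps in [("HDTV (~20 Mbit/s)", 20e6),
                           ("Blu-ray (~40 Mbit/s)", 40e6)]:
    print(f"{label}: compression ratio ~ {raw_bps / encoded_bps:.0f}:1")

# HDTV (~20 Mbit/s):    compression ratio ~ 199:1
# Blu-ray (~40 Mbit/s): compression ratio ~ 100:1
```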
A lot of these standards exploit the redundancy from one frame to another, and this is where the group of pictures concept comes in. We encode the frames in blocks, and each block is called a GOP, a group of pictures. A group of pictures consists of three kinds of frames -- not three frames, three types of frames -- called I, P, and B frames. The I frame is the intra-coded frame, and each GOP always starts with an I frame: this is a frame whose encoding does not depend on the frames before or after it. For example, when the scene switches from two guys fighting to one guy lying on the ground, you say, all right, I am going to start an independent frame here. Then there is the P frame, the predictively coded frame: this type of frame depends on the previous I or P frame as its reference point. And then the B frame, the bidirectionally predictively coded frame, is a type of frame that depends on the I or P frames both before and after it.

Now, I draw the I, P, and B frames with different heights, because the heights represent the sizes of the frames. The I frame is the most important one: it starts each GOP, it is independently encoded, and it does not depend on previous or future frames, so it takes the largest number of bits to encode. The P frames tend to be much smaller, taking far fewer bits. And B frames leverage the frame-to-frame redundancy of motion pictures the most, looking both before and after, and therefore take the fewest bits to encode.

Now, how should I decide the length of the GOP, and what kind of I/P/B structure should I use? That depends on quite a few different factors. One is bit rate efficiency: if you want very efficient compression, then once you start an I frame you should not start a new one; just keep piling on P frames and B frames, because these are small frames, and the bit rate becomes lower. But, as we will see in an example momentarily, error resilience then suffers: if an I frame is lost, you have to retransmit the entire GOP, and a longer GOP means that if something goes wrong -- say, the I frame is dropped -- you have to retransmit a lot more frames. So there is a tradeoff between bit rate efficiency and resilience to errors. Then there is also instant channel change, at least in IPTV scenarios, where the operator still has to maintain the TV-like experience: you hold a remote with channel-up and channel-down buttons, and certain people watch TV by just flipping -- they stay on each channel for about ten seconds, flip through 200 channels, and then go to sleep. If you want to give instant channel change -- you press the button and immediately see the picture, as in the old days of TV -- then you have to do quite a few things; we will see in a minute that you actually have to use unicast to help with the multicast of channelized TV content. Another implication is that, because a GOP is the basic logical unit for playback, a longer GOP means you have to wait longer to change channels: you need to wait until a new GOP kicks in. So it is actually not that easy to decide either the length or the structure of each GOP.
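To see the bit rate side of this tradeoff, here is a small sketch with hypothetical per-frame sizes (the byte counts and GOP patterns below are made up for illustration, chosen only to reflect the relative sizes I > P > B from the lecture's figure): the longer the GOP, the lower the average bits per frame, but the more frames you must re-send or wait through when an I frame is lost or when the viewer changes channels.

```python
# Hypothetical per-frame encoded sizes, reflecting the relative heights in the
# lecture's figure: I frames largest, P frames smaller, B frames smallest.
FRAME_BITS = {"I": 100_000, "P": 30_000, "B": 10_000}

def average_bits_per_frame(gop):
    """Average encoded size per frame for one GOP pattern, e.g. 'IPBB'."""
    return sum(FRAME_BITS[f] for f in gop) / len(gop)

for gop in ["IPBB",                # short GOP: resilient, fast channel change
            "IPBBPBBPBBPBB",       # longer GOP: cheaper per frame
            "I" + "PBB" * 9]:      # even longer GOP: cheapest, least resilient
    print(f"GOP {gop:<30} avg {average_bits_per_frame(gop):>8,.0f} bits/frame, "
          f"{len(gop)} frames affected if the I frame is lost")
```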
So here is a particular example to illustrate one point: the error resilience of the different types of frames. Let's take a very simple case. The GOP structure is I, P, B, B -- that's it, a very short GOP: it starts with an I frame as always, then a P frame, then two B frames, only four frames. And let's measure error by the L1 norm. Call the pixel value of frame i at coordinate (x, y) P_i(x, y) -- don't worry about whether this is the color or the luminance, just treat it as a single number -- and suppose an error occurs, leading to a different pixel value being used at playback; call that value P̄_i(x, y). We then look at the absolute value of the difference between the two, summed across all (x, y) coordinates and all frames i. That will be our error metric.

And let's say we have a very small video: only two-by-two pixels, so four pixels altogether. It is also a very boring video: the first I frame is just 1, 1, 1, 1. No color, only luminance, and spatially uniform -- I doubt this would become a very popular YouTube video. In fact it gets even worse: it is not only boring within a frame, it is boring across all the following frames too. The I frame is all 1s, the P frame is all 2s, the first B frame is all 3s, and the second B frame is all 4s. So that is my GOP: two-by-two pixels, four pixels per frame, with an I, P, B, B structure. Before this GOP, the previous GOP ended with, say, a blank screen -- all 0s -- as its last B frame. And the next GOP starts with all 5s, which must be an I frame. Now, if everything goes through, there is nothing to talk about, but we are talking about error resilience, so suppose there is an error. We will look at the impact of dropping the I frame, versus the P frame, versus, say, the first B frame: how much is the error according to this metric?

First case: the I frame is dropped in the channel, in the network. The receiver sees no I frame -- then what can it do? Let's say the error-handling rule at the receiver is that if the I frame is lost, you just repeat the last frame. (You could also say: if the I frame is lost, please retransmit the whole GOP, because I cannot play without an I frame. That is another possibility.) But say the I frame is dropped and you just repeat. What happens is that you display 0 for the I frame -- actually 0, 0, 0, 0, but the advantage of using this boring example is that we can represent each frame with a single number, since every pixel is the same. So it is 0. Now the P frame says: I need to look at the previous I or P frame. The previous frame says 0, so the P frame displays 0 as well. Each B frame can look at either the I or P frame before it or the one after it. The one before it is just a copy of a dropped frame, so it uses the one after it, which is 5 -- that is the next GOP's I frame, but nonetheless. So both B frames display 5. Instead of displaying 1, 2, 3, 4 for these four frames, you get 0, 0, 5, 5. The error per pixel location is |1 - 0| + |2 - 0| + |3 - 5| + |4 - 5| = 6, and there are four identical pixel locations, so multiply by 4 to get 24. The error incurred is 24 units if you drop the I frame.

What if you drop the P frame? The I frame is properly received as 1 (that is, 1, 1, 1, 1). The P frame is dropped, so it looks at the previous I frame and displays 1 as well. Each B frame looks before and after: the frame before it is lost, but fortunately the one after it is still there, so it uses the first I or P frame after it, which is 5, just as on the last slide. So the displayed frames are 1, 1, 5, 5, and the error is |1 - 1| + |2 - 1| + |3 - 5| + |4 - 5| = 4 per pixel, times 4 pixels, which is 16 -- smaller than 24. If you instead drop a B frame, you get 1 correctly and 2 correctly, and say the frame whose true value is 3 is dropped. That B frame can look at the first I or P frame before it or after it; the one before it has already been properly received, so it just uses that value, 2. The next frame is properly received, so you get 1, 2, 2, 4, and you can easily see that the error in this case is restricted locally to this one B frame: |3 - 2| = 1 per pixel, times 4 pixels, which is 4 -- less than 16, which is less than 24. This difference quantifies the impact of dropping an I frame versus a P frame versus a B frame.
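Here is a short sketch that reproduces the three error values above under the same concealment rules the lecture assumes (repeat the previous frame if the I frame is lost; a P frame copies its reference; a B frame copies the nearest correctly received I or P frame before or after it). The frame values and the GOP come straight from the example; the code is an illustration, not a real decoder.

```python
# One GOP of the toy 2x2 video: each frame is spatially uniform, so a single
# number per frame is enough. True frame values: I=1, P=2, B=3, B=4.
# The previous GOP ended with a blank frame (0); the next GOP starts with I=5.
TRUE = [1, 2, 3, 4]          # I, P, B, B
PREV_FRAME, NEXT_I = 0, 5
PIXELS = 4                   # 2x2 pixels, all identical

def playback(dropped):
    """What gets displayed under the lecture's concealment rules (a sketch)."""
    if dropped == "I":
        # I lost: repeat the previous GOP's last frame; the P frame copies that
        # corrupted reference; each B frame falls back to the next GOP's I frame.
        return [PREV_FRAME, PREV_FRAME, NEXT_I, NEXT_I]
    if dropped == "P":
        # P lost: copy the correctly received I frame; each B frame again falls
        # back to the next GOP's I frame.
        return [TRUE[0], TRUE[0], NEXT_I, NEXT_I]
    if dropped == "B":
        # First B lost: copy the nearest correctly received I/P frame before it
        # (the P frame); all other frames play back correctly.
        return [TRUE[0], TRUE[1], TRUE[1], TRUE[3]]

for dropped in ("I", "P", "B"):
    shown = playback(dropped)
    error = PIXELS * sum(abs(t - s) for t, s in zip(TRUE, shown))
    print(f"drop {dropped}: displayed {shown}, L1 error = {error}")

# drop I: displayed [0, 0, 5, 5], L1 error = 24
# drop P: displayed [1, 1, 5, 5], L1 error = 16
# drop B: displayed [1, 2, 2, 4], L1 error = 4
```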
All right. So now, in the last module of this lecture before the advanced material part, we will talk about the application and transport layers below the compression layer and see some of the networking ideas that go into supporting video there.