The most exciting standard, from an academic point of view at least, is MPEG-4. It deviates from all other standards, although it also keeps some backward compatibility. This is the topic of this segment. One of its major differences is that it looks, at some level, at the content of the video. The video now consists of video and audio objects, which are encoded and manipulated separately. Because of that, new scenes can also be composed based on a script. We describe in detail how the video objects are encoded, in other words how their shape and texture are encoded. We also discuss the hybrid scenes that can be encoded by MPEG-4, that is, scenes consisting of synthetic and natural objects. So let's proceed with this really exciting topic. MPEG-4 is the standard that gave a fresh look at video compression. Up to that time, and actually after MPEG-4 as well, all the standards treat the video in a content-agnostic way: they take each frame, break it up into macroblocks and blocks, and proceed that way. With MPEG-4, we are now given the ability to access and manipulate the individual elements that make up each scene. So we are looking at the content, and at the spatial decomposition of the content in a frame. In a cartoon, for example, the repositioning, deletion, or altered movement of individual characters within a scene can be captured. This is a very representative slide, because it shows the structure of an MPEG-4 scene. We see here a scene depiction which consists of arbitrarily shaped objects, like the father and child, with an audio object associated with them. The floors and walls are synthetic, and sprites are used to generate them; this is a concept from computer graphics. Then there is an application screen here, this website. The furniture items are synthetic objects. And then there is a video frame, for example from an HDTV or DVD source, and so on. So you see here that we have a hybrid scene that consists of synthetic as well as natural objects, and the objects are individualized: we have the video object of the father and child and the associated audio object. So the MPEG-4 bitstream can accommodate such a composite scene. This is a major departure from all previous and subsequent standards. As demonstrated by the synthetic scene I just showed, MPEG-4 deals with the encoding of both audio and visual objects, and it provides video and audio coding tools. It merges natural and synthetic data, and that was one of the main intentions of the slide I just showed. It also provides a robust bitstream syntax and a flexible system layer for interactivity. So let's see some of the applications, the technology behind it, the composition of a frame, and then how video objects are handled, that is, how they are represented and encoded. The application space of MPEG-4 ranges from traditional applications, such as surveillance and mobile multimedia, to newer applications: quite a few in entertainment, interactive games, and DVD, as well as applications in content-based storage and retrieval and in interactive TV. These newer applications are based on the fact that the shape of objects is described, and therefore I can manipulate visual objects as well as audio objects. I can utilize them to do search, I can utilize them in an interactive game or interactive TV application, and I can also utilize them to recompose a scene based on a specific script. A small illustrative sketch of this object-based view of a scene follows.
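To make the object-based idea a bit more concrete, here is a minimal, purely illustrative Python sketch of how a scene might be represented as a collection of audiovisual objects that a script can reposition or drop before composition. This is not MPEG-4's actual scene description format (which is the binary BIFS syntax); all class and field names below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class AVObject:
    """One audiovisual object in the scene (hypothetical structure)."""
    name: str
    kind: str                           # "natural" (coded video/audio) or "synthetic" (sprite, mesh, ...)
    position: Tuple[int, int]           # where to place it in the composed frame
    audio_stream: Optional[str] = None  # id of an associated audio elementary stream

def compose(scene: List[AVObject], script: dict) -> List[AVObject]:
    """Apply a 'script': drop some objects, move others, keep the rest unchanged."""
    out = []
    for obj in scene:
        if obj.name in script.get("drop", []):
            continue                    # this object does not appear in this version of the scene
        new_pos = script.get("move", {}).get(obj.name, obj.position)
        out.append(AVObject(obj.name, obj.kind, new_pos, obj.audio_stream))
    return out

scene = [
    AVObject("father_and_child", "natural", (120, 80), audio_stream="speech_1"),
    AVObject("furniture", "synthetic", (40, 200)),
    AVObject("background_sprite", "synthetic", (0, 0)),
]
# Recompose the same objects differently, e.g. for a different market.
print(compose(scene, {"drop": ["furniture"], "move": {"father_and_child": (300, 80)}}))
```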
We see here an MPEG-4 scenario. A scene is decomposed, and elementary streams are sent over to the decoder; these elementary streams carry the primitive audiovisual objects. So the scene, again, has been decomposed into these audiovisual objects. Then, utilizing a scene description, as shown here, we can render and compose the scene according to the script of this scene description. One can envision applications whereby, depending on the characteristics of the various markets, a different scene description is used, and certain objects are either placed in a different position or do not appear at all, based again on the preferences and restrictions of the market. Now, when it comes to the coding of visual objects within MPEG-4, natural textures, images, and video can be accommodated. Frames can be progressive or interlaced. And here is a major innovation: we can encode arbitrarily shaped objects, and we are going to see exactly how this is done. Scalability is handled and addressed, and sprites are also part of the equation; these are, again, computer graphics modeling techniques. The resilience of the bitstream to errors in the channel is also addressed. When it comes to synthetic objects, facial animation can be handled, and we will talk a little bit about it; animated meshes and body animation came in the second version of MPEG-4. So the scene is decomposed into video object planes, VOPs. This particular frame is decomposed into three VOPs: VOP 0 representing this person; VOP 1 the car, or at least the visible part of the car, since it is partially occluded by the person; and VOP 2 the remainder of the scene, which is the house and the background. As we will see, each of these VOPs is encoded separately, and then they are simply put back together to recompose the scene. This is a very illustrative slide. It shows that the video object planes are composed from shape and texture. For VOP 0 here, the shape is the whole frame; this is the background, and this is the associated texture. Video object plane 1 is composed of these children, so this is the shape, the binary version of the shape, where white is where the objects are, and for these we have the textures. And then there is this moving slogan, I guess, for which we show both the shape and the texture. If all of these are put together, we compose the frame which is shown here. So the main idea is that, for each video object plane, both the shape and the texture need to be encoded. As far as the texture goes, as you will see right away, the techniques developed prior to MPEG-4, such as those in MPEG-2 and H.263, are utilized to encode it, but new technology is needed to encode the shape of the object. We show here the block diagram of an MPEG-4 coder and decoder. The video frame is input to the coder, the video object planes are identified and defined, and this gives rise to the scene and object description. Then each of the video object planes is encoded separately, audio is encoded down here, and all the information is multiplexed and sent over the network. At the other end, at the decoder, the stream is demultiplexed and each of the VOPs is decoded. The decoded information, along with the scene and object description, is utilized to compose the output video frame; a small sketch of this compositing step is shown below. Audio, of course, is also decoded and is output over here.
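As a rough illustration of the compositing step at the decoder, the following sketch pastes each decoded VOP onto an output frame wherever its binary shape mask is set; the ordering of the list acts as a simple depth order. This is only a conceptual sketch under those assumptions, not the normative MPEG-4 composition process.

```python
import numpy as np

def compose_frame(background: np.ndarray,
                  vops: list) -> np.ndarray:
    """Paste each (mask, texture, (row, col)) VOP onto a copy of the background.

    mask      : 2-D array of 0/1 (the decoded binary shape)
    texture   : array of the same size with the decoded pixel values
    (row, col): top-left corner of the VOP's bounding box in the frame
    Later VOPs in the list are pasted on top of earlier ones.
    """
    frame = background.copy()
    for mask, texture, (r, c) in vops:
        h, w = mask.shape
        region = frame[r:r + h, c:c + w]
        region[mask == 1] = texture[mask == 1]   # keep the background where mask == 0
    return frame

# Tiny example: an 8x8 gray background plus one 3x3 object.
bg = np.full((8, 8), 128, dtype=np.uint8)
mask = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=np.uint8)
tex = np.full((3, 3), 255, dtype=np.uint8)
print(compose_frame(bg, [(mask, tex, (2, 2))]))
```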
What is really different when we look at MPEG-4, compared to the previous and subsequent standards, is how the VOP encoding and decoding take place, and this is what we are going to discuss next. As mentioned earlier, and as will also be shown later, when it comes to the encoding of an object, the texture of the object and the shape of the object are encoded separately. This is what this VOP encoder schematic is showing here. In the upper part, the underlying texture of the object is divided into macroblocks, motion estimation and compensation are performed, then the displaced frame difference is DCT'd and quantized, and this becomes the encoding of the texture; while down here is the motion information, the motion vectors that need to be sent to the decoder. So this upper part that I just described represents a standard H.263/MPEG-2 video encoder. What is again different with MPEG-4 is that there is a shape coding module here that is going to encode the shape of the object, and all three pieces of information will be multiplexed and sent over to the decoder. So what we look at next is how shape encoding is carried out in MPEG-4. There are different ways to describe the shape of objects. One way is to describe the boundary of the object, while another is to treat the shape as a binary mask, as shown here. The approach followed by MPEG-4 was to encode the mask that represents the shape. A bounding box is utilized around the object of interest. This bounding box is then divided into blocks, as shown here. Clearly, there are solid blocks, like this one, that belong entirely to the object, and also blocks that belong entirely to the background. These blocks do not really need to be encoded; only their type needs to be signaled to the decoder. The blocks that really require encoding are the boundary blocks, which are referred to as Binary Alpha Blocks, or BABs. These blocks contain both background and object pixels, and again this is a mask, so the object, let's say, is represented by zeros and the background by ones, as shown here in color. So the question is how we can encode this binary block, the block which has zeros and ones. Clearly, the material we covered two weeks ago, when we talked about lossless compression, fax coding, arithmetic coding, and context-adaptive arithmetic coding, could come to mind, because that is exactly what MPEG-4 is utilizing. So let's look at the details next. First of all, similarly to the intra and inter macroblocks when we encode texture, there is intra-shape and inter-shape coding. With inter-shape coding there is motion estimation and compensation involved, while with intra-shape coding the shape is encoded by itself. For these binary alpha blocks, or BABs, the binary shape pixels are encoded using context-based arithmetic encoding, CAE. To encode this pixel, which is a binary pixel, these ten neighbors are used. There are therefore 2 to the 10th, or 1024, different contexts, so there are 1024 different arithmetic coders that are utilized. They all use the same machinery, but the probabilities of zeros and ones differ for each of them. So we look at the ten neighbors and pull out one of the 1024 arithmetic coders to encode the value of x. A small sketch of this intra-mode context computation is given below.
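Below is a small sketch of the context-modeling part of intra-mode CAE. It builds a context index from ten already-coded neighbors and uses it to select one of 1024 probability estimates; the binary arithmetic coder that would actually consume these (pixel, probability) pairs is not shown, and the template positions and probability table used here are illustrative, not the exact ones from the MPEG-4 specification.

```python
import numpy as np

# Illustrative 10-pixel template of already-coded neighbors (row, col offsets
# relative to the pixel being coded); the real MPEG-4 template differs in detail.
TEMPLATE = [(-2, -1), (-2, 0), (-2, 1),
            (-1, -2), (-1, -1), (-1, 0), (-1, 1), (-1, 2),
            (0, -2), (0, -1)]

def context_index(bab: np.ndarray, i: int, j: int) -> int:
    """Pack the 10 neighbor bits into an integer in [0, 1023]."""
    ctx = 0
    for k, (di, dj) in enumerate(TEMPLATE):
        r, c = i + di, j + dj
        bit = bab[r, c] if 0 <= r < bab.shape[0] and 0 <= c < bab.shape[1] else 0
        ctx |= int(bit) << k
    return ctx

def estimated_bits(bab: np.ndarray, p0_table: np.ndarray) -> float:
    """Estimate the coding cost of the BAB: each pixel costs -log2(P(value | context)).
    A real encoder would instead drive a binary arithmetic coder with these probabilities."""
    total = 0.0
    for i in range(bab.shape[0]):
        for j in range(bab.shape[1]):
            p0 = p0_table[context_index(bab, i, j)]
            p = p0 if bab[i, j] == 0 else 1.0 - p0
            total += -np.log2(p)
    return total

rng = np.random.default_rng(0)
p0_table = np.clip(rng.random(1024), 0.01, 0.99)   # stand-in for the trained probability table
bab = np.zeros((16, 16), dtype=np.uint8)
bab[4:12, 4:12] = 1                                # a simple square "object" inside the block
print(f"{estimated_bits(bab, p0_table):.1f} bits (with a random probability table)")
```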
If inter-shape coding is used, then we use not only the current BAB but also the motion-compensated BAB, and the neighborhood is as shown here: there are four neighboring pixels in the current frame, and then there are five neighbors in the previous, motion-compensated, frame. So in this particular case there are only 4 plus 5, that is 9, neighbors, and therefore 2 to the 9th, or 512, different contexts. This is a frame from the MPEG-4 [UNKNOWN] effort, and we can see here, when the video plays, that what we show is the boundary of the objects. We see some of the challenges: the objects are deformable, they change shape very easily, and we have three objects, but these objects merge at some point and become two. Right, so now we have three objects; now we have two, since the ball and the child merged; and so on. So this is certainly a challenging situation when coding this boundary information, or shape information. The texture is encoded the traditional way. In other words, we have here the mask identifying the object, and we see that this macroblock-based mask is extracted. Then each macroblock is encoded the good old standard way: if it is intra, it is encoded by itself; if it is inter, it is motion compensated and the prediction error is encoded. And after I reconstruct this, I use the mask, I superimpose the mask on it, and I extract both the shape and the texture of this visual object. Sprites are also accommodated by MPEG-4. We see here a scene, and the idea is that the background is encoded using a sprite. First, some time is spent on a mosaic composition of the background, and then the background can be manipulated by global motion parameters such as pan, rotation, and zoom. The idea is that the background has already been transmitted to the decoder, and at each time instant only the object of interest is extracted and encoded. So what is sent over to the decoder is this visual object plus parameters regarding the background. Based on these parameters, the sprite-based representation of the background available at the decoder is manipulated, and then the object is added to compose the final scene. You can see there is a much wider view of this mosaic reconstruction of the background, and, based on these motion parameters, we can zoom into a section of it, and we see, again, the visual object, this tennis player. A small sketch of this sprite-based background warping is given below.
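To illustrate the sprite idea, here is a sketch in which the decoder keeps a large background sprite and, at each time instant, extracts the current background view from it using a few global motion parameters (here just pan and zoom, as a simple stand-in for the warping actually specified by MPEG-4), then pastes the separately decoded foreground object on top. All function and parameter names are illustrative.

```python
import numpy as np

def background_from_sprite(sprite: np.ndarray, pan, zoom: float, out_shape) -> np.ndarray:
    """Cut the current view out of the stored sprite.

    pan  : (row, col) of the view's top-left corner inside the sprite
    zoom : > 1 means zoomed in (the view covers a smaller area of the sprite)
    Nearest-neighbor sampling keeps the sketch short.
    """
    h, w = out_shape
    rows = (pan[0] + np.arange(h) / zoom).astype(int).clip(0, sprite.shape[0] - 1)
    cols = (pan[1] + np.arange(w) / zoom).astype(int).clip(0, sprite.shape[1] - 1)
    return sprite[np.ix_(rows, cols)]

def compose_with_sprite(sprite, pan, zoom, fg_mask, fg_texture, fg_pos, out_shape):
    """Warp the sprite to get the background, then add the foreground object (e.g. the tennis player)."""
    frame = background_from_sprite(sprite, pan, zoom, out_shape)
    r, c = fg_pos
    h, w = fg_mask.shape
    region = frame[r:r + h, c:c + w]
    region[fg_mask == 1] = fg_texture[fg_mask == 1]
    return frame

sprite = np.tile(np.arange(256, dtype=np.uint8), (128, 2))   # a wide 128x512 "mosaic" background
mask = np.ones((20, 10), dtype=np.uint8)                     # toy foreground object
tex = np.full((20, 10), 255, dtype=np.uint8)
frame = compose_with_sprite(sprite, pan=(10, 100), zoom=1.5,
                            fg_mask=mask, fg_texture=tex, fg_pos=(30, 40),
                            out_shape=(64, 96))
print(frame.shape)
```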
When it comes to face objects, there are facial definition parameters, facial animation parameters, and facial interpolation parameters; these are the three sets of data that can be used. The facial definition parameters define the proportions and the shape of the face. The facial animation parameters animate the vertices of the wireframe that underlies the model, and there are specific sets of parameters that control, for example, the inner-lip and outer-lip movement, the nose, the eyes, and so on. And then the facial interpolation transform is utilized to interpolate missing frames or to increase the temporal frame rate. The face is defined either through a 3D feature-point set or through a 3D mesh or scene graph; then there is a facial texture that is superimposed on it, and the facial animation parameters further control the motion of the face. So we see here an example of the underlying wireframe of this face model. It consists of vertices that can be manipulated by the facial animation parameters, and then we see the texture of the face of a person that is mapped, or superimposed, onto this wireframe. So, by controlling the underlying vertices of the wireframe, we can animate this realistic-looking person. A very rough sketch of this vertex-driven animation idea is shown below.
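As a final, very rough illustration of the wireframe idea: the sketch below stores the face model as an array of 3D vertices and applies a "facial animation parameter" as a displacement of a small group of vertices along a fixed direction, scaled by a face-dependent unit. The parameter name, vertex grouping, and scaling here are entirely hypothetical; the real FAP set and its normalization units are defined in the MPEG-4 standard.

```python
import numpy as np

# A toy "wireframe": a handful of 3D vertices (x, y, z). Indices 0-3 stand in
# for lower-lip vertices of a real face mesh (purely hypothetical layout).
vertices = np.array([[0.0, -1.0, 0.0],
                     [0.3, -1.0, 0.0],
                     [-0.3, -1.0, 0.0],
                     [0.0, -1.1, 0.1],
                     [0.0,  1.0, 0.0]])   # e.g. a forehead vertex, unaffected

# Hypothetical animation parameter: which vertices it moves, and in which direction.
OPEN_JAW = {"indices": [0, 1, 2, 3], "direction": np.array([0.0, -1.0, 0.0])}

def apply_fap(verts: np.ndarray, fap: dict, amplitude: float, unit: float = 0.05) -> np.ndarray:
    """Displace the parameter's vertex group by amplitude * unit along its direction."""
    out = verts.copy()
    out[fap["indices"]] += amplitude * unit * fap["direction"]
    return out

# Animate a few frames: the jaw opens progressively while the rest of the face stays put.
for t, amp in enumerate([0.0, 1.0, 2.0, 3.0]):
    frame_verts = apply_fap(vertices, OPEN_JAW, amp)
    print(f"frame {t}: lower-lip y = {frame_verts[0, 1]:+.2f}")
```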