You now know about the main evaluation methods for GANs. In this video, you'll learn about a few tricks. In evaluating GANs, the statistics of the real versus the fake data sets matter in terms of how you sample them. You'll see how they differ, and I'll discuss a trick you can apply after training to shift toward either greater fidelity or greater diversity. Also, get hyped, because I'll be talking about a recently developed human evaluation metric referred to as HYPE.

In addition to sample sizes, how do you select which images to sample when evaluating with FID? For reals, you can just sample uniformly at random, but for fakes, what's typically done is to sample z values based on the training distribution of z values, that is, the prior distribution of your noise vectors p(z). You'll usually train your GAN with noise vectors drawn from a normal prior, meaning a z selected from a normal distribution like the one you see here, centered around a mean of zero with a standard deviation of one. Vector values closer to zero will occur more often than those further away during training. As a result, when you sample values close to zero, your resulting images will actually look pretty good, but that's only fidelity, because it also comes at a loss of diversity. As you can see here, these two dogs look pretty good but are less diverse, while the ones out here might look a little more funky. You can trade off fidelity and diversity by choosing where to sample: sampling in the middle will get you more normal-looking images. For better or worse, your sampling technique becomes an important part of evaluation and downstream use, because evaluation metrics like FID or Inception Score operate on samples from your model, not on your model parameters. That means that evaluating a GAN is very much sample dependent, so this does matter.

Following this observation on fidelity versus diversity, there's a really neat sampling trick called the truncation trick. The truncation trick is exactly that, a trick, because it's done after the model has been trained, and it broadly trades off fidelity and diversity. What it actually does is truncate the normal distribution you see in blue, which is where you sample your noise vector from during training, into this red-looking curve by chopping off the tails. This means you will not sample at values out here past a certain hyperparameter, and this hyperparameter determines how much of the tails you keep: you can truncate up here, or truncate out here so that you keep more of the curve. If you want higher fidelity, which is roughly the quality and realism of your images, you want to sample around zero and truncate a larger part of the tails. However, this will decrease the diversity of your generated images, because the tails are where your funky flying tacos and other weirder stuff are generated, things that didn't get much discriminator feedback during training. If you want greater diversity, you want to sample more from the tails of your distribution and truncate less, keeping a larger portion of the curve. However, the fidelity of these images will also be lower, because your generator hasn't tuned its weights to turn those noise vectors into beautiful, realistic images. That is, it didn't get nearly as much feedback on its realism for noise vectors sampled from these regions during training.
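To make the sampling side concrete, here's a minimal PyTorch sketch of truncated noise sampling, assuming a BigGAN-style resampling scheme where entries of z whose magnitude exceeds a threshold are redrawn. The function name, `z_dim`, and the generator `gen` in the usage comment are hypothetical stand-ins, not code from this course.

```python
import torch

def get_truncated_noise(n_samples, z_dim, truncation):
    """
    Sample noise from a standard normal prior, then resample any entries
    whose magnitude exceeds `truncation` (assumed > 0). A small threshold
    keeps z near zero (higher fidelity, lower diversity); a large one keeps
    more of the tails (more diversity, potentially lower fidelity).
    """
    z = torch.randn(n_samples, z_dim)
    out_of_range = z.abs() > truncation
    while out_of_range.any():
        # Redraw only the out-of-range entries until all fall within the threshold.
        z[out_of_range] = torch.randn_like(z[out_of_range])
        out_of_range = z.abs() > truncation
    return z

# Hypothetical usage with a trained generator `gen` (not defined here):
# fakes = gen(get_truncated_noise(64, 128, truncation=0.7))
```

With this kind of helper, sweeping the truncation threshold after training is how you'd dial the fidelity-diversity trade-off without touching the model's weights.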
Finally, you can definitely train your model with a different prior noise distribution to sample your noise vectors from, such as the uniform distribution, where there is no concentration of possible values like up here. But the normal distribution is pretty popular, in part because you can then use the truncation trick, shown here, to tune for the exact fidelity and diversity trade-off you want. There really hasn't been a stark difference when people have experimented with different prior noise distributions. As expected, a model's FID score will be higher, which means it's worse, when there's a lack of diversity or fidelity. Samples produced with the truncation trick might not do well on FID, even though they might conform to what you want for the application you're looking to apply your GAN to, something downstream where you need higher-fidelity images and don't want the extra gunk.

Speaking of using those truncated samples because they seem better to the human eye than to FID: using people to evaluate and eyeball samples is still a huge part of evaluation, and often an important part of the process of developing a GAN. What's cool is that there are methods that systematically evaluate quality based on principled crowdsourcing and perception tasks. In fact, one recently developed metric for GAN fidelity is one I invented in 2019 with other researchers from Stanford. Its name is a bit cheeky: HYPE, for Human eYe Perceptual Evaluation of generative models. HYPE displays a series of images one by one to crowdsourced evaluators on Amazon Mechanical Turk and asks them to assess whether each image is real or fake. What's cool is that one version, HYPE-Time, actually flashes images at you for different numbers of milliseconds to see at what threshold you can figure out whether an image is real or fake. The better your generative model is, the more time a human needs to decide whether an image is fake. HYPE-Infinity takes away that time threshold; evaluators simply look through the images with no time limit. Of course, HYPE is predicated on having great quality control and on managing learning effects, since an evaluator gets better and better at telling fake from real images over time.

Despite all this, ultimately evaluation will depend on the downstream task you have in mind. For example, if your GAN is meant to generate X-rays that show pneumonia, you want to make sure a doctor would agree that there actually is pneumonia in them, not an Amazon Mechanical Turk worker without a medical degree. And you probably don't want an ImageNet-pretrained classifier extracting features to tell you whether your fake image is close to a pneumonia X-ray before you claim that your GAN produces pneumonia on X-rays. Now you know how fakes are sampled, by using the prior distribution of z from training. At test or inference time, you also have a new trick up your sleeve, the truncation trick, which lets you truncate the tails of your sampling distribution a little more or a little less, depending on whether you're interested in higher fidelity or higher diversity. As is evident in the sampled images, automated evaluation metrics still don't capture exactly what we want, but they're a good approximation. That's why human perceptual evaluation still sets the benchmark and gold standard, and eyeballing samples remains a quick way to evaluate images during development.
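Since FID keeps coming up as the score that your sampling choices affect, here is a hedged, minimal sketch of the Fréchet distance computation itself, operating on pre-extracted feature arrays (for example, Inception-v3 pool features). The names `feats_real` and `feats_fake` are hypothetical inputs you would compute yourself; this illustrates the formula and is not a drop-in replacement for a tested FID library.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """
    Minimal FID computation on feature arrays of shape (n_samples, feature_dim).
    The result depends entirely on which samples you feed in, so truncated
    fakes can score worse (higher) even if they look better to the eye.
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_f = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the product of the two covariances.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_f, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return (np.sum((mu_r - mu_f) ** 2)
            + np.trace(sigma_r + sigma_f - 2 * covmean))
```

In practice you might compute this once on untruncated fakes and once on truncated fakes to see the fidelity-diversity trade-off show up directly in the score.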
Thanks for spending this time learning with me, and check back for just a little more about one extra evaluation method as you get ready to wrap up this week and get to your assignment.