To analyze the statistical significance of the identified peptide-spectrum matches (PSMs), we will introduce the concept of spectral dictionaries. Imagine that a PSM search of 1,000 spectra from a human sample against the human proteome results in 100 peptide-spectrum matches whose scores exceed a threshold. What is the fraction of erroneous peptide-spectrum matches among these 100? I'll give you a hint of how it may be possible to answer this question. Let's repeat the same experiment for a randomly generated decoy proteome of the same size as the human proteome. Of course, we don't care about the peptide-spectrum matches identified in the decoy proteome themselves: they are simply statistical artifacts. What we care about is the number of such hits in the decoy proteome. For example, if you identify 5 peptide-spectrum matches in the decoy proteome, then you expect that 5 out of 100, or 5%, of the PSMs identified in the real proteome are incorrect. Therefore, we define the false discovery rate (FDR) as simply the ratio of the number of peptide-spectrum matches identified in the decoy proteome to the number of peptide-spectrum matches identified in the real proteome. When we run this experiment on the T. rex spectra with a threshold of 100, we identify 27 peptide-spectrum matches in the real UniProt+ database and only 1 peptide-spectrum match in a decoy proteome of the same size. This means that, in this experiment, the FDR is a respectable 1/27, or about 3.7%. But does it mean that we just found approximately 27 T. rex peptides? Not quite, because many of the peptides that we identify are simply laboratory contaminants that are present in every experiment, for example, keratin from human skin. There are currently millions of tiny particles of my skin, and of the skin of the people who pass through this room, floating in the air.
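The false discovery rate described above is just a ratio, so it can be sketched in a few lines. This is a minimal illustration, not part of any particular search tool; the function name `false_discovery_rate` is an assumption made for this example, and the two calls reuse the numbers from the lecture.

```python
def false_discovery_rate(decoy_matches: int, target_matches: int) -> float:
    """Ratio of PSMs found in the decoy proteome to PSMs found in the
    real proteome, both counted at the same score threshold."""
    if target_matches <= 0:
        raise ValueError("need at least one PSM in the real proteome")
    return decoy_matches / target_matches


# 5 decoy hits against 100 hits in the real proteome.
print(f"{false_discovery_rate(5, 100):.1%}")

# The T. rex experiment: 1 decoy hit against 27 hits in the real database.
print(f"{false_discovery_rate(1, 27):.1%}")
```

Note that this is a bulk estimate for the whole set of matches: it says roughly how many of them are expected to be wrong, not which ones.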
The question that we have to answer to figure out which of the identified peptide-spectrum matches are correct is this: how do we estimate the statistical significance of an individual peptide-spectrum match, rather than the bulk false discovery rate for the entire sample? To answer this question, we will bring in a monkey. Give this monkey a typewriter and let the monkey type random keys on this typewriter for a very long time. Afterwards, let's check how many correctly spelled English words the monkey generated. We can use Webster's dictionary to check it, or whatever monkey dictionary you like. In this particular case, the monkey generated 13 correctly spelled English words. Does it mean that the monkey can spell? Well, to answer this question, we need to evaluate the expected number of words from the dictionary that appear in a randomly generated text. In other words, we need to solve the following Monkey and the Typewriter Problem: find the expected number of strings from a dictionary appearing in a randomly generated text. But what does it have to do with mass spectrometry? Well, at the same time, we want to solve the following mass spectrometry problem, the Expected Number of High-Scoring Peptides Problem: find the expected number of high-scoring peptides against a given spectrum in a decoy proteome. The input to this problem is a spectrum Spectrum, an integer n, and a score threshold. The output is the expected number of peptides in a decoy proteome of length n that score at least threshold against Spectrum. You may be wondering what the similarity is between the Expected Number of High-Scoring Peptides Problem and the Monkey and the Typewriter Problem. It is not obvious that these problems are equivalent, but they are. To explain why this is actually the same problem, I will introduce the notion of a spectral dictionary.
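One way to get a feel for the Monkey and the Typewriter Problem is to compute the expectation directly. The sketch below rests on assumptions that the lecture has not stated: the monkey hits each of the `alphabet_size` keys independently and with equal probability, and "the number of strings appearing" is read as the total number of occurrences, overlapping ones included. Under those assumptions, linearity of expectation gives (text_length - len(word) + 1) / alphabet_size^len(word) for each word, and the answer is the sum over the dictionary. The function name and the three-word dictionary are made up for illustration.

```python
def expected_dictionary_hits(dictionary, text_length, alphabet_size=26):
    """Expected number of occurrences of dictionary words in a uniformly
    random text of the given length (assumption: independent, equally
    likely keys; overlapping occurrences are all counted)."""
    expected = 0.0
    for word in set(dictionary):  # count each distinct word once
        if not word:
            continue  # ignore empty strings
        positions = text_length - len(word) + 1  # possible start positions
        if positions > 0:
            # Probability that the word starts at one fixed position.
            expected += positions * (1 / alphabet_size) ** len(word)
    return expected


# Hypothetical tiny dictionary and a text of one million keystrokes.
words = ["cat", "dog", "monkey"]
print(expected_dictionary_hits(words, 1_000_000))
```

If the count observed for the monkey is close to this expectation, there is no reason to believe the monkey can spell.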
The dictionary of a spectrum for a given threshold, Dictionary of Spectrum, is simply the set of all peptides with a score of at least threshold against Spectrum. There will be many of these peptides, and as soon as we generate the dictionary for a given spectrum, we can reformulate the Expected Number of High-Scoring Peptides Problem as the following problem: find the expected number of peptides from Dictionary of Spectrum occurring in a decoy proteome of length n. Let's take one more step in reformulating this problem. We have reformulated its output; let's now reformulate its input. The new input is simply all peptides from Dictionary of Spectrum and an integer n. The output is the expected number of strings from the dictionary occurring in a decoy proteome of length n. Take a look at this problem. This is exactly the Monkey and the Typewriter Problem for a specific set of peptides, the one given by Dictionary of Spectrum for a given threshold.
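The reformulation above can be sketched in two steps: build the dictionary by filtering peptides on their score, then count expected occurrences in a random decoy proteome. Everything named here is an assumption for illustration. The scoring function is passed in by the caller, and `toy_score` is a deliberately meaningless placeholder rather than a real peptide-spectrum score. The decoy proteome is assumed to be a uniformly random string over the 20 standard amino acids, and occurrences are counted with overlaps.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def spectral_dictionary(candidates, score, threshold):
    """All candidate peptides scoring at least `threshold`.
    `score` is any caller-supplied function peptide -> number."""
    return {peptide for peptide in candidates if score(peptide) >= threshold}


def expected_decoy_hits(dictionary, n, alphabet_size=len(AMINO_ACIDS)):
    """Expected number of occurrences of dictionary peptides in a decoy
    proteome of length n (assumption: uniform, independent amino acids)."""
    total = 0.0
    for peptide in dictionary:
        positions = n - len(peptide) + 1  # possible start positions
        if peptide and positions > 0:
            total += positions / alphabet_size ** len(peptide)
    return total


def toy_score(peptide):
    """Placeholder score (NOT a real spectrum score): number of glycines."""
    return peptide.count("G")


# Hypothetical candidates: all peptides of length 1 or 2 over the letters G, A.
candidates = ["".join(p) for k in (1, 2) for p in product("GA", repeat=k)]
dictionary = spectral_dictionary(candidates, toy_score, threshold=1)
print(sorted(dictionary))
print(expected_decoy_hits(dictionary, n=100))
```

Enumerating candidates by brute force only works for toy examples like this one; the point is that once the dictionary exists, the spectrum itself is no longer needed to compute the expectation.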