So you've submitted your contigs to the annotation pipeline. You've submitted your job, and you've found out that your annotation is complete. How do you look at that job? How do you find out what kinds of files are there and what the next steps are? Well, first let's find the job. You can do that in a number of ways. You could go up to Workspaces and click My Jobs. Or you could come down here to the Job Monitor and click on that. That opens a page that shows all the jobs that you've ever run in PATRIC. I've run a lot, so I want to narrow down my search. So I click on this down arrow next to All Services and I click on Annotation, and that shows me all the annotation jobs that I've run in PATRIC. I find the one I want and I click on it, and you'll notice that this populates the vertical green bar with possible downstream actions. I can view the job, or I can report it if I have a problem with it. I want to view this job, so I'm going to click on View, and it takes me to the landing page for this particular job. Notice that across the top it shows me where this genome is placed in my private PATRIC workspace. There are also some hyperlinks to the genome, which we talk about in a different tutorial. The annotation job result gives me information about the job that was submitted. It says it was a genome annotation job and gives me some information about it. This is the job ID. Not the genome ID, the job ID. If you had a problem with this job, the developers would need this number to be able to figure out what went wrong. It tells me when the job started, when it ended, and how long it took. Generally, annotation jobs are pretty fast. And if I were to click on the parameters, it would show me all the things that I selected when I originally submitted the job. Now, in one of our other tutorials we talk about all of these downloadable reports, files, and folders. But this particular video is going to discuss the genome report HTML file.
So I'm going to click on the row that has it, and notice that it populates this vertical green bar with possible downstream actions. I'm just going to click on View. This genome report is pretty awesome. You will be amazed at the level of detail that's here. If you have questions or are confused about different aspects, because I'm just going to quickly go over each of these in this video, you can click on this tutorial link here at the top of the page. So first, let's start with this table, and it's saying results for this number. What is this number? You can see here that it is the genome ID, the unique identifier that's been assigned to your genome in PATRIC. There could be a number of genomes with this same exact name. This is the name I gave it when I submitted it, but there could be a number of genomes that have that name. Only one genome will have this particular identifier. And if I click on it, it will open up a new tab that will take me to the landing page for that particular genome. We describe all these different tabs, and what's in each of these tables, in other tutorials. I'm not going to talk about that today, because I want to go through the genome report. The reference genome is the ID of the highest-quality genome of the same species or genus that my genome, or the genome that you annotated, was compared to. The reference genomes are always public PATRIC genomes. If I click on this link, it will open up a new tab that leads me to the particular genome that was used as the reference for mine, and you'll notice it's the same species and strain. Several of the following scores are determined by a process that we use to identify and describe genome quality. That process is described in this publication by Parrello et al., "A machine learning-based service for estimating quality of genomes using PATRIC." So for any questions you have about it, I would suggest you go to this publication.
Their pipeline also uses CheckM, and CheckM is something that a lot of bioinformaticians use to determine genome quality. That's described in this paper. So let's start stepping through these. Coarse consistency. The coarse consistency is the percentage of roles, and by roles we mean genes, whose presence or absence was correctly predicted. A higher number indicates the genome annotation is more self-consistent. You may be thinking, what do we mean by self-consistent? By self-consistency, we mean that the genes that we expect to be found in the presence of other genes were actually found, and those that were expected to be absent were not found. I know, that's a lot of words. So let's think about it. An example would be, let's say you have a bacterium that has a flagellum, or that you know performs some biochemical function in a biochemical pathway. If you know the organism does that, you expect it to have the genes that enable it. A high self-consistency means that it has all the genes to make the flagellum or to make the pathway work. Likewise, if the organism does not have a flagellum, you would not expect it to have those genes. Fine consistency is the percentage of roles whose exact number of occurrences was correctly predicted. A higher number for fine consistency indicates that the genome is more self-consistent. A lower number means that the genome is less self-consistent. The fine consistency number is always less than or equal to the coarse consistency number. Completeness is the percentage of universal roles, or genes, that appeared in the genome. A higher number indicates the genome is more complete. A lower number indicates that genes normally found in this taxonomic group are missing. So this is a pretty good score. Contamination is an estimate of the percentage of the genome's DNA that does not belong, computed by locating the universal genes that occur more than once. A higher number indicates that the genome is contaminated.
A lower number indicates that the genome is relatively clean. Unlike the other three numbers, it's better if the contamination score is lower. Evaluation group is the taxonomic grouping that was used to determine the universal roles. You'll notice it's bright blue and underlined, indicating a hyperlink. If I click on that, it opens up a new tab that shows me the genus landing page, with all the information about Brucella and all the genomes that we have there. Contig count is the number of contigs in the genome. For a given assembly size, a lower number indicates a better-quality genome. Here it's 63. I know that Brucella have two chromosomes, a large chromosome and a small chromosome. So 63, I would prefer if it was smaller, but it is what it is. DNA size is the number of base pairs in the genome. This number provides context for the N50, which is the next row. The N50 is the number of base pairs in the smallest contig such that half the genome's DNA is in contigs of this size or larger. The closer this number is to the DNA size in a bacterial or archaeal genome, the better the quality of the genome. The contig L50 is the smallest number of contigs whose summed lengths make up half the genome size. These are all just measures that different people have established to determine the quality of a genome. The overpresent roles are the number of roles, or genes, that were found too many times. All of the roles in this set will be listed in the problematic roles report. We know a lot about Brucella. We actually know a lot about Brucella pinnipedialis, and we know what to expect in one of those genomes. If we have more of a certain gene, or unexpected genes, this is telling me we have too many of something. Likewise, underpresent roles are genes that we expect to be there, but they're found too few times, or maybe they're missing. This is the number of roles, or genes, that were found too few times, and they'll be listed down below in the potentially problematic roles.
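The N50 and L50 definitions above are easier to follow with numbers in hand. Here is a minimal Python sketch of both, using the definitions as stated in the video. The function name `n50_l50` and the contig lengths are made up for illustration and are not taken from the genome shown here or from the PATRIC code.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in base pairs."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest until the running sum
    # reaches at least half of the total assembly size.
    ordered = sorted(contig_lengths, reverse=True)
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 2 >= total:
            # length is the N50 (size of the contig that crosses the halfway
            # point); count is the L50 (how many contigs it took to get there).
            return length, count


# Hypothetical contig lengths, chosen only to keep the arithmetic easy.
lengths = [500, 400, 300, 200, 100]
n50, l50 = n50_l50(lengths)
print(f"N50 = {n50} bp, L50 = {l50} contigs")
```

With these made-up lengths the total is 1,500 bp, so the halfway point is 750 bp. The two longest contigs (500 + 400) cross it, which gives an N50 of 400 and an L50 of 2.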
Predicted roles are the total number of roles, or genes, that were examined by the quality pipeline that we described earlier. Completeness roles are the roles, or genes, expected to be singly occurring for the taxonomic level, or the completeness group, of that genome. So these are the genes that we expect to see only one time. This wouldn't be something like transposable element genes, which you often see peppered throughout the genome. Total distinct roles are the roles, or genes, that occur in the genome. Protein-encoding genes with functional assignment are the genes that generally have a good name, and we have an idea of what their function is within the organism. And then on the flip side there are protein-encoding genes without a functional assignment, something like hypothetical proteins. They have a good start codon and a good stop codon, but we really don't know what they're doing. The percent protein-encoding feature coverage is the percent of all genes that have a functional assignment. The percent of features that are hypothetical is the percent of all genes that are annotated as being hypothetical. And then we have the percent of features that are in the local protein families. We talked about this when you submitted your annotation job: PATRIC assigns two groups of protein families. The local families are the genus-specific families, and the global families are the cross-genus families. When you're looking at measures of the quality of your genome, you want to look not necessarily at the global families but at those genus-specific ones, and see how many of the annotated proteins are falling within the protein families that you expect for Brucella. So I think this is a pretty good number. Now this is where this report gets real. I mean, that was interesting, but this gets really interesting. The potentially problematic roles. Try saying that five times fast. This is the heart of the genome quality report.
The main table lists all the roles whose predicted number of occurrences was different from the actual number, along with an analysis of the individual features implementing those roles. The table has five columns. The first is the role. Role is the description of the functional role, or gene, that is potentially problematic. This corresponds to the value of the product column on the Features tab. So it's identifying the role that it has a problem with. The predicted count is the number of features, or genes, implementing the role as predicted by the quality tool. Let's start with this one, this 3-oxo-tetronate kinase. Looking across all Brucella, it was predicted that this genome, because it falls within the Brucella genus, would have one gene. This leads us to the annotated count, which is the actual number of features, or genes, annotated as implementing the given role. One is predicted, and my genome has two. The next column is the feature link. The feature link is a link for viewing the features implementing the role. If there are no features, there won't be anything in here. But if you click on this, it'll open up a page that shows the two genes that are identified as being a problem. It sometimes takes a little while to load. We'll come back to this in just a second. And then there's the comment column. The comment contains text that will help you determine why this role is problematic. If the comment includes something like "universal role," it means that the role is considered to be something that you would expect for this genome's taxonomic grouping. And look what it tells me. It tells me that here is this gene, and in PATRIC these are unique identifiers for each gene. "fig|" is how they always begin, and this number is the genome ID. Notice it's the same number as we saw up here. So here's the genome ID, and then it's saying that this is a gene, and this is the 357th gene annotated in this genome.
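If you ever need to pull the genome ID and the gene number back out of one of these identifiers, here is a minimal Python sketch. It assumes the layout `fig|<genome ID>.<feature type>.<number>`, with `peg` as the feature type for protein-encoding genes. The helper name `parse_feature_id` and the example ID are hypothetical, and the genome ID in the example is not the one from this video.

```python
import re
from typing import NamedTuple


class FeatureId(NamedTuple):
    genome_id: str      # e.g. "1234.5" (taxon number, then a version number)
    feature_type: str   # e.g. "peg" for a protein-encoding gene
    index: int          # position in annotation order, e.g. 357


# Assumed layout: fig|<genome id>.<feature type>.<feature number>
_FIG_ID = re.compile(r"^fig\|(\d+\.\d+)\.([A-Za-z_]+)\.(\d+)$")


def parse_feature_id(feature_id: str) -> FeatureId:
    """Split a fig-style feature ID into genome ID, feature type, and number."""
    match = _FIG_ID.match(feature_id.strip())
    if match is None:
        raise ValueError(f"not a fig-style feature ID: {feature_id!r}")
    genome_id, feature_type, index = match.groups()
    return FeatureId(genome_id, feature_type, int(index))


# Hypothetical identifier, made up for this example.
print(parse_feature_id("fig|1234.5.peg.357"))
```

The genome ID part is what ties a gene back to the genome landing page, which is why the same number shows up both in the report header and in every feature ID in the comments.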
The comment also tells me that this gene is on the second contig. And then it tells me that it's got another one right next to it. Notice that this is 357 and this is 358. They're on the same contig. And when you look at this one and compare it, notice this number here, 120576.3: that's the reference genome up here, which we opened earlier. This gene is closest to this gene in the reference genome. So my guess would be that there is either a sequencing error or this gene has been pseudogenized, because they're right next to each other. And if we look at it, here are the genes, and it tells me the protein families. I could click on this plus sign if I want to see how big the proteins are for those. And you can see that this one is big and this one is small, and I would guess that for this particular gene it would be something along the same size, so let's click on that. I want to look at the overview to see how big this gene is in PATRIC. It's 301, so that's interesting. I want to look more at this and see what's going on here, but we'll do that later on. This table tells me all the genes that had problems being annotated in the genome. I mean, you could spend a while looking into this and trying to figure out exactly what's going on here. As you scroll down, we have the potentially problematic roles by contig. This will show us each of the contigs. There were 66, I believe. Sorry for scrolling up on you. 63 contigs in this genome. And this is telling me that, of those contigs, these particular contigs have issues. I could click on this one. It's saying that this contig, which is about 350,000 base pairs long, a little bit larger than that, had 158 genes on it, and six are problematic. If I click on that, it's going to show me what those problematic genes are and where they are on the genome. And I can click on any one of these and it can take me directly to a feature page that shows me more information about that particular gene.
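To tie the table back to the scores at the top of the report, here is a minimal Python sketch of the comparison between predicted and annotated counts. It only illustrates the definitions given in this video. The real service produces its predicted counts with the machine learning predictors described in the publication, which this sketch does not attempt to reproduce. The function name `summarize_roles` and the role names and counts are made up, except that the first row mirrors the one-predicted, two-annotated case from the table.

```python
def summarize_roles(roles):
    """Compare predicted and annotated counts per role.

    roles maps a role name to a (predicted_count, annotated_count) pair.
    """
    if not roles:
        raise ValueError("need at least one role")
    # Overpresent: found more times than predicted. Underpresent: fewer.
    over = [name for name, (pred, ann) in roles.items() if ann > pred]
    under = [name for name, (pred, ann) in roles.items() if ann < pred]
    # Coarse: was presence or absence predicted correctly?
    coarse_hits = sum((pred > 0) == (ann > 0) for pred, ann in roles.values())
    # Fine: was the exact number of occurrences predicted correctly?
    fine_hits = sum(pred == ann for pred, ann in roles.values())
    total = len(roles)
    return {
        "overpresent": over,
        "underpresent": under,
        "coarse_consistency": 100.0 * coarse_hits / total,
        "fine_consistency": 100.0 * fine_hits / total,
    }


# Hypothetical roles and counts as (predicted, annotated).
example = {
    "role A": (1, 2),  # one predicted, two annotated: overpresent
    "role B": (1, 0),  # expected but missing: underpresent
    "role C": (1, 1),  # exactly as predicted
    "role D": (0, 0),  # correctly predicted to be absent
}
print(summarize_roles(example))
```

Every role that counts toward fine consistency also counts toward coarse consistency, because an exact match in count implies a match in presence or absence. That is why the fine number can never be higher than the coarse number.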
And so that ends the genome report. I think you can agree with me that there's a lot of detail in this. It only tells you the things that are really wrong with your genome, but you know a lot of things are right with the genome too, because it had good consistency and good completeness. So I encourage you to look at it in detail. And especially if it's genes that you're interested in that appear to be broken, you may want to dig deep into this, because that could be something interesting that you could say about your genome. Next time we'll talk about the other files that you can get whenever you annotate a genome in PATRIC.

It's time for your third annotation assignment. This one is on downloading documents when your job is complete. I want you to get comfortable with knowing what's there, looking at the files, and figuring out if you need or want any of them. These are the typical download files that are available for any PATRIC job. I've starred the ones that I think you should go ahead and download, or even open and view. Every one that's got a .txt is a text file, and you can view those within the workspace. The other stuff you'll need to download. I would suggest that you open them all as text documents so that you can see them. That'll give you a good idea of what's there. I'm not asking you to download the tar zip file, because that's just got everything that's here. There's a nice text file and an Excel file that show the different genes, which I'm quite frankly a big fan of. The genome report I'm a big fan of too, and you can view that within the viewer as well. I'm not asking you to open the alignments, which are just the alignment of one gene, or any of the JSON files. Unless you're a real computer person, nobody wants to see those. So go ahead, download them, take a look. The more knowledge you have, the better informed you are for downstream decisions. Good luck, bye.