In this lesson we will discuss the biomaRt package. The biomaRt package is an interface from R to a so-called BioMart. A BioMart is a front end for a biological database. The idea is that any database out there that wishes to expose its information to the Internet can set up a BioMart interface to the database. Users like us can then access many different databases using a common set of tools. Multiple resources have set up BioMart interfaces to their data, for example Ensembl from the European Bioinformatics Institute, the HapMap project, and UniProt. So biomaRt uses this BioMart interface to query a database, and what you can get out of the database of course depends on what is in it. But it turns out that there are a couple of databases accessible through BioMart that contain a wealth of information.
So, we will start by loading the package. The first thing you do when you set up biomaRt is choose a database. That consists of choosing a database and then choosing something known as a dataset inside the database. In BioMart lingo, a database is called a mart and a dataset is called a dataset. There is a function called listMarts that shows which marts are available. This is something that is continuously updated. Here I show the first six of them. EBI is heavily represented, and a very common database is Ensembl Genes 81, so we are going to pick that. We call useMart on ensembl, and if we print the result, we are just told that we have picked a database but have not picked a dataset. Inside a database there are multiple datasets, so let's have a look. Again we print the first six datasets. These are genes, and they are from a lot of different organisms, so each organism is its own dataset.
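The steps just described can be sketched roughly as follows. This assumes the biomaRt package is installed from Bioconductor and that the Ensembl server is reachable. The variable names `mart` and `datasets` are my own choice, and the marts and version numbers you see will differ from the ones mentioned here, since the list is continuously updated.

```r
library(biomaRt)

# List the marts (databases) currently on offer; this output changes over time.
head(listMarts())

# Select the Ensembl genes mart, without choosing a dataset yet.
mart <- useMart("ensembl")
mart  # printing reports the chosen mart and that no dataset is selected

# Each organism is its own dataset inside the mart.
datasets <- listDatasets(mart)
head(datasets)
```

Printing `mart` at this stage is a quick way to confirm that only the database, and not yet a dataset, has been selected.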
If you look down the list, you will of course find human, which is what I think a lot of people are interested in. So now I have a pointer to the database. It's called ensembl: it's the Ensembl mart and the hsapiens_gene_ensembl dataset. At this point it's important to emphasize that whatever you access from the database changes over time, and Ensembl is very aggressive in pushing out new versions. We can see on the screen that we are currently at version 81, and at any given time a new version can come out. It's possible, using, shall we say, more complicated setups, to access old versions of the database. But I highly recommend that you query the database, save whatever comes out, and keep that going forward.
Before we go on, let us look a little bit at the database from the Internet, since it's possible to access the database by going to the Ensembl website. So I'm switching over to Safari, and here I'm basically at ensembl.org. The first thing it asks us to do is choose a database. We pick Ensembl Genes 81, and here we have a list of datasets, with the most common species up at the beginning. We can actually see that when we are querying Homo sapiens genes, the genome version is human build 38. So Ensembl in general is very aggressive about pushing the latest and greatest. Once you pick the dataset, you build a query, which means you ask for some data from the dataset. That consists of picking something called filters, attributes, and values, and we can discuss that when we go back. I find it useful sometimes to go to the database through the web interface to construct my query.
Let's go back to RStudio. We're going to do a little example here: we're going to say that we have some identifiers from an Affymetrix gene expression array.
If you remember from earlier sessions, we have discussed Affymetrix gene chips. What you get out of such a gene chip is something Affymetrix calls a probe ID, and this probe ID needs to be translated into a gene. So I'm going to say that I have a list of Affymetrix probe IDs, and I want to get back the gene associated with each of these probe IDs. A very simple query. I'm going to start off with something I call values, with three probe IDs. In reality, of course, you would have 12,000 probe IDs here, or however many probe sets you have on the array. The query is run using the main workhorse of biomaRt, getBM (get, capital B, capital M), and you put in some attributes and some filters.
So what do we have here? We have some attributes. Attributes are whatever you want to retrieve from the database. Filters are a way of selecting what you want to retrieve, and values are the actual values of the filters. So what does this mean? Well, first of all I have a filter and an attribute called affy_hg_u133_plus_2. That is the name of the microarray; this is an Affymetrix microarray. Why do I have it in both filters and attributes? You see in the returned object that I basically have a data frame with two columns. One is the gene ID and one is the Affymetrix probe ID, and I need both of these values in order to link the probe ID to the gene. If I didn't ask to get the probe ID returned by putting it in the attributes, I would just get a list of Ensembl gene IDs, but I wouldn't know which ID maps to which probe set. So that was the attributes.
For the filters, in this case I want to give a set of values, which was my little vector of identifiers, and these are values that will be interpreted in the context of the affy_hg_u133_plus_2 filter. I only want to retrieve data that is associated with these particular probe IDs. So this is how you set up a query.
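As a sketch of the query described above, assuming biomaRt is installed and the Ensembl server is reachable: the three probe IDs are illustrative values I am assuming for the HG-U133 Plus 2 array, not necessarily the ones used in the lecture, and the output file name is hypothetical.

```r
library(biomaRt)

# Point at the human genes dataset in the Ensembl mart.
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Illustrative probe IDs (assumed); a real analysis would pass every
# probe set on the array.
values <- c("202763_at", "209310_s_at", "207500_at")

result <- getBM(
  attributes = c("ensembl_gene_id", "affy_hg_u133_plus_2"),  # columns to return
  filters    = "affy_hg_u133_plus_2",                        # what to select on
  values     = values,                                       # values for that filter
  mart       = ensembl
)
result

# Results change between Ensembl releases, so keep a copy of what came back.
# The file name is a hypothetical choice.
saveRDS(result, "probe_to_gene.rds")
```

Asking for the probe ID as an attribute as well as using it as a filter is what lets each returned gene ID be matched back to its probe set.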
You select attributes, you select filters, you put in values. The real power here is in figuring out what kind of query to make and how to combine attributes with filters to get useful information out. To figure out what I can really do, there are a couple of functions: listAttributes and listFilters. If you just call listAttributes, you actually get a rather substantial number of attributes back; there are 1,210 attributes. We can see here that there is a name, which is what I'm going to put into my getBM command, and then there is a description of what comes out. Some of these are obvious and some are a little more arcane, and sometimes we need to go to the Ensembl website to understand what is going on.
When you look at the bottom of the attributes... that's not a good example. So let's say tail with n = 100. Okay, so that's not it either. Let's take the last 500. You're going to see in here that there are a lot with gene IDs from organisms that are not human. This has to do with the ability to take a human gene ID and convert it into the matching gene in a different organism, the ortholog. So basically, for every species Ensembl has access to, there are a number of columns that translate genes from one species into another. That's a bit of a pain to look at in the output because it hides a lot of useful information, in my opinion. But you go through the list and you find your attributes.
In the same way for filters: you can use listFilters and look at what they are, and there are somewhat fewer filters. So setting up a query means going through all of this and understanding it, and it can get rather daunting. Now, one thing that helps a little is that the attributes, of which we had around 1,200, are organized into something called pages.
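The exploration of attributes and filters described above might look roughly like this. It assumes biomaRt is installed and the Ensembl server is reachable. The variable names `attrs` and `filters` are my own, and the counts you get will differ from the 1,210 mentioned, since they change between releases.

```r
library(biomaRt)

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Everything that can be retrieved from this dataset.
attrs <- listAttributes(ensembl)
nrow(attrs)            # the count varies by Ensembl release
head(attrs)            # 'name' is what goes into getBM; 'description' explains it
tail(attrs, n = 500)   # the tail end of the table, as inspected in the lecture

# Everything that can be used to select rows.
filters <- listFilters(ensembl)
nrow(filters)
head(filters)
```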
Pages are one of those things that are internal to the database structure, but that you sometimes get exposed to as a user. Okay, so let us just say the attributes are grouped into pages. I can list the different pages, which are the feature page, structure, homologs, SNPs, and so on. The homologs are all the things I was ranting about a few minutes ago, where I didn't like the amount of output. So I can ask for a list of attributes on a specific page only, and now I have something that is usually a little more tractable to look at and contains the things I really want. Here we can see a lot of Affymetrix IDs and things that make sense: miRBase, translation, transcript, and so on. All things that we are usually interested in.
So what's up with these pages? Well, you cannot make a query with attributes that belong to different pages. In order to link things together, there are some attributes that belong to more than one page. If you try to make a query where you want attributes that span more than one page, you get an error back saying that you tried to do this, and you have to construct your query in such a way that you don't. The way to get around that is to query the pages individually and then merge the results back together. You get a set of results for each page you query, and then you merge them based on some kind of identifier.
Finally, sometimes when you use biomaRt you can get duplicate rows back in a data frame, and that is a consequence of how the database works internally. You should just know it, because sometimes it's a little surprising.
So you can do a lot with biomaRt, and it has a really good vignette that I highly encourage people to read. It has, I think, around ten increasingly complicated tasks, where they show how to construct a query and look at the output from biomaRt. So this is an awesome package. It was quite a boost to Bioconductor, and to the people who use Bioconductor, when this package was introduced, by now a long time ago. It's been in heavy use, and you'll find that a lot of people make use of it.
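To make the discussion of pages, merging, and duplicate rows concrete, here is a minimal sketch. The first half assumes biomaRt is installed and the Ensembl server is reachable. The second half uses two small made-up data frames as stand-ins for the results of two single-page queries, so the gene identifiers, symbols, and exon counts in them are hypothetical and not real Ensembl output.

```r
library(biomaRt)

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Names of the attribute pages for this dataset.
attributePages(ensembl)

# Restrict the attribute listing to a single page.
feature_attrs <- listAttributes(ensembl, page = "feature_page")
nrow(feature_attrs)
head(feature_attrs)

# Toy stand-ins (made-up values) for the results of two single-page queries.
# page1 deliberately contains a duplicated row, as biomaRt results sometimes do.
page1 <- data.frame(gene_id = c("GENE_A", "GENE_A", "GENE_B"),
                    symbol  = c("symA", "symA", "symB"))
page2 <- data.frame(gene_id = c("GENE_A", "GENE_B"),
                    n_exons = c(4, 7))

# Drop duplicate rows, then merge the two results on the shared identifier.
merged <- merge(unique(page1), unique(page2), by = "gene_id")
merged
```

Using base R's `unique` and `merge` here is just one option; the important part is that both results carry an identifier column to merge on, which is why that attribute has to be requested in each query.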