Hey, welcome back! Let’s continue to learn about bioinformatic resources.
In this Unit, let’s look at another centralized resource, the UC Santa Cruz Genome Bioinformatics.
The UCSC Genome Bioinformatics has several useful resources. The main part of it is the UCSC Genome Browser.
In addition, it is also the portal for the ENCODE project and the Neandertal sequencing project.
Let’s look at the UCSC Genome Browser first.
The UCSC Genome Browser uses the genomic sequences as the backbone to integrate genomic and genetic data.
As of the end of 2013, it has genomic and genetic data and annotations for 46 mammals, 18 other vertebrates,
13 insects (11 of which are different Drosophila species), 6 nematodes, and 3 different deuterostomes.
For some reason it doesn’t include plants or microbes.
The UCSC Genome Browser has the most extensive genomic annotations for mammals, especially human.
This is a typical screen in the UCSC Genome Browser.
All the annotations are aligned to the genome sequence using the genomic coordinates as the backbone.
You can specify where on the genome you want to look by searching with a gene name or ID, a chromosomal coordinate range, or keywords.
Once you land at a genomic location, you can move in the 5’ or 3’ direction or zoom in and out.
One word of caution is that,
whenever you communicate with someone else about genomic coordinates, make sure you specify the version of the assembly you are referring to.
The so-called reference genome is not static. It gets updated once in a while as new sequencing data becomes available.
Most people now use GRCh37 (called hg19 at UCSC), but some people and some programs still use hg18, which has different coordinates.
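To make that word of caution concrete, here is a minimal Python sketch of keeping the assembly name attached to every coordinate range. The names `GenomicRegion`, `parse_region`, `overlaps`, and `browser_url` are made up for this illustration, the coordinates are arbitrary, and the link format (the `hgTracks` page with `db` and `position` parameters) is an assumption based on how Genome Browser links commonly look.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class GenomicRegion:
    """A coordinate range that always carries its assembly name (hypothetical helper)."""
    assembly: str  # e.g. "hg19" or "hg18"
    chrom: str
    start: int
    end: int

    def __str__(self):
        return f"{self.assembly} {self.chrom}:{self.start:,}-{self.end:,}"


# Accepts position strings such as "chr1:1,000,000-1,050,000".
_POSITION = re.compile(r"^(chr\w+):([\d,]+)-([\d,]+)$")


def parse_region(assembly, position):
    """Parse a position string and tag it with the assembly it refers to."""
    match = _POSITION.match(position.strip())
    if match is None:
        raise ValueError(f"unrecognized position: {position!r}")
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not be greater than end")
    return GenomicRegion(assembly, match.group(1), start, end)


def overlaps(a, b):
    """Compare two regions, refusing to mix coordinates from different assemblies."""
    if a.assembly != b.assembly:
        raise ValueError(f"cannot compare {a.assembly} with {b.assembly} coordinates")
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def browser_url(region):
    """Build a Genome Browser link (URL pattern is an assumption, see above)."""
    query = urlencode(
        {"db": region.assembly,
         "position": f"{region.chrom}:{region.start}-{region.end}"},
        safe=":",
    )
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + query


region = parse_region("hg19", "chr1:1,000,000-1,050,000")  # arbitrary example range
print(region)
print(browser_url(region))
```

The point of the design is simply that a coordinate range never travels without its assembly, so comparing an hg18 range with an hg19 range fails loudly instead of silently giving a wrong answer.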
Below the genomic coordinates, the annotations are shown as “Tracks”.
For instance, the “RefSeq Genes” track shows the genes that are encoded in this genomic region. Tall bars indicate exons.
Arrows indicate whether the gene is on the plus or minus strand.
Data from the ENCODE project are shown as tracks, such as the transcription factor binding sites from ChIP-seq experiments.
Epigenetic data are also available as separate tracks such as DNA methylation, histone modification, and DNase I hypersensitivity.
An important feature of the UCSC Genome Browser is the conservation profile that it calculates across species.
Here, for the human genomic region you are looking at, the sequence alignment and evolutionary conservation across 100 vertebrate species are shown.
The alignments were done by lastz and Multiz and the levels of evolutionary conservation were calculated by phastCons and phyloP.
Such large multiple alignments are extremely computationally intensive and impossible for small labs to do.
It took the staff at UCSC 15 years of CPU run-time (not human time) in 10 million individual “jobs” and 100 cluster runs.
Finally, what is not shown on this slide are lots of other useful tracks on repetitive elements, genetic variation, known disease or phenotype, and so on.
As of the end of 2013, the UCSC Genome Browser integrates over two hundred tracks of data onto the whole genome sequences
including expression, variation, conservation, and so on. Each track consists of many experiments.
So the data is quite rich.
The UC Santa Cruz Genome Bioinformatics is the data portal for the ENCODE project. ENCODE stands for the ENCyclopedia Of DNA Elements.
It is a large collaborative project funded by the National Human Genome Research Institute (or NHGRI) of NIH
to identify and analyze all functional and regulatory elements in the human genome.
ENCODE data is freely available at UCSC for download. The data can be browsed through the UCSC Genome Browser which I showed you earlier.
Another resource available from the UCSC Genome Bioinformatics group is the Neandertal Genome.
The Neandertals are the closest extinct relatives of humans.
They lived from several hundred thousand years ago until their disappearance approximately 30 thousand years ago.
The DNA was extracted from three Neandertal bones discovered in a cave in Croatia.
The Neandertal genome sequence provides important data for studying human evolution.
The data can be accessed as tracks on the UCSC Genome Browser on the human genome page that I showed you earlier.
If you scroll down that page, you will see lots of tracks that you can expand and load onto the browser.
One group of tracks is the Neandertal Assembly and Analysis tracks.
You can expand to see all the tracks available for the Neandertal genome. Then you can select how you’d like the data to be displayed.
Now let’s look at a very useful software tool, BLAT, developed by Jim Kent at the UCSC Genome Bioinformatics group.
BLAT can find the genomic locations of the gene or protein sequences that you input.
It was designed to quickly search the genome for sequence segments with 95% or greater DNA sequence similarity to a query DNA sequence,
or with 80% or greater amino acid sequence similarity to a query protein sequence.
It supports exon-intron structures. BLAT has both an online web-server version and a standalone version available for download.
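To get a feel for what the 95% and 80% figures mean, here is a small Python sketch that computes percent identity for two sequences that are already aligned and checks it against those thresholds. This is not how BLAT itself searches or scores. The function names are hypothetical, and treating "similarity" as plain identity over the non-gap columns is a simplifying assumption.

```python
def percent_identity(aligned_query, aligned_target):
    """Percent of identical residues over the aligned, non-gap columns."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for q, t in zip(aligned_query.upper(), aligned_target.upper()):
        if q == "-" or t == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if q == t:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned columns to compare")
    return 100.0 * matches / compared


def meets_design_threshold(aligned_query, aligned_target, is_protein=False):
    """Check against the thresholds quoted in the lecture: 95% for DNA, 80% for protein."""
    threshold = 80.0 if is_protein else 95.0
    return percent_identity(aligned_query, aligned_target) >= threshold


query = "ACGTACGTACGTACGTACGT"
target = "ACGTACGTACGTACGTACGA"  # one mismatch in 20 bases
print(percent_identity(query, target))
print(meets_design_threshold(query, target))
```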
Another useful tool is in-silico PCR that searches a sequence database with a pair of PCR primers that you input to find all the matches,
so that you can make sure that your primers can get you the desired sequences and nothing else.
It also calculates the melting temperature.
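The idea behind in-silico PCR can be sketched in a few lines of Python. This is a naive illustration, not the UCSC tool's algorithm: it only finds exact matches, it only checks the orientation where the forward primer matches the given strand, and the function names and the `max_product` default are assumptions made for the example. For the melting temperature it uses the simple Wallace rule of thumb, 2 degrees for each A or T and 4 degrees for each G or C, which may well differ from what the UCSC tool reports.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_all(text, pattern):
    """Return the start index of every (possibly overlapping) exact match."""
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)
    return positions


def in_silico_pcr(template, forward, reverse, max_product=4000):
    """List (start, end, product) for every exact primer-pair match on the template."""
    template = template.upper()
    fwd = forward.upper()
    # The reverse primer anneals to the given strand, so on that strand
    # its site reads as the primer's reverse complement.
    rev_site = reverse_complement(reverse)
    products = []
    for f in find_all(template, fwd):
        for r in find_all(template, rev_site):
            end = r + len(rev_site)
            if r >= f + len(fwd) and end - f <= max_product:
                products.append((f, end, template[f:end]))
    return products


def wallace_tm(primer):
    """Rule-of-thumb melting temperature in degrees Celsius for a short primer."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


template = "CCATGCGTAAAAGCCTAAGG"  # toy sequence, not real genomic data
hits = in_silico_pcr(template, forward="ATGCGT", reverse="TTAGGC")
print(hits)
print(wallace_tm("ATGCGT"))
```

If the list of hits has more than one entry, the primer pair would amplify more than one product, which is exactly what you want to rule out before ordering primers.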
Now that you have looked at three large centralized resources, in the next unit, let’s look at some examples of individual resources.