Next I wanna show you how to view the top or bottom of a file.
This is particularly useful for dealing with genomic files,
since they're usually very large,
and a regular text editor either won't open them or
dramatically slows down your computer.
But genomics files usually contain structured data,
so we can view only the first few lines or last few lines to get
a sense of what the data looks like without opening the entire file.
The commands are head and tail.
Using head followed by how many lines you want to see, followed by the name
of the file, we will get the first few lines shown in the terminal.
For example, here I wanna
see the first five lines of this mouse genome file.
It's a bit hard to tell how many lines are here since they're wrapped, but
if we count the genes we can
see it's five. Same as head, when we're using tail,
we're showing the bottom five lines of the text file.
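
To make that concrete, here is a minimal sketch of the two commands. The file name demo_annotation.txt and its contents are made up for illustration; they stand in for the mouse genome file on screen. I'm writing the line count with the -n flag.

```shell
# Build a tiny, made-up annotation file (hypothetical name and contents).
workdir=$(mktemp -d)
demo="$workdir/demo_annotation.txt"
for i in 1 2 3 4 5 6 7 8; do
    printf 'Gene%s\ttx%s\tchr1\n' "$i" "$i"
done > "$demo"

# Show the first five lines of the file.
head -n 5 "$demo"

# Show the last five lines of the file.
tail -n 5 "$demo"
```

Neither command loads the whole file into an editor, which is why they stay fast on very large files.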
One of my favorite Linux commands when
dealing with large files is the command that counts how many lines
there are in the file.
You remember how annoying it is to open a large table in Excel and
drag your mouse down to the end just to see how many rows there are.
While in Linux, you can just type wc -l followed by the filename and
the number of lines shows up instantly.
Here we have 30,000 something transcripts in the mouse genome.
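
As a small sketch of the line count, assuming a made-up three-line file (the name demo_annotation.txt and its contents are hypothetical, not the real mouse annotation):

```shell
# Build a tiny, made-up annotation file (hypothetical name and contents).
workdir=$(mktemp -d)
demo="$workdir/demo_annotation.txt"
printf 'Brca1\ttx1\nFut1\ttx2\nFut2\ttx3\n' > "$demo"

# Prints the line count followed by the file name.
wc -l "$demo"

# Reading from standard input prints the count alone, which is handy in scripts.
n_lines=$(wc -l < "$demo")
echo "lines: $n_lines"
```

One thing to keep in mind: wc -l counts newline characters, so a last line that does not end with a newline is not counted.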
The last thing I want to show is the grep command. It's extremely handy for
handling simple text files.
For example, in this genome annotation file each line contains a transcript,
so a single gene
has multiple lines if it has more than one transcript.
Now, I want to see all transcripts for Brca1, and
I can use the grep command to extract all lines that match the string Brca1.
It turns out, Brca1 has only one transcript.
Using grep, we don't need to give the full name of the gene.
For example, if I want to see all transcripts of genes
that start with Fut, I can type grep Fut, with no numbers,
and I'll get a full list of genes beginning with Fut.
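
Here is a rough sketch of those two searches. The file name and contents are made up for illustration, and I'm assuming the gene name sits in the first column, which may not match the real annotation file. Note that grep is case-sensitive by default, so the pattern has to be spelled the way the gene names are written in the file.

```shell
# Build a tiny, made-up annotation file (hypothetical name and contents).
workdir=$(mktemp -d)
demo="$workdir/demo_annotation.txt"
printf 'Brca1\ttx1\nFut1\ttx2\nFut2\ttx3\nFut2\ttx4\nTrp53\ttx5\n' > "$demo"

# Print every line that contains the string Brca1.
grep Brca1 "$demo"

# A partial name matches any line containing Fut, wherever it appears.
grep Fut "$demo"

# Anchoring with ^ only matches lines that begin with Fut.
grep '^Fut' "$demo"
```

The unanchored pattern matches Fut anywhere in the line, so the ^ version is the safer way to ask for names that really start with it.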
As I mentioned, grep has a lot of parameters we can play with.
Where to find them?
As a matter of fact, most Linux commands come with a set of parameters.
To use them, we need to read the documentation about the commands.
Fortunately this can be done in the terminal.
This is what Linux users
call a man page. Simply type man followed by the name of the command and
the full documentation about its usage will show up in the terminal.
Name, synopsis, description, etc.
You can type Q to quit the man page and get back to the terminal.
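
To give a flavor of what those parameters do, here is a small sketch with two standard grep options, -c and -i. The file is again a made-up stand-in, not the real annotation file.

```shell
# Build a tiny, made-up annotation file (hypothetical name and contents).
workdir=$(mktemp -d)
demo="$workdir/demo_annotation.txt"
printf 'Brca1\ttx1\nFut1\ttx2\nFut2\ttx3\n' > "$demo"

# -c prints the number of matching lines instead of the lines themselves.
grep -c Fut "$demo"

# -i ignores case, so a lowercase pattern still matches Fut1 and Fut2.
grep -i fut "$demo"

# The full list of options is in the manual: run "man grep" in a terminal.
```

These are only two of the options; the man page lists the rest.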
So to remind you, this is a screenshot of the commands.
And if you are still cloudy about these user-unfriendly commands,
get a new laps view like this, which has all the essentials printed upside down.
The last prep before we start the pipeline is about formats, the bio formats.
As we know,
sequencing data are very large, so they're stored in specific formats.
The most frequently used format is FASTQ.
For each short read,
four lines of information are written in the FASTQ file.
The first line starts with @, it's the sequence ID.
It's usually generated by the machine.
And the second line contains the sequence of the short read in ACTG codes.
The third line starts with the plus symbol, and it's optional.