Got some sequencing data? Many powerful tools to analyse them are based on the command line and this is part of a series of short but essential posts that help you getting started. I assume that you are working on a UNIX-based operating system (‘Mac’ or ‘Linux’ computer).
If you are new to the command line you might wonder how to open files to look at their content. Most files that result from ‘Next Generation Sequencing’ analysis-pipelines contain thousands of lines and/or columns of structured text. They can be huge in size and are generally not meant to be opened with conventional GUI programs. The commands that I introduce here equip you with powerful tools to look at and edit small to large data files, and to rapidly extract and summarize information from them - like, for example, the number of sequences in a fasta file.
The most common use of the cat
command is to display the content of
text files by typing the name of the file behind the cat
command and
hitting ENTER.
cat Testfile.txt
This command will display the content of Testfile.txt to the
screen. It does not make sense to display the content of very large
files in this way, but with the >
and >>
operators, you can
rapidly combine two files, even huge ones. For example, in the
following command Testfile1.txt
and Testfile2.txt
are combined to
CombinedFile.txt
.
cat Testfile1.txt > CombinedFile.txt cat Testfile2.txt >> CombinedFile.txt
The >
operator redirects the output of the cat Testfile1.txt
command (the content of Testfile1.txt
) to CombinedFile.txt
. The
>>
operator adds the output of the cat Testfile2.txt
command to
the CombinedFile.txt
. If we would use the >
operator instead of
the >>
operator in the second line, the content of
CombinedFile.txt
would be overwritten, not appended. So, the >
operator (over)writes content to a specified file while the >>
operator appends content to a specified file. If you use the >>
operator, the specified file needs to exist already.
The purpose of the three tools less
, more
, and most
is pretty
similar: to display the content of text files one line after the
other. Since they don’t try to read the entire content of a file at
once, they are very useful to look into large files.
Most likely, less
and more
are by default installed on your
system. To install most
, type
sudo apt-get install most
Enter your user password and hit ENTER.
Historically, more
is the oldest tool. Compared to the other two, it
only allows to scroll forward in a file, but not backward.
For all three tools you need to write the file name after the command to display its content. Try them out and compare the behavior of the three tools.
more Testfile.txt less Testfile.txt most Testfile.txt
The most
and less
commands give you many options to navigate
through and search for patterns in the text files. Here I show some
examples of useful options for the less
command.
Once you opened a fasta file , for example, with less
…
less Fastafile.fasta
… you can search for patterns, like the nucleotide sequence ‘GCTC’, with /
, like
/GCTC
With hitting n
you can repeat this search on the next lines in the file.
To show only those lines in the file that match the nucleotide
sequence ‘GCTC’, type this sequence after the &
sign:
&GCTC
To go to the last line of the file, just type G
, to go to the first
line, type g
. To close the file again, hit q
.
The less
command has more options than this. You get an overview of
these with the following command
less --help
If you want to get a rapid overview of the number of lines in a file,
the wc
command is the right tool. In output-files for example, where
every line represents a sequence, wc
is all you need to count the
number of sequences.
wc -l File.txt
The -l
option specifies that you want to count the number of
lines. The -m
and -w
options further allow you to count the number
of characters or words, which is a handy tool to check if the abstract
of your manuscript exceeds the upper limits of your target journal.
nano
allows to open and edit small text files from the command
line. It is not meant to open big files, like for example raw fastq
files that were obtained from Next Generation Sequencing. The name of
the file you want to open has to follow the nano
command. For
example,
nano Testfile.sh
Once you hit ENTER, Testfile.sh
will be opened and you can scroll
through it, and compared to the tools we looked at before, you can
edit the content of the file by deleting and adding text. At the
bottom of the terminal window you see some shortcuts for certain
actions. For example ^O WriteOut
or ^X Exit
. The ^
indicates
that you need to press CTRL+O or CTRL+X. Just open a file and try it
out.
grep
allows you to search for patterns in a text file. All matching
lines are written out to the terminal; or can be re-directed to a file
with the >
operator (described in the introduction section of
cat
).
This enables you, for example, to search for a certain sequence-id in a fastq file.
grep "gi|524845790|gb|AGR34129.1|" Fastafile.fasta
This command searches for the heat shock protein Hsp90 [Daphnia pulex]
with the gene-id ‘gi|524845790|gb|AGR34129.1|
’ in Fastafile.fasta
.
Since the gene-id search pattern as special characters (|
), you need
to enclose it in quotation marks (="=). The same applies when the
search pattern contains any spaces. The command above prints out the
line in which it found the gene-id.
>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex]
You can apply the search to more than one file at a time. To search for the gene-id in all fasta files in your present working directory, you could type
grep -n "gi|524845790|gb|AGR34129.1|" *.fasta
The asterisk *
stands for ‘any character’. The -n
option adds the
line numbers, that have matched the search pattern, in front of the
output lines. As I had two fasta files in my current folder,
alldata.fasta
and besthits.fasta
, the output is
alldata.fasta:131:>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex] besthits.fasta:35:>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex]
The gene-id was found in alldata.fasta
on line 131 and in
besthits.fasta
in line 35. You can also use grep
with the -v
option to inverse the search and find all lines that don’t match the
your search-pattern.
grep
can do one more thing that is very useful: it can count the
number of lines that your search pattern was found. In fasta files,
each sequence record starts with a ‘>’ sign. Thus, to count the number of
sequences in a fasta file, you can use the following command
grep -c ">" alldata.fasta
64
The output is 64; meaning that the alldata.fasta
file contains 64
sequences.
The head
command, followed by the name of a text file, prints by
default the first 10 lines/rows of the file to the terminal. The -n
option allows to determine the number of rows that shall be
printed. For example, to extract from a fasta file the first
sequence-id along with the nucleotide or peptide sequence, you can
select the first two lines with:
head -n 2 Fastafile.fasta
The output, in this example, is:
>6libTrinity|comp1261_c0_seq1| MINNLGTIAKSGTKAFMEALSAGADISMIGQFGVGFYSAYLVADKVTVTSKHNDDEQYIWESSAGGSFTIKQDNSEPLGRGTKIVLQMKEDQAEYIEEKKVKEIVKKHSQFIGYPIKLMVQKEREKEVSDDEAEEEKKEETKEEPKIEDVGEDEDADKEDGDKKKKKTIKEKYTEDEELNKTKPIWTRNADDISSEEYGEFYKSLTNDWEEHLAVKHFSVEGQLEFRALLFIPKRAPFDLFENKKSKNNIKLYVRRVFIMDNCDEIIPEYLNFVRGVVDSEDLPLNISREMLQQNKILKVIRKNLVKKVMELIDEIAEDKDNYKKFYEQFSKNMKLGIHEDSTNRKKLAGHLRYFTSASGDEMCGLSDYVSRMKENQKDIYYITGESKDVVGTSSFVETLKKRGLECIYMTEPIDEYVVQQLKEFDGKNLVSVTKEGLELPEDEEEKKKKEADKEKFEPLCKVMKDILDKKVEKVVVSSRLVSSPCCIVTSQYGWTANMERIMKAQALRDTSTMGYMAAKKHLEINPEHSIIENL
When the line number K
is preceded with -
, then all but the last K
lines are printed. For example, the command to print all but the last
ten lines from a fasta file is:
head -n -10 Fastafile.fasta
The tail
command, in contrast, prints by default the last 10 lines
of a file to the terminal. Also here you can select the number of
lines with the -n
option. The following command selects the last 5
lines from Fastafile.fasta
.
tail -n 5 Fastafile.fasta
When the line number K
is preceded by a +
, then all but the first
K
lines are printed. This can be useful to exclude one or several
header lines with meta-information from a file. For example, to
exclude the first three lines from a text file, you can use:
tail -n +4 Textfile.txt
If the stored data are tabular cut
can extract single columns (with
the -f
option) from it. By default, cut
expects your columns to be
TAB delimited. If your columns are instead limited by commas or single
spaces, you can specify this with the -d
option. To extract the fourth
and fifth column from a comma-separated file, you would use the
following command:
cut -d "," -f 4,5 File.csv
By default, cut
prints all lines that do not contain the specified
delimiter. The -s
option allows to exclude these lines from the
output. Here’s the command to extract columns 1 to 6 from File.csv
and disregard all lines not containing a =,= as delimiter.
cut -d "," -f 1-6 -s File.csv
The -c
option allows you to select only a certain range of
characters, instead of specifying which columns to select. So, to
select only characters 1 to 7 from File.csv
, use
cut -c 1-7 File.csv
To combine the tools that I presented above can be a very powerful way
rapidly summarize information from text files. The ‘|’ command
provides a ‘pipe’ to pass the output from one command to another’s
input. For example, if we want to count the number of sequences in a
fasta file with the wc
command, we first need to extract all lines
starting with >
in the file; then we can count the number of lines.
This can be done with a single line of commands:
grep ">" Fastafile.fasta | wc -l
Here, grep
is used to search for >
signs in the fasta file. All
sequence-id’s start with this character. Instead of printing all these
lines to the terminal, we re-direct it to the wc
command with the
piping symbol |
. Using the -l
option, wc
counts all the
lines. Here, wc
doesn’t need an input file.
If we know that all sequence-id’s have 27 characters, like
gi|289900837|gb|ADD21559.1|
, we can easily extract all sequence-id’s
with:
grep ">" Fastafile.fasta | cut -c 2-28
First all lines with a >
character are extracted. These lines are
‘piped’ (with |
) as input to the cut
command, which extracts
character 2 to 28 (leaving out the first character >
) from each
line.
The output is:
gi|226446429|gb|ACO58580.1| gi|226446417|gb|ACO58574.1| gi|226446417|gb|ACO58574.1| gi|359372673|gb|AEV42205.1| gi|359372673|gb|AEV42205.1| gi|307175086|gb|EFN65228.1| gi|307175086|gb|EFN65228.1| gi|307043818|gb|ADN23625.1| gi|354550152|gb|AER28025.1|
If you want to share your own pipelines of commands, please do so in the comments section below - specifically, if they can be used on files that we usually encounter in the analysis of Next Generation Sequencing data, like fasta files, sam files, vcf files, etc.