View and extract data
Got some sequencing data? Many powerful tools to analyse them are based on the command line, and this is part of a series of short but essential posts that help you get started. I assume that you are working on a UNIX-based operating system (‘Mac’ or ‘Linux’ computer).
If you are new to the command line you might wonder how to open files to look at their content. Most files that result from ‘Next Generation Sequencing’ analysis-pipelines contain thousands of lines and/or columns of structured text. They can be huge in size and are generally not meant to be opened with conventional GUI programs. The commands that I introduce here equip you with powerful tools to look at and edit small to large data files, and to rapidly extract and summarize information from them - like, for example, the number of sequences in a fasta file.
Display or combine results with ‘cat’
The most common use of the cat command is to display the content of text files by typing the name of the file after the cat command and hitting ENTER.
cat Testfile.txt
This command will display the content of Testfile.txt on the screen. It does not make sense to display the content of very large files in this way, but with the > and >> operators, you can rapidly combine two files, even huge ones. For example, with the following commands:
cat Testfile1.txt > CombinedFile.txt
cat Testfile2.txt >> CombinedFile.txt
The > operator redirects the output of the cat Testfile1.txt command (the content of Testfile1.txt) to CombinedFile.txt. The >> operator adds the output of the cat Testfile2.txt command to the end of CombinedFile.txt. If you used the > operator in the second line, the content of CombinedFile.txt would be overwritten, not appended. So, the > operator (over)writes content to a specified file while the >> operator appends content to a specified file. In both cases, the file is created if it does not exist yet.
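The difference between the two operators can be tried out safely with a small sketch like the following. It works in a temporary directory, and the file contents (and the extra file name Overwritten.txt) are made up for illustration.

```shell
# Work in a throwaway directory so no existing files are touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Two small example files (contents are made up for illustration).
printf 'line A1\nline A2\n' > Testfile1.txt
printf 'line B1\nline B2\n' > Testfile2.txt

# '>' creates CombinedFile.txt (or overwrites it), '>>' appends to it.
cat Testfile1.txt > CombinedFile.txt
cat Testfile2.txt >> CombinedFile.txt
combined_lines=$(wc -l < CombinedFile.txt)

# Using '>' twice: the second command replaces what the first one wrote.
cat Testfile1.txt > Overwritten.txt
cat Testfile2.txt > Overwritten.txt
overwritten_lines=$(wc -l < Overwritten.txt)

echo "appended: $combined_lines lines, overwritten: $overwritten_lines lines"
```

The appended file ends up with the lines of both inputs, while the overwritten one only keeps the lines of the second input. Since cat accepts several file names at once, cat Testfile1.txt Testfile2.txt > CombinedFile.txt would give the same combined file in a single step.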
Uncover the content of a file line by line with ‘less’, ‘more’, and ‘most’
The purpose of the three tools less, more, and most is pretty similar: to display the content of text files piece by piece, one screen or one line at a time. Since they don't try to read the entire content of a file at once, they are very useful for looking into large files.
Most likely, less and more are installed on your system by default. To install most on a Debian- or Ubuntu-based Linux system, type
sudo apt-get install most
Enter your user password and hit ENTER.
Historically, more is the oldest tool. Compared to the other two, it was traditionally limited to scrolling forward in a file, but not backward.
For all three tools you need to write the file name after the command to display its content. Try them out and compare the behavior of the three tools.
more Testfile.txt
less Testfile.txt
most Testfile.txt
The most and less commands give you many options to navigate through and search for patterns in the text files. Here I show some examples of useful options for the less command.
Once you have opened a fasta file with less, for example with
less Fastafile.fasta
you can search for patterns, like the nucleotide sequence ‘GCTC’, with /, like
/GCTC
By hitting n you can jump to the next match in the file.
To show only those lines in the file that match the nucleotide sequence ‘GCTC’, type this sequence after the & sign:
&GCTC
To go to the last line of the file, just type G; to go to the first line, type g. To close the file again, hit q.
The less command has more options than this. You get an overview of these with the following command
less --help
Counting words, lines, and characters with ‘wc’
If you want to get a rapid overview of the number of lines in a file, the wc command is the right tool. In output files where every line represents a sequence, for example, wc is all you need to count the number of sequences.
wc -l File.txt
The -l option specifies that you want to count the number of lines. The -m and -w options count the number of characters or words instead, which is handy for checking whether the abstract of your manuscript exceeds the upper limit of your target journal.
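As a quick illustration, the three options can be compared on a tiny example file. This is only a sketch: the file content is made up, and the file is written to a temporary directory.

```shell
# Create a small example file in a throwaway directory (content is made up).
tmpdir=$(mktemp -d)
printf 'seq1 ACGT\nseq2 GGCC\nseq3 TTAA\n' > "$tmpdir/File.txt"

n_lines=$(wc -l < "$tmpdir/File.txt")   # number of lines
n_words=$(wc -w < "$tmpdir/File.txt")   # number of whitespace-separated words
n_chars=$(wc -m < "$tmpdir/File.txt")   # number of characters, newlines included

echo "lines: $n_lines, words: $n_words, characters: $n_chars"
```

Feeding the file to wc with < instead of passing it as an argument makes wc print only the number, without the file name behind it, which is convenient when the count is stored in a variable.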
Open and edit smaller files with ‘nano’
nano is a simple text editor that runs in the terminal. It is not meant to open big files, like, for example, raw fastq files obtained from Next Generation Sequencing. The name of the file you want to open has to follow the nano command. For example,
nano Testfile.sh
Once you hit ENTER, Testfile.sh will be opened and you can scroll through it. In contrast to the tools we looked at before, you can also edit the content of the file by deleting and adding text. At the bottom of the terminal window you see some shortcuts for certain actions, for example ^O WriteOut or ^X Exit. The ^ indicates that you need to press CTRL+O or CTRL+X. Just open a file and try it out.
Display lines that match a certain pattern with ‘grep’
grep searches text files for lines that match a given pattern. The matching lines are written out to the terminal, or can be redirected to a file with the > operator (described in the section on cat). This enables you, for example, to search for a certain sequence-id in a fasta file.
grep "gi|524845790|gb|AGR34129.1|" Fastafile.fasta
This command searches for the heat shock protein Hsp90 [Daphnia pulex] with the gene-id ‘gi|524845790|gb|AGR34129.1|’ in Fastafile.fasta. Since the gene-id search pattern contains characters that are special to the shell (|), you need to enclose it in quotation marks ("). The same applies when the search pattern contains any spaces. The command above prints out the line in which it found the gene-id.
>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex]
You can apply the search to more than one file at a time. To search for the gene-id in all fasta files in your present working directory, you could type
grep -n "gi|524845790|gb|AGR34129.1|" *.fasta
The asterisk * stands for ‘any sequence of characters’, so *.fasta matches every file name ending in .fasta. The -n option puts the number of each matching line in front of the output lines. As I had two fasta files in my current folder, the output was:
alldata.fasta:131:>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex]
besthits.fasta:35:>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex]
The gene-id was found in alldata.fasta on line 131 and in besthits.fasta on line 35. Use the -v option to invert the search and find all lines that don't match your search pattern.
The -c option counts the number of lines in which your search pattern was found. In fasta files, each sequence record starts with a ‘>’ sign. Thus, to count the number of sequences in a fasta file, you can use the following command
grep -c ">" alldata.fasta
64
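The output is 64, meaning that the alldata.fasta file contains 64 sequences. Do not leave out the quotation marks here: an unquoted > would be read by the shell as the redirection operator and would overwrite alldata.fasta.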
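The grep options described in this section can be tried out on a tiny fasta file. The following is a minimal sketch; the file name example.fasta and its records are made up for illustration.

```shell
# Work in a throwaway directory.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# A tiny made-up fasta file: header lines start with '>', sequences follow.
cat > example.fasta <<'EOF'
>seq1|sample
ACGTGCTC
>seq2|sample
TTGACCAA
>seq3|control
GCTCGCTC
EOF

n_records=$(grep -c ">" example.fasta)   # count the header lines
sequences=$(grep -v ">" example.fasta)   # all lines that are NOT headers
hits=$(grep -n "GCTC" example.fasta)     # matching lines, prefixed with line numbers

echo "records: $n_records"
echo "$hits"
```

Here -c counts the three header lines, -v keeps only the three sequence lines, and -n reports that ‘GCTC’ occurs on lines 2 and 6.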
Select rows of a file with ‘head’ and ‘tail’
The head command, followed by the name of a text file, prints by default the first 10 lines/rows of the file to the terminal. The -n option lets you set the number of rows to be printed. For example, to extract from a fasta file the first sequence-id along with the nucleotide or peptide sequence, you can select the first two lines with:
head -n 2 Fastafile.fasta
The output, in this example, is:
>6libTrinity|comp1261_c0_seq1|
MINNLGTIAKSGTKAFMEALSAGADISMIGQFGVGFYSAYLVADKVTVTSKHNDDEQYIWESSAGGSFTIKQDNSEPLGRGTKIVLQMKEDQAEYIEEKKVKEIVKKHSQFIGYPIKLMVQKEREKEVSDDEAEEEKKEETKEEPKIEDVGEDEDADKEDGDKKKKKTIKEKYTEDEELNKTKPIWTRNADDISSEEYGEFYKSLTNDWEEHLAVKHFSVEGQLEFRALLFIPKRAPFDLFENKKSKNNIKLYVRRVFIMDNCDEIIPEYLNFVRGVVDSEDLPLNISREMLQQNKILKVIRKNLVKKVMELIDEIAEDKDNYKKFYEQFSKNMKLGIHEDSTNRKKLAGHLRYFTSASGDEMCGLSDYVSRMKENQKDIYYITGESKDVVGTSSFVETLKKRGLECIYMTEPIDEYVVQQLKEFDGKNLVSVTKEGLELPEDEEEKKKKEADKEKFEPLCKVMKDILDKKVEKVVVSSRLVSSPCCIVTSQYGWTANMERIMKAQALRDTSTMGYMAAKKHLEINPEHSIIENL
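When the number K is preceded by a -, all but the last K lines are printed. This works with GNU head, the version found on most Linux systems; other versions of head may not accept negative numbers. For example, the command to print all but the last ten lines from a fasta file is: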
head -n -10 Fastafile.fasta
The tail command, in contrast, prints by default the last 10 lines of a file to the terminal. Here, too, you can select the number of lines with the -n option. The following command selects the last 5 lines from Fastafile.fasta.
tail -n 5 Fastafile.fasta
When the number K is preceded by a +, the output starts at line K, so all but the first K-1 lines are printed. This is useful to remove header lines with meta-information from a file. For example, to exclude the first three lines from a text file, you can use:
tail -n +4 Textfile.txt
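To see which lines each variant selects, here is a small sketch on a ten-line example file. The file is created in a temporary directory and its content (line1 to line10) is made up for illustration.

```shell
# Ten numbered lines stand in for a real data file.
tmpdir=$(mktemp -d)
printf 'line%s\n' 1 2 3 4 5 6 7 8 9 10 > "$tmpdir/Textfile.txt"

first_two=$(head -n 2 "$tmpdir/Textfile.txt")      # line1 and line2
last_two=$(tail -n 2 "$tmpdir/Textfile.txt")       # line9 and line10
from_fourth=$(tail -n +4 "$tmpdir/Textfile.txt")   # line4 up to line10

echo "$first_two"
echo "$last_two"
echo "$from_fourth"
```

Note that tail -n +4 starts printing at line 4, which is why it skips exactly three lines.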
Extracting columns with ‘cut’
If the stored data are tabular, cut can extract single columns (with the -f option) from them. By default, cut expects your columns to be TAB delimited. If your columns are instead delimited by commas or single spaces, you can specify this with the -d option. To extract the fourth and fifth column from a comma-separated file, you would use the following command:
cut -d "," -f 4,5 File.csv
By default, cut also prints all lines that do not contain the specified delimiter, unchanged. The -s option lets you exclude these lines from the output. Here's the command to extract columns 1 to 6 from File.csv and disregard all lines not containing a , as delimiter.
cut -d "," -f 1-6 -s File.csv
The -c option allows you to select only a certain range of characters, instead of specifying which columns to select. So, to select only characters 1 to 7 from File.csv, use
cut -c 1-7 File.csv
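The effect of the -f, -s, and -c options is easiest to see on a small table. The following sketch uses a made-up comma-separated file with one line that contains no delimiter at all.

```shell
# A small made-up CSV file; the last line contains no comma.
tmpdir=$(mktemp -d)
cat > "$tmpdir/File.csv" <<'EOF'
id,species,length,gc,hits
s1,Daphnia pulex,712,0.41,3
s2,Daphnia magna,655,0.39,1
this line has no delimiter
EOF

cols=$(cut -d "," -f 4,5 "$tmpdir/File.csv")        # delimiter-free line is printed as is
cols_s=$(cut -d "," -f 4,5 -s "$tmpdir/File.csv")   # -s drops the delimiter-free line
chars=$(cut -c 1-2 "$tmpdir/File.csv")              # first two characters of every line

echo "$cols"
echo "$cols_s"
echo "$chars"
```

Without -s, the last line of the file shows up untouched among the extracted columns, which is easy to overlook in large files.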
The powerful piping tool ‘|’
Combining the tools that I presented above can be a very powerful way to rapidly summarize information from text files. The ‘|’ operator provides a ‘pipe’ to pass the output of one command to the input of another. For example, if we want to count the number of sequences in a fasta file with the wc command, we first need to extract all lines starting with >. With a pipe, both steps can be done with a single line of commands:
grep ">" Fastafile.fasta | wc -l
Here, grep is used to search for > signs in the fasta file. All sequence-ids start with this character. Instead of printing all these lines to the terminal, we pass them to the wc command with the piping symbol |. Using the -l option, wc counts all the lines. Here, wc doesn't need an input file because it reads its input from the pipe.
If we know that all sequence-ids have 27 characters, we can extract them with:
grep ">" Fastafile.fasta | cut -c 2-28
First, all lines with a > character are extracted. These lines are ‘piped’ (with |) as input to the cut command, which extracts characters 2 to 28 (leaving out the first character >) from each line. The output is:
gi|226446429|gb|ACO58580.1|
gi|226446417|gb|ACO58574.1|
gi|226446417|gb|ACO58574.1|
gi|359372673|gb|AEV42205.1|
gi|359372673|gb|AEV42205.1|
gi|307175086|gb|EFN65228.1|
gi|307175086|gb|EFN65228.1|
gi|307043818|gb|ADN23625.1|
gi|354550152|gb|AER28025.1|
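Some ids appear more than once in a list like this, so a natural next step is to count how often each one occurs. The sketch below extends the pipeline with sort and uniq, two standard tools that are not covered in this post. The example file and the sequences and descriptions behind the ids are made up for illustration.

```shell
# A small made-up fasta file in which one id occurs twice.
tmpdir=$(mktemp -d)
cat > "$tmpdir/Fastafile.fasta" <<'EOF'
>gi|226446429|gb|ACO58580.1| example protein A
MINNLGTIAK
>gi|226446417|gb|ACO58574.1| example protein B
MKEDQAEYIE
>gi|226446417|gb|ACO58574.1| example protein B
MKEDQAEYIE
EOF

# Number of distinct ids: uniq only collapses adjacent duplicates, hence the sort.
n_ids=$(grep ">" "$tmpdir/Fastafile.fasta" | cut -c 2-28 | sort | uniq | wc -l)

# How often each id occurs: uniq -c puts the count in front of each id.
counts=$(grep ">" "$tmpdir/Fastafile.fasta" | cut -c 2-28 | sort | uniq -c)

echo "distinct ids: $n_ids"
echo "$counts"
```

Each command in the chain only reads what the previous one printed, so the pipeline can be built up and checked one step at a time.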
If you want to share your own pipelines of commands, please do so in the comments section below, especially if they can be used on files that we usually encounter in the analysis of Next Generation Sequencing data, like fasta files, sam files, vcf files, etc.