marinetics blog

Bioinformatics Solutions in Ecology and Evolution

Alexander-Jueterbock@web.de

Homepage

21 Feb 2015

View and extract data

Tagged: Commandline

05 May 2017
» Selecting MethylRAD tags with less degenerate adaptor ends
11 Apr 2017
» In silico digestion with type IIB restricrion enzymes
23 Sep 2015
» R package 'MaxentVariableSelection' released

Got some sequencing data? Many powerful tools to analyse them are based on the command line and this is part of a series of short but essential posts that help you getting started. I assume that you are working on a UNIX-based operating system (‘Mac’ or ‘Linux’ computer).

If you are new to the command line you might wonder how to open files to look at their content. Most files that result from ‘Next Generation Sequencing’ analysis-pipelines contain thousands of lines and/or columns of structured text. They can be huge in size and are generally not meant to be opened with conventional GUI programs. The commands that I introduce here equip you with powerful tools to look at and edit small to large data files, and to rapidly extract and summarize information from them - like, for example, the number of sequences in a fasta file.

Display or combine results with ‘cat’

The most common use of the cat command is to display the content of text files by typing the name of the file behind the cat command and hitting ENTER.

cat Testfile.txt

This command will display the content of Testfile.txt to the screen. It does not make sense to display the content of very large files in this way, but with the > and >> operators, you can rapidly combine two files, even huge ones. For example, in the following command Testfile1.txt and Testfile2.txt are combined to CombinedFile.txt.

cat Testfile1.txt > CombinedFile.txt
cat Testfile2.txt >> CombinedFile.txt

The > operator redirects the output of the cat Testfile1.txt command (the content of Testfile1.txt) to CombinedFile.txt. The >> operator adds the output of the cat Testfile2.txt command to the CombinedFile.txt. If we would use the > operator instead of the >> operator in the second line, the content of CombinedFile.txt would be overwritten, not appended. So, the > operator (over)writes content to a specified file while the >> operator appends content to a specified file. If you use the >> operator, the specified file needs to exist already.

Uncover the content of a file line by line with ‘less’, ‘more’, and ‘most’

The purpose of the three tools less, more, and most is pretty similar: to display the content of text files one line after the other. Since they don’t try to read the entire content of a file at once, they are very useful to look into large files.

Most likely, less and more are by default installed on your system. To install most, type

sudo apt-get install most

Enter your user password and hit ENTER.

Historically, more is the oldest tool. Compared to the other two, it only allows to scroll forward in a file, but not backward.

For all three tools you need to write the file name after the command to display its content. Try them out and compare the behavior of the three tools.

more Testfile.txt
less Testfile.txt
most Testfile.txt

The most and less commands give you many options to navigate through and search for patterns in the text files. Here I show some examples of useful options for the less command.

Once you opened a fasta file , for example, with less …

less Fastafile.fasta

… you can search for patterns, like the nucleotide sequence ‘GCTC’, with /, like

/GCTC

With hitting n you can repeat this search on the next lines in the file.

To show only those lines in the file that match the nucleotide sequence ‘GCTC’, type this sequence after the & sign:

&GCTC

To go to the last line of the file, just type G, to go to the first line, type g. To close the file again, hit q.

The less command has more options than this. You get an overview of these with the following command

less --help

Counting words, lines, and characters with ‘wc’

If you want to get a rapid overview of the number of lines in a file, the wc command is the right tool. In output-files for example, where every line represents a sequence, wc is all you need to count the number of sequences.

wc -l File.txt

The -l option specifies that you want to count the number of lines. The -m and -w options further allow you to count the number of characters or words, which is a handy tool to check if the abstract of your manuscript exceeds the upper limits of your target journal.

Open and edit smaller files with ‘nano’

nano allows to open and edit small text files from the command line. It is not meant to open big files, like for example raw fastq files that were obtained from Next Generation Sequencing. The name of the file you want to open has to follow the nano command. For example,

nano Testfile.sh

Once you hit ENTER, Testfile.sh will be opened and you can scroll through it, and compared to the tools we looked at before, you can edit the content of the file by deleting and adding text. At the bottom of the terminal window you see some shortcuts for certain actions. For example ^O WriteOut or ^X Exit. The ^ indicates that you need to press CTRL+O or CTRL+X. Just open a file and try it out.

Display lines that match a certain pattern with ‘grep’

grep allows you to search for patterns in a text file. All matching lines are written out to the terminal; or can be re-directed to a file with the > operator (described in the introduction section of cat).

This enables you, for example, to search for a certain sequence-id in a fastq file.

grep "gi|524845790|gb|AGR34129.1|" Fastafile.fasta

This command searches for the heat shock protein Hsp90 [Daphnia pulex] with the gene-id ‘gi|524845790|gb|AGR34129.1|’ in Fastafile.fasta. Since the gene-id search pattern as special characters (|), you need to enclose it in quotation marks (="=). The same applies when the search pattern contains any spaces. The command above prints out the line in which it found the gene-id.

>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex]

You can apply the search to more than one file at a time. To search for the gene-id in all fasta files in your present working directory, you could type

grep -n "gi|524845790|gb|AGR34129.1|" *.fasta

The asterisk * stands for ‘any character’. The -n option adds the line numbers, that have matched the search pattern, in front of the output lines. As I had two fasta files in my current folder, alldata.fasta and besthits.fasta, the output is

alldata.fasta:131:>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex]
besthits.fasta:35:>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex]

The gene-id was found in alldata.fasta on line 131 and in besthits.fasta in line 35. You can also use grep with the -v option to inverse the search and find all lines that don’t match the your search-pattern.

grep can do one more thing that is very useful: it can count the number of lines that your search pattern was found. In fasta files, each sequence record starts with a ‘>’ sign. Thus, to count the number of sequences in a fasta file, you can use the following command

grep -c ">" alldata.fasta
64

The output is 64; meaning that the alldata.fasta file contains 64 sequences.

Select rows of a file with ‘head’ and ‘tail’

The head command, followed by the name of a text file, prints by default the first 10 lines/rows of the file to the terminal. The -n option allows to determine the number of rows that shall be printed. For example, to extract from a fasta file the first sequence-id along with the nucleotide or peptide sequence, you can select the first two lines with:

head -n 2 Fastafile.fasta

The output, in this example, is:

>6libTrinity|comp1261_c0_seq1|
MINNLGTIAKSGTKAFMEALSAGADISMIGQFGVGFYSAYLVADKVTVTSKHNDDEQYIWESSAGGSFTIKQDNSEPLGRGTKIVLQMKEDQAEYIEEKKVKEIVKKHSQFIGYPIKLMVQKEREKEVSDDEAEEEKKEETKEEPKIEDVGEDEDADKEDGDKKKKKTIKEKYTEDEELNKTKPIWTRNADDISSEEYGEFYKSLTNDWEEHLAVKHFSVEGQLEFRALLFIPKRAPFDLFENKKSKNNIKLYVRRVFIMDNCDEIIPEYLNFVRGVVDSEDLPLNISREMLQQNKILKVIRKNLVKKVMELIDEIAEDKDNYKKFYEQFSKNMKLGIHEDSTNRKKLAGHLRYFTSASGDEMCGLSDYVSRMKENQKDIYYITGESKDVVGTSSFVETLKKRGLECIYMTEPIDEYVVQQLKEFDGKNLVSVTKEGLELPEDEEEKKKKEADKEKFEPLCKVMKDILDKKVEKVVVSSRLVSSPCCIVTSQYGWTANMERIMKAQALRDTSTMGYMAAKKHLEINPEHSIIENL

When the line number K is preceded with -, then all but the last K lines are printed. For example, the command to print all but the last ten lines from a fasta file is:

head -n -10 Fastafile.fasta

The tail command, in contrast, prints by default the last 10 lines of a file to the terminal. Also here you can select the number of lines with the -n option. The following command selects the last 5 lines from Fastafile.fasta.

tail -n 5 Fastafile.fasta

When the line number K is preceded by a +, then all but the first K lines are printed. This can be useful to exclude one or several header lines with meta-information from a file. For example, to exclude the first three lines from a text file, you can use:

tail -n +4 Textfile.txt

Extracting columns with ‘cut’

If the stored data are tabular cut can extract single columns (with the -f option) from it. By default, cut expects your columns to be TAB delimited. If your columns are instead limited by commas or single spaces, you can specify this with the -d option. To extract the fourth and fifth column from a comma-separated file, you would use the following command:

cut -d "," -f 4,5 File.csv

By default, cut prints all lines that do not contain the specified delimiter. The -s option allows to exclude these lines from the output. Here’s the command to extract columns 1 to 6 from File.csv and disregard all lines not containing a =,= as delimiter.

cut -d "," -f 1-6 -s File.csv

The -c option allows you to select only a certain range of characters, instead of specifying which columns to select. So, to select only characters 1 to 7 from File.csv, use

cut  -c 1-7 File.csv

The powerful piping tool ‘|’

To combine the tools that I presented above can be a very powerful way rapidly summarize information from text files. The ‘|’ command provides a ‘pipe’ to pass the output from one command to another’s input. For example, if we want to count the number of sequences in a fasta file with the wc command, we first need to extract all lines starting with > in the file; then we can count the number of lines. This can be done with a single line of commands:

grep ">" Fastafile.fasta | wc -l

Here, grep is used to search for > signs in the fasta file. All sequence-id’s start with this character. Instead of printing all these lines to the terminal, we re-direct it to the wc command with the piping symbol |. Using the -l option, wc counts all the lines. Here, wc doesn’t need an input file.

If we know that all sequence-id’s have 27 characters, like gi|289900837|gb|ADD21559.1|, we can easily extract all sequence-id’s with:

grep ">" Fastafile.fasta | cut -c 2-28

First all lines with a > character are extracted. These lines are ‘piped’ (with |) as input to the cut command, which extracts character 2 to 28 (leaving out the first character >) from each line.

The output is:

gi|226446429|gb|ACO58580.1|
gi|226446417|gb|ACO58574.1|
gi|226446417|gb|ACO58574.1|
gi|359372673|gb|AEV42205.1|
gi|359372673|gb|AEV42205.1|
gi|307175086|gb|EFN65228.1|
gi|307175086|gb|EFN65228.1|
gi|307043818|gb|ADN23625.1|
gi|354550152|gb|AER28025.1|

If you want to share your own pipelines of commands, please do so in the comments section below - specifically, if they can be used on files that we usually encounter in the analysis of Next Generation Sequencing data, like fasta files, sam files, vcf files, etc.

Comments