View and extract data


Got some sequencing data? Many powerful tools to analyse them are based on the command line, and this is part of a series of short but essential posts that help you get started. I assume that you are working on a UNIX-based operating system (‘Mac’ or ‘Linux’ computer).

If you are new to the command line you might wonder how to open files to look at their content. Most files that result from ‘Next Generation Sequencing’ analysis-pipelines contain thousands of lines and/or columns of structured text. They can be huge in size and are generally not meant to be opened with conventional GUI programs. The commands that I introduce here equip you with powerful tools to look at and edit small to large data files, and to rapidly extract and summarize information from them - like, for example, the number of sequences in a fasta file.

Display or combine results with ‘cat’

The most common use of the cat command is to display the content of text files by typing the name of the file after the cat command and hitting ENTER.

cat Testfile.txt

This command will display the content of Testfile.txt to the screen. It does not make sense to display the content of very large files in this way, but with the > and >> operators, you can rapidly combine two files, even huge ones. For example, in the following command

cat Testfile1.txt > CombinedFile.txt
cat Testfile2.txt >> CombinedFile.txt

The > operator redirects the output of the cat Testfile1.txt command (the content of Testfile1.txt) to CombinedFile.txt. The >> operator adds the output of the cat Testfile2.txt command to CombinedFile.txt. If you used the > operator in the second line, the content of CombinedFile.txt would be overwritten, not appended. So, the > operator (over)writes content to a specified file while the >> operator appends content to a specified file. If the specified file does not exist yet, both operators create it.
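
To see the difference between the two operators, you can try them on two tiny files. This is only a minimal sketch: the content of the files is made up, and everything happens in a temporary directory (the variable name workdir is my own choice) so that nothing in your working folder gets overwritten.

```shell
# Work in a throwaway directory so that no real data is touched
workdir=$(mktemp -d)
cd "$workdir"

# Create two small example files (made-up content)
printf 'first file\n' > Testfile1.txt
printf 'second file\n' > Testfile2.txt

# '>' creates CombinedFile.txt, or overwrites it if it already exists
cat Testfile1.txt > CombinedFile.txt
# '>>' appends to CombinedFile.txt, or creates it if it does not exist
cat Testfile2.txt >> CombinedFile.txt

# Show the combined content
cat CombinedFile.txt

# cat also accepts several files at once, which gives the same result in one step
cat Testfile1.txt Testfile2.txt > CombinedFile2.txt
```

Both CombinedFile.txt and CombinedFile2.txt should now contain the two lines in the same order.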

Uncover the content of a file line by line with ‘less’, ‘more’, and ‘most’

The purpose of the three tools less, more, and most is pretty similar: to display the content of text files one line or one screen at a time. Since they don't try to read the entire content of a file at once, they are very useful for looking into large files.

Most likely, less and more are installed on your system by default. To install most on a Debian-based Linux system such as Ubuntu, type

sudo apt-get install most

Enter your user password and hit ENTER.

Historically, more is the oldest tool. Compared to the other two, it only allows you to scroll forward in a file, but not backward.

For all three tools you need to write the file name after the command to display its content. Try them out and compare the behavior of the three tools.

more Testfile.txt
less Testfile.txt
most Testfile.txt

The most and less commands give you many options to navigate through and search for patterns in the text files. Here I show some examples of useful options for the less command.

Once you have opened a fasta file, for example with

less Fastafile.fasta

… you can search for patterns, like the nucleotide sequence ‘GCTC’, with /, like

/GCTC

By pressing n you can jump to the next match in the file.

To show only those lines in the file that match the nucleotide sequence ‘GCTC’, type this sequence after the & sign:

&GCTC

To go to the last line of the file, type G; to go to the first line, type g. To close the file again, hit q.

The less command has more options than this. You get an overview of these with the following command

less --help

Counting words, lines, and characters with ‘wc’

If you want to get a rapid overview of the number of lines in a file, the wc command is the right tool. In output-files for example, where every line represents a sequence, wc is all you need to count the number of sequences.

wc -l File.txt

The -l option specifies that you want to count the number of lines. The -m and -w options further allow you to count the number of characters or words, which is a handy tool to check if the abstract of your manuscript exceeds the upper limits of your target journal.
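
Here is a small sketch of the three options on a made-up file. The content of File.txt and the variable names are my own assumptions for illustration. The last command shows a variant I find handy: when wc reads from standard input with <, it prints only the number, without the file name.

```shell
# Work in a throwaway directory
workdir=$(mktemp -d)
cd "$workdir"

# Made-up file with three lines, two words per line
printf 'seq1 ACGT\nseq2 GGCC\nseq3 TTAA\n' > File.txt

wc -l File.txt   # number of lines
wc -w File.txt   # number of words
wc -m File.txt   # number of characters (newlines are counted too)

# Reading from standard input leaves out the file name in the output,
# which is convenient when you want to store the number in a variable
lines=$(wc -l < File.txt)
echo "$lines"
```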

Open and edit smaller files with ‘nano’

nano is a simple text editor that runs in the terminal. It is not meant to open big files, like for example raw fastq files that were obtained from Next Generation Sequencing. The name of the file you want to open has to follow the nano command. For example,

nano Testfile.sh

Once you hit ENTER, Testfile.sh is opened and you can scroll through it. In contrast to the tools we looked at before, you can also edit the content of the file by deleting and adding text. At the bottom of the terminal window you see shortcuts for certain actions, for example ^O WriteOut or ^X Exit. The ^ stands for the CTRL key, so you need to press CTRL+O or CTRL+X. Just open a file and try it out.

Display lines that match a certain pattern with ‘grep’

The grep command searches one or more files for a pattern. Matching lines are written out to the terminal, or can be redirected to a file with the > operator (described in the section on cat).

This enables you, for example, to search for a certain sequence-id in a fasta file.

grep "gi|524845790|gb|AGR34129.1|" Fastafile.fasta

This command searches for the heat shock protein Hsp90 [Daphnia pulex] with the gene-id ‘gi|524845790|gb|AGR34129.1|’ in Fastafile.fasta. Since the gene-id search pattern has special characters (|), you need to enclose it in quotation marks ("). The same applies when the search pattern contains any spaces. The command above prints out the line in which it found the gene-id.

>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex]

You can apply the search to more than one file at a time. To search for the gene-id in all fasta files in your present working directory, you could type

grep -n "gi|524845790|gb|AGR34129.1|" *.fasta

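The asterisk * is a wildcard that stands for any sequence of characters, so *.fasta matches all files ending in .fasta. The -n option adds the line number of each match in front of the output line. When you search more than one file, grep also prints the file name in front of each match. As I had two fasta files in my current folder, the output was: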

alldata.fasta:131:>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex]
besthits.fasta:35:>gi|524845790|gb|AGR34129.1| Hsp90 [Daphnia pulex]

The gene-id was found in alldata.fasta on line 131 and in besthits.fasta on line 35.

Use the -v option to invert the search and find all lines that don't match your search pattern.

The -c option counts the number of lines in which your search pattern was found. In fasta files, each sequence record starts with a ‘>’ sign. Thus, to count the number of sequences in a fasta file, you can use the following command

grep -c ">" alldata.fasta
64

The output is 64; meaning that the alldata.fasta file contains 64 sequences.
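
If you want to try the -n, -c and -v options (number the matches, count them, invert the search) without a real data set, a tiny made-up fasta file is enough. In this sketch the file name small.fasta, the sequence-ids and the sequences are all invented for illustration.

```shell
# Work in a throwaway directory
workdir=$(mktemp -d)
cd "$workdir"

# A tiny made-up fasta file with three records
printf '>seqA example one\nACGTACGT\n>seqB example two\nGGCCGGCC\n>seqC example three\nTTAATTAA\n' > small.fasta

# -n: print the header lines together with their line numbers
grep -n ">" small.fasta

# -c: count the header lines, which equals the number of sequences
n_records=$(grep -c ">" small.fasta)
echo "$n_records"

# -v: invert the match and keep only the lines without '>' (the sequences)
grep -v ">" small.fasta > sequences_only.txt
cat sequences_only.txt
```

Note that the > inside the quotation marks is the search pattern, while the > outside of them redirects the output to a file.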

Select rows of a file with ‘head’ and ‘tail’

The head command, followed by the name of a text file, prints by default the first 10 lines/rows of the file to the terminal. The -n option lets you set the number of rows that are printed. For example, to extract from a fasta file the first sequence-id along with the nucleotide or peptide sequence, you can select the first two lines with:

head -n 2 Fastafile.fasta

The output, in this example, is:

>6libTrinity|comp1261_c0_seq1|
MINNLGTIAKSGTKAFMEALSAGADISMIGQFGVGFYSAYLVADKVTVTSKHNDDEQYIWESSAGGSFTIKQDNSEPLGRGTKIVLQMKEDQAEYIEEKKVKEIVKKHSQFIGYPIKLMVQKEREKEVSDDEAEEEKKEETKEEPKIEDVGEDEDADKEDGDKKKKKTIKEKYTEDEELNKTKPIWTRNADDISSEEYGEFYKSLTNDWEEHLAVKHFSVEGQLEFRALLFIPKRAPFDLFENKKSKNNIKLYVRRVFIMDNCDEIIPEYLNFVRGVVDSEDLPLNISREMLQQNKILKVIRKNLVKKVMELIDEIAEDKDNYKKFYEQFSKNMKLGIHEDSTNRKKLAGHLRYFTSASGDEMCGLSDYVSRMKENQKDIYYITGESKDVVGTSSFVETLKKRGLECIYMTEPIDEYVVQQLKEFDGKNLVSVTKEGLELPEDEEEKKKKEADKEKFEPLCKVMKDILDKKVEKVVVSSRLVSSPCCIVTSQYGWTANMERIMKAQALRDTSTMGYMAAKKHLEINPEHSIIENL

When the line number K is preceded by a -, all but the last K lines are printed. This works with GNU head, which is the default on Linux. For example, the command to print all but the last ten lines from a fasta file is:

head -n -10 Fastafile.fasta

The tail command, in contrast, prints by default the last 10 lines of a file to the terminal. Here, too, you can select the number of lines with the -n option. The following command selects the last 5 lines from Fastafile.fasta.

tail -n 5 Fastafile.fasta

When the line number K is preceded by a +, the output starts at line K, so all but the first K-1 lines are printed. This is useful to remove header lines with meta-information from a file. For example, to exclude the first three lines from a text file, you can use:

tail -n +4 Textfile.txt
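
To check that head and tail select the lines you expect, you can run them on a small made-up table. In this sketch the content of Textfile.txt (three header lines followed by five data rows) and the output file name data_only.txt are assumptions of mine.

```shell
# Work in a throwaway directory
workdir=$(mktemp -d)
cd "$workdir"

# Made-up table: three header lines with meta-information, then five data rows
printf '# sample: test\n# date: unknown\n# columns: id value\nrow1 10\nrow2 20\nrow3 30\nrow4 40\nrow5 50\n' > Textfile.txt

# Print only the three header lines
head -n 3 Textfile.txt

# Print everything from line 4 onwards and store it in a new file
tail -n +4 Textfile.txt > data_only.txt
cat data_only.txt

# Print the last two rows
tail -n 2 Textfile.txt
```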

Extracting columns with ‘cut’

If the stored data are tabular, cut can extract single columns from them with the -f option. By default, cut expects your columns to be TAB delimited. If your columns are instead delimited by commas or single spaces, you can specify this with the -d option. To extract the fourth and fifth column from a comma-separated file, you would use the following command:

cut -d "," -f 4,5 File.csv

By default, cut also prints lines that do not contain the specified delimiter, passing them through unchanged. The -s option lets you exclude these lines from the output. Here's the command to extract columns 1 to 6 from File.csv and disregard all lines that do not contain a comma:

cut -d "," -f 1-6 -s File.csv

The -c option allows you to select only a certain range of characters, instead of specifying which columns to select. So, to select only characters 1 to 7 from File.csv, use

cut -c 1-7 File.csv
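
The effect of the -s option is easiest to see on a file that contains a line without any delimiter. The following is a minimal sketch with a made-up File.csv. The column names, the values and the output file name columns.csv are invented for illustration.

```shell
# Work in a throwaway directory
workdir=$(mktemp -d)
cd "$workdir"

# Made-up comma-separated table; the second line contains no comma at all
printf 'id,species,length,gc,hits\nthis line has no delimiter\ns1,Daphnia pulex,720,0.41,3\n' > File.csv

# Columns 4 and 5; the line without a comma is printed unchanged
cut -d "," -f 4,5 File.csv

# Same columns with -s; the line without a comma is dropped
cut -d "," -f 4,5 -s File.csv > columns.csv
cat columns.csv

# First seven characters of every line, regardless of delimiters
cut -c 1-7 File.csv
```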

The powerful piping tool ‘|’

Combining the tools that I presented above can be a very powerful way to rapidly summarize information from text files. The | operator provides a ‘pipe’ that passes the output of one command to the input of another. For example, if we want to count the number of sequences in a fasta file with the wc command, we first need to extract all lines starting with > and then count them. With a pipe, this can be done with a single line of commands:

grep ">" Fastafile.fasta | wc -l

Here, grep is used to search for > signs in the fasta file. All sequence-ids start with this character. Instead of printing all these lines to the terminal, we pass them on to the wc command with the piping symbol |. Using the -l option, wc counts all the lines. Here, wc doesn't need an input file.

If we know that all sequence-ids have 27 characters, we can extract them without the leading > with:

grep ">" Fastafile.fasta | cut -c 2-28

First, all lines with a > character are extracted. These lines are ‘piped’ (with |) as input to the cut command, which extracts characters 2 to 28 (leaving out the first character >) from each line.

The output is:

gi|226446429|gb|ACO58580.1|
gi|226446417|gb|ACO58574.1|
gi|226446417|gb|ACO58574.1|
gi|359372673|gb|AEV42205.1|
gi|359372673|gb|AEV42205.1|
gi|307175086|gb|EFN65228.1|
gi|307175086|gb|EFN65228.1|
gi|307043818|gb|ADN23625.1|
gi|354550152|gb|AER28025.1|
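
Pipes also work with the other options shown in this post. As a sketch, the following lines use cut with | as delimiter to pull out only the accession from each header, and grep -v together with wc -m to count the characters of all sequences. The two records in this Fastafile.fasta are made up; I only assume that the headers follow the ‘gi|number|gb|accession|’ layout shown above.

```shell
# Work in a throwaway directory
workdir=$(mktemp -d)
cd "$workdir"

# Two made-up records (ids and sequences are invented)
printf '>gi|111111111|gb|AAA00001.1| protein one\nMKV\n>gi|222222222|gb|BBB00002.1| protein two\nMLA\n' > Fastafile.fasta

# Header lines -> fourth '|'-separated field, which is the accession
grep ">" Fastafile.fasta | cut -d "|" -f 4

# Sequence lines only -> number of characters (newlines are counted too)
grep -v ">" Fastafile.fasta | wc -m

# The result of a pipe can be redirected to a file like any other output
grep ">" Fastafile.fasta | cut -d "|" -f 4 > accessions.txt
```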

If you want to share your own pipelines of commands, please do so in the comments section below, especially if they can be used on files that we usually encounter in the analysis of Next Generation Sequencing data, like fasta files, sam files, vcf files, etc.