marinetics blog


Bioinformatics Solutions in Ecology and Evolution

24 Feb 2015

Bowtie2 mapping overview

Tagged: mapping python NGS






Related Posts

You have mapped reads of genetic code against a genomic reference with bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) and would like to get some overview statistics from your SAM files?

Here, I present a python script that takes one or more SAM files as input and provides the following information in a text file named out.txt:

  • the total number of aligned reads
  • the percentage of reads with a mapping quality below 20
  • the percentage of unmapped reads
  • the percentage of discordantly mapped reads
  • the percentage of non-unique mappings

The script uses fields that are specific to SAM files:

  • Detection of discordantly mapped reads is based on the field ‘YT:Z:DP’
  • Detection of non-unique mappings is based on the field ‘XS:i’

It was not tested on other sam files.

Download the script

You can download the script Bowtie2MappingOverview.py from:

How to use it?

Extract the python script and save it to the folder where one or several sam files are located. Open a terminal and navigate to this folder.

To apply the script on one sam file, type:

python Bowtie2MappingOverview.py Samfile.sam

Change Samfile.sam to the name of your own sam file.

To apply it to several sam files, type the name of each of them after the first sam file name, like:

python Bowtie2MappingOverview.py Samfile1.sam Samfile2.sam Samfile3.sam

In the above command, 3 sam files are analyzed. The created out.txt file will contain information in the following format:

| sequence     |    total |   mapq20 | unmapped | discordant | nonunique |
|--------------+----------+----------+----------+------------+-----------|
| Samfile1.sam | 12369010 | 0.199497 | 0.797805 |   0.005362 |  0.001147 |
| Samfile2.sam | 12369010 | 0.747813 | 0.150322 |   0.072331 |  0.055457 |
| Samfile3.sam | 11253608 | 0.200873 | 0.796430 |   0.004038 |  0.001412 |

To apply it to all sam files in a folder, type:

python Bowtie2MappingOverview.py *.sam

Code

The script contains the following python code

import re # to use regular expressions
import sys # to read user arguments

# Define the search patterns:

HeaderPattern= re.compile('^@')
DiscordantPattern= re.compile('YT:Z:DP') # This field is specific to mappings that were done with Bowtie2
NonUniquePattern= re.compile('XS:i') # This field is specific for mappings that mapped more than once
outfile=open('out.txt','w')
headline= ['sequence','total','mapq20','unmapped','discordant','nonunique']
headlineout='\t'.join(headline)
outfile.write(headlineout)
outfile.write('\n')

def MappingOverview(samfile):
'''
Counts the total number of aligned reads;
the percentage of reads with a mapping quality below 20;
the percentage of unmapped reads;
the percentage of discordantly mapped reads;
and the percentage of non-unique mappings.
'''
sam=open(samfile)
total=0
mapq20=0
unmapped=0
discordant=0
nonunique=0
for line in sam:
line=line.strip()
if not HeaderPattern.search(line): # if there is no @ at the beginning of the line
total=total+1
splitted=line.split('\t') # Split the lines by tab separators
if int(splitted[4])>19: # The fifth column -- which is the mapping quality -- shall be over 19
mapq20=mapq20+1
if splitted[2] is '*': # column 3 in the SAM file shows "Name of reference sequence where alignment occurs" or a * for unmapped reads.
unmapped=unmapped+1
if DiscordantPattern.search(line): # Look for the field indicating discordant mappings
discordant=discordant+1
if NonUniquePattern.search(line): # Look for the field indicating concordant mappings
nonunique=nonunique+1
outfile.write('%s\t%12i\t%7.6f\t%7.6f\t%7.6f\t%7.6f\n' % (samfile, total, float(mapq20)/float(total), float(unmapped)/float(total), float(discordant)/float(total), float(nonunique)/float(total)))

for s in sys.argv[1:]:
MappingOverview(str(s))