Bowtie2 mapping overview

3 minute read

Published:

You have mapped reads of genetic code against a genomic reference with bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) and would like to get some overview statistics from your SAM files?

Here, I present a python script that takes one or more SAM files as input and provides the following information in a text file named

  • the total number of aligned reads
  • the percentage of reads with a mapping quality below 20
  • the percentage of unmapped reads
  • the percentage of discordantly mapped reads
  • the percentage of non-unique mappings

The script uses fields that are specific to SAM files:

  • Detection of discordantly mapped reads is based on the field ‘YT:Z:DP’
  • Detection of non-unique mappings is based on the field ‘XS:i’

It was not tested on other sam files.

Download the script

You can download the script Bowtie2MappingOverview.py from:

How to use it?

Extract the python script and save it to the folder where one or several sam files are located. Open a terminal and navigate to this folder.

To apply the script on one sam file, type:

python Bowtie2MappingOverview.py Samfile.sam

Change Samfile.sam to the name of your own sam file.

To apply it to several sam files, type the name of each of them after the first sam file name, like:

python Bowtie2MappingOverview.py Samfile1.sam Samfile2.sam Samfile3.sam

In the above command, 3 sam files are analyzed. The created out.txt file will contain information in the following format:

sequence total mapq20 unmapped discordant nonunique
Samfile1.sam 12369010 0.199497 0.797805 0.005362 0.001147
Samfile2.sam 12369010 0.747813 0.150322 0.072331 0.055457
Samfile3.sam 11253608 0.200873 0.796430 0.004038 0.001412

To apply it to all sam files in a folder, type:

python Bowtie2MappingOverview.py *.sam

Code

The script contains the following python code

import re # to use regular expressions import sys # to read user arguments

Define the search patterns:

HeaderPattern= re.compile(‘^@’) DiscordantPattern= re.compile(‘YT:Z:DP’) # This field is specific to mappings that were done with Bowtie2 NonUniquePattern= re.compile(‘XS:i’) # This field is specific for mappings that mapped more than once outfile=open(‘out.txt’,’w’) headline= [‘sequence’,’total’,’mapq20’,’unmapped’,’discordant’,’nonunique’] headlineout=’\t’.join(headline) outfile.write(headlineout) outfile.write(‘\n’)

def MappingOverview(samfile): ‘’’ Counts the total number of aligned reads; the percentage of reads with a mapping quality below 20; the percentage of unmapped reads; the percentage of discordantly mapped reads; and the percentage of non-unique mappings. ‘’’ sam=open(samfile) total=0 mapq20=0 unmapped=0 discordant=0 nonunique=0 for line in sam: line=line.strip() if not HeaderPattern.search(line): # if there is no @ at the beginning of the line total=total+1 splitted=line.split(‘\t’) # Split the lines by tab separators if int(splitted[4])>19: # The fifth column – which is the mapping quality – shall be over 19 mapq20=mapq20+1 if splitted[2] is ‘*’: # column 3 in the SAM file shows “Name of reference sequence where alignment occurs” or a * for unmapped reads. unmapped=unmapped+1 if DiscordantPattern.search(line): # Look for the field indicating discordant mappings discordant=discordant+1 if NonUniquePattern.search(line): # Look for the field indicating concordant mappings nonunique=nonunique+1 outfile.write(‘%s\t%12i\t%7.6f\t%7.6f\t%7.6f\t%7.6f\n’ % (samfile, total, float(mapq20)/float(total), float(unmapped)/float(total), float(discordant)/float(total), float(nonunique)/float(total)))

for s in sys.argv[1:]: MappingOverview(str(s))