You have mapped reads of genetic code against a genomic reference with bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) and would like to get some overview statistics from your SAM files?
Here, I present a python script that takes one or more SAM files as
input and provides the following information in a text file named
out.txt
:
The script uses fields that are specific to SAM files:
It was not tested on other sam files.
You can download the script Bowtie2MappingOverview.py
from:
Extract the python script and save it to the folder where one or several sam files are located. Open a terminal and navigate to this folder.
To apply the script on one sam file, type:
python Bowtie2MappingOverview.py Samfile.sam
Change Samfile.sam
to the name of your own sam file.
To apply it to several sam files, type the name of each of them after the first sam file name, like:
python Bowtie2MappingOverview.py Samfile1.sam Samfile2.sam Samfile3.sam
In the above command, 3 sam files are analyzed. The created out.txt
file will contain information in the following format:
| sequence | total | mapq20 | unmapped | discordant | nonunique | |--------------+----------+----------+----------+------------+-----------| | Samfile1.sam | 12369010 | 0.199497 | 0.797805 | 0.005362 | 0.001147 | | Samfile2.sam | 12369010 | 0.747813 | 0.150322 | 0.072331 | 0.055457 | | Samfile3.sam | 11253608 | 0.200873 | 0.796430 | 0.004038 | 0.001412 |
To apply it to all sam files in a folder, type:
python Bowtie2MappingOverview.py *.sam
The script contains the following python code
import re # to use regular expressions import sys # to read user arguments # Define the search patterns: HeaderPattern= re.compile('^@') DiscordantPattern= re.compile('YT:Z:DP') # This field is specific to mappings that were done with Bowtie2 NonUniquePattern= re.compile('XS:i') # This field is specific for mappings that mapped more than once outfile=open('out.txt','w') headline= ['sequence','total','mapq20','unmapped','discordant','nonunique'] headlineout='\t'.join(headline) outfile.write(headlineout) outfile.write('\n') def MappingOverview(samfile): ''' Counts the total number of aligned reads; the percentage of reads with a mapping quality below 20; the percentage of unmapped reads; the percentage of discordantly mapped reads; and the percentage of non-unique mappings. ''' sam=open(samfile) total=0 mapq20=0 unmapped=0 discordant=0 nonunique=0 for line in sam: line=line.strip() if not HeaderPattern.search(line): # if there is no @ at the beginning of the line total=total+1 splitted=line.split('\t') # Split the lines by tab separators if int(splitted[4])>19: # The fifth column -- which is the mapping quality -- shall be over 19 mapq20=mapq20+1 if splitted[2] is '*': # column 3 in the SAM file shows "Name of reference sequence where alignment occurs" or a * for unmapped reads. unmapped=unmapped+1 if DiscordantPattern.search(line): # Look for the field indicating discordant mappings discordant=discordant+1 if NonUniquePattern.search(line): # Look for the field indicating concordant mappings nonunique=nonunique+1 outfile.write('%s\t%12i\t%7.6f\t%7.6f\t%7.6f\t%7.6f\n' % (samfile, total, float(mapq20)/float(total), float(unmapped)/float(total), float(discordant)/float(total), float(nonunique)/float(total))) for s in sys.argv[1:]: MappingOverview(str(s))