marinetics blog


Bioinformatics Solutions in Ecology and Evolution

03 May 2015

Count realigned reads in SAM file

Tagged: python NGS samfile mapping






Related Posts

Reads that are spanning InDels (Insertion and Deletion variants) are often misaligned and can result in false positive SNPs (Single Nucleotide Polymorphisms). A popular tool that can re-align these reads is GATK’s IndelRealigner. Once the job is done it’s good to know how many of the reads had been actually re-aligned.

In this blog post, I introduce the pyton script RealignedReadsCount.py, which takes your re-aligned SAM file as an input and informs you about the total number of aligned reads and the number of reads that were re-aligned with GATK. The count of realigned reads is based on the OC tag, which stores the CIGAR string of the reads (see here for an explanation) before they were re-aligned.

Download

You can download the script RealignedReadsCount.py from https://gist.github.com/alj1983/df8f852872fd0f375fa6/download

How to use it?

usage: RealignedReadsCount.py [-h] [-s SAM]

Counts the total number of aligned reads and the number of reads that were
realigned with GATK (based on the OC tag)

optional arguments:
  -h, --help         show this help message and exit
  -s SAM, --sam SAM  sam file name

For example, applied to FILE.sam, the script provides two lines of information:

FILE.sam   total number of reads: 8580266 
FILE.sam number of realigned reads: 99562

Code

import re # to use regular expressions
import argparse # To allow users setting flags in order to define what to filter for.


parser = argparse.ArgumentParser(description="Counts the total number of aligned reads and the number of reads that were realigned with GATK (based on the OC tag)")
parser.add_argument("-s","--sam", help="sam file name")
args=parser.parse_args()

# Define the search patterns:
HeaderPattern= re.compile('^@')
RealignedPattern= re.compile('OC') # This field is a specific tag storing the CIGAR string before realignment was performed

def RealignedReadsCount(samfile):
     '''
     Counts the total number of aligned reads;
     Count the number of realigned reads;
     '''
     sam=open(samfile)
     total=0
     realigned=0
     for line in sam:
         line=line.strip()
         if not HeaderPattern.search(line): # if there is no @ at the beginning of the line
             total=total+1
             if RealignedPattern.search(line): # Look for the field indicating realigned reads
                  realigned=realigned+1

     print samfile, '    total number of reads:', '%10i' % total
     print samfile, 'number of realigned reads:', '%10i' % realigned

RealignedReadsCount(str(args.sam))