Count realigned reads in SAM file
Published:
Reads that are spanning InDels (Insertion and Deletion variants) are often misaligned and can result in false positive SNPs (Single Nucleotide Polymorphisms). A popular tool that can re-align these reads is GATK's IndelRealigner. Once the job is done it's good to know how many of the reads had been actually re-aligned.
In this blog post, I introduce the pyton script
file](https://samtools.github.io/hts-specs/SAMv1.pdf) as an input and
informs you about the total number of aligned reads and the number of
reads that were re-aligned with GATK. The count of realigned reads is
based on the OC
tag, which stores the CIGAR string of the reads (see
here for an explanation) before they were re-aligned.
Download
You can download the script RealignedReadsCount.py
from
https://gist.github.com/alj1983/df8f852872fd0f375fa6/download
How to use it?
usage: RealignedReadsCount.py [-h] [-s SAM]
Counts the total number of aligned reads and the number of reads that were realigned with GATK (based on the OC tag)
optional arguments: -h, –help show this help message and exit -s SAM, –sam SAM sam file name
For example, applied to FILE.sam, the script provides two lines of information:
FILE.sam total number of reads: 8580266 FILE.sam number of realigned reads: 99562
Code
import re # to use regular expressions import argparse # To allow users setting flags in order to define what to filter for.
parser = argparse.ArgumentParser(description=”Counts the total number of aligned reads and the number of reads that were realigned with GATK (based on the OC tag)”) parser.add_argument(“-s”,”–sam”, help=”sam file name”) args=parser.parse_args()
Define the search patterns:
HeaderPattern= re.compile(‘^@’) RealignedPattern= re.compile(‘OC’) # This field is a specific tag storing the CIGAR string before realignment was performed
def RealignedReadsCount(samfile): ‘’’ Counts the total number of aligned reads; Count the number of realigned reads; ‘’’ sam=open(samfile) total=0 realigned=0 for line in sam: line=line.strip() if not HeaderPattern.search(line): # if there is no @ at the beginning of the line total=total+1 if RealignedPattern.search(line): # Look for the field indicating realigned reads realigned=realigned+1
print samfile, ' total number of reads:', '%10i' % total
print samfile, 'number of realigned reads:', '%10i' % realigned
RealignedReadsCount(str(args.sam))