Reads that are spanning InDels (Insertion and Deletion variants) are often misaligned and can result in false positive SNPs (Single Nucleotide Polymorphisms). A popular tool that can re-align these reads is GATK’s IndelRealigner. Once the job is done it’s good to know how many of the reads had been actually re-aligned.
In this blog post, I introduce the pyton script
RealignedReadsCount.py
, which takes your re-aligned SAM file as an
input and informs you about the total number of aligned reads and the
number of reads that were re-aligned with GATK. The count of
realigned reads is based on the OC
tag, which stores the CIGAR
string of the reads (see here for an explanation) before they were
re-aligned.
You can download the script RealignedReadsCount.py
from
https://gist.github.com/alj1983/df8f852872fd0f375fa6/download
usage: RealignedReadsCount.py [-h] [-s SAM] Counts the total number of aligned reads and the number of reads that were realigned with GATK (based on the OC tag) optional arguments: -h, --help show this help message and exit -s SAM, --sam SAM sam file name
For example, applied to FILE.sam, the script provides two lines of information:
FILE.sam total number of reads: 8580266 FILE.sam number of realigned reads: 99562
import re # to use regular expressions import argparse # To allow users setting flags in order to define what to filter for. parser = argparse.ArgumentParser(description="Counts the total number of aligned reads and the number of reads that were realigned with GATK (based on the OC tag)") parser.add_argument("-s","--sam", help="sam file name") args=parser.parse_args() # Define the search patterns: HeaderPattern= re.compile('^@') RealignedPattern= re.compile('OC') # This field is a specific tag storing the CIGAR string before realignment was performed def RealignedReadsCount(samfile): ''' Counts the total number of aligned reads; Count the number of realigned reads; ''' sam=open(samfile) total=0 realigned=0 for line in sam: line=line.strip() if not HeaderPattern.search(line): # if there is no @ at the beginning of the line total=total+1 if RealignedPattern.search(line): # Look for the field indicating realigned reads realigned=realigned+1 print samfile, ' total number of reads:', '%10i' % total print samfile, 'number of realigned reads:', '%10i' % realigned RealignedReadsCount(str(args.sam))