Count realigned reads in SAM file

1 minute read

Published:

Reads that are spanning InDels (Insertion and Deletion variants) are often misaligned and can result in false positive SNPs (Single Nucleotide Polymorphisms). A popular tool that can re-align these reads is GATK's IndelRealigner. Once the job is done it's good to know how many of the reads had been actually re-aligned.

In this blog post, I introduce the pyton script file](https://samtools.github.io/hts-specs/SAMv1.pdf) as an input and informs you about the total number of aligned reads and the number of reads that were re-aligned with GATK. The count of realigned reads is based on the OC tag, which stores the CIGAR string of the reads (see here for an explanation) before they were re-aligned.

Download

You can download the script RealignedReadsCount.py from https://gist.github.com/alj1983/df8f852872fd0f375fa6/download

How to use it?

usage: RealignedReadsCount.py [-h] [-s SAM]

Counts the total number of aligned reads and the number of reads that were realigned with GATK (based on the OC tag)

optional arguments: -h, –help show this help message and exit -s SAM, –sam SAM sam file name

For example, applied to FILE.sam, the script provides two lines of information:

FILE.sam total number of reads: 8580266 FILE.sam number of realigned reads: 99562

Code

import re # to use regular expressions import argparse # To allow users setting flags in order to define what to filter for.

parser = argparse.ArgumentParser(description=”Counts the total number of aligned reads and the number of reads that were realigned with GATK (based on the OC tag)”) parser.add_argument(“-s”,”–sam”, help=”sam file name”) args=parser.parse_args()

Define the search patterns:

HeaderPattern= re.compile(‘^@’) RealignedPattern= re.compile(‘OC’) # This field is a specific tag storing the CIGAR string before realignment was performed

def RealignedReadsCount(samfile): ‘’’ Counts the total number of aligned reads; Count the number of realigned reads; ‘’’ sam=open(samfile) total=0 realigned=0 for line in sam: line=line.strip() if not HeaderPattern.search(line): # if there is no @ at the beginning of the line total=total+1 if RealignedPattern.search(line): # Look for the field indicating realigned reads realigned=realigned+1

 print samfile, '    total number of reads:', '%10i' % total
 print samfile, 'number of realigned reads:', '%10i' % realigned

RealignedReadsCount(str(args.sam))