Determine read-coverage thresholds

5 minute read

Published:

If you want to identify sequence variants like SNPs (Single Nucleotide Polymorphisms) or InDels (Insertions and Deletions) in your sequencing data, you want to avoid regions of too low coverage. The lower the coverage the more difficult it becomes to discriminate sequencing errors from real sequence variants.

Too high coverage, however, may also be problematic and this blog post introduces a python script that can help to identify the problematic high coverage in pileup files. Pileup format is a text-based format that summarizes base calls of aligned reads to a reference sequence (see here for a detailed description: http://samtools.sourceforge.net/pileup.shtml)

Extraordinarily high coverage can result from copy number variations that will lead to false positive variant calls. This gets more clear in Figure Fig.CNV. Here, the two gene copies in the sequenced individual will map to the single gene copy in the reference sequence. This region will be covered by a much higher number of reads than other regions of the reference sequence, and will be identified as a SNP.

::: {.center} How copy number variation can lead to false positive
SNPs.{width=”500px” title=”CNV” alt=”CNV” align=”middle”} :::

There are different approaches to avoid calling false SNPs due to copy number variation. The NGS analysis pipeline popoolation for pooled populations, for example, allows to subsample a pileup file to a uniform coverage with the script subsample-pileup.pl. Others have set an upper coverage threshold, like twice the mean coverage (Lynch et al., 2014), the upper 2% of all coverages (Tobler et al. , 2014) or the mean coverage plus two standard deviations (Beissinger et al. , 2014). Sites with a coverage above this threshold were discarded before calling variants. The python script thresholds.

References:

  • Lynch, Michael, et al. “Population-genetic inference from pooled-sequencing data.” Genome biology and evolution 6.5 (2014): 1210-1218. (Link)

  • Tobler, Ray, et al. “Massive habitat-specific genomic response in D. melanogaster populations during experimental evolution in hot and cold environments.” Molecular biology and evolution 31.2 (2014): 364-375. (Link)

  • Beissinger, Timothy M., et al. “A genome-wide scan for evidence of selection in a maize population under long-term artificial selection for ear number.” Genetics 196.3 (2014): 829-840. (Link)

Download

You can download the script PileupCoverageOverview.py from https://gist.github.com/alj1983/2f0ce22fa28aef4a0922/download

How to use it?

usage: PileupCoverageOverview.py [-h] [-p PILEUP] [-ma MAX] [-mi MIN]

Creates a cumulative histogram on read coverage (excluding coverage of 0)

optional arguments: -h, –help show this help message and exit -p PILEUP, –pileup PILEUP pileup file name -ma MAX, –max MAX maximum coverage to be shown -mi MIN, –min MIN minimum coverage to be shown

The -ma and -min options specify the upper and lower coverage thresholds that will be shown in a cumulative histogram. Independent of these options, however, PileupCoverageOverview.py will always return the following coverage measures (numbers give an example):

Coverage - mean: 220.00 Coverage - max: 12312 Coverage - min: 1 Coverage - mean-2stddev: -760 Coverage - mean+2stddev: 1200 Coverage - upper 2%: 2020 Coverage - lower 2%: 1

Fig. Fig.Histogram shows the cumulative histogram of read coverages, that is based on the same input file. Here for example, 90% of all reads had a read-coverage below 600.

::: {.center} Cumulative histogram of sequence
read-coverage{width=”500px” title=”histogram” alt=”histogram” align=”middle”} :::

Code

import re # to use regular expressions import math import argparse # To allow users setting flags in order to define what to filter for. from pylab import figure, title, xlabel, ylabel, hist, axis, grid, savefig # to draw a historgram

parser = argparse.ArgumentParser(description=”Creates a cumulative histogram on read coverage (excluding coverage of 0) and prints out upper and lower coverage thresholds”) parser.add_argument(“-p”,”–pileup”, help=”pileup file name”) parser.add_argument(“-ma”,”–max”, help=”maximum coverage to be shown”) parser.add_argument(“-mi”,”–min”, help=”minimum coverage to be shown”) args=parser.parse_args()

Define the search patterns:

PileupFileNamePattern = re.compile(‘(.*).pileup’)

def Coverages(pileupfile,maximum,minimum): ‘’’ Creates a cumulative histogram on read coverage (excluding coverage of 0) and prints out upper and lower coverage thresholds ‘’’ pileup=open(pileupfile) OutfilePrefix1 =PileupFileNamePattern.match(pileupfile) # define outfile OutfilePrefix=OutfilePrefix1.group(1)
Coverage=[] #Create an empty list for line in pileup: line=line.strip() splitted=line.split(‘\t’) # Split the lines by tab separators if abs(int(splitted[3])) > 0: # excluding all coverages of 0 Coverage.append(abs(int(splitted[3]))) # Pick out the 4th column, which is the number of reads that covered this position # Remove coverages of 0 CoverageAboveZero=[] for c in Coverage: if c>0:
CoverageAboveZero.append(c) # sort the coverages CoverageAboveZeroSorted = CoverageAboveZero.sort() TotalBasesCovered = len(CoverageAboveZeroSorted) # Identify the lower 2% limit LowerLimit=CoverageAboveZeroSorted[int((0.02TotalBasesCovered)-1)] # Identify the upper 2% limit UpperLimit=CoverageAboveZeroSorted[int((0.98TotalBasesCovered)-1)] # Calculating the standard deviation mean=float(sum(CoverageAboveZeroSorted)/len(CoverageAboveZeroSorted)) total = 0.0 for value in CoverageAboveZeroSorted: total += (value - mean) ** 2 stddev = math.sqrt(total / (len(CoverageAboveZeroSorted)-1)) print pileupfile, ‘Coverage - mean:’, ‘%10.2f’ % mean print pileupfile, ‘Coverage - max:’, ‘%10i’ % max(CoverageAboveZeroSorted) print pileupfile, ‘Coverage - min:’, ‘%10i’ % min(CoverageAboveZeroSorted) print pileupfile, ‘Coverage - mean-2stddev:’, ‘%10i’ % mean-2stddev print pileupfile, ‘Coverage - mean+2stddev:’, ‘%10i’ % mean+2stddev print pileupfile, ‘Coverage - upper 2%:’, ‘%10i’ % UpperLimit print pileupfile, ‘Coverage - lower 2%:’, ‘%10i’ % LowerLimit

 # Drawing the historgram
 figure()
 hist(CoverageAboveZeroSorted, range=(minimum,maximum), bins=50, histtype='bar', facecolor='green', normed=1,cumulative=1)
 title('Cumulative Histogram')
 xlabel('Coverage')
 ylabel('Frequency')
 axis()
 grid(True)
 savefig(OutfilePrefix + 'Coverage_Histogram.png')

Coverages(str(args.pileup),int(args.max),int(args.min))