marinetics blog


Bioinformatics Solutions in Ecology and Evolution

11 Apr 2017

In silico digestion with type IIB restricrion enzymes

Tagged: python NGS Epigenetics Fasta






Related Posts

The python script I present in this blog post, InSilicoTypeIIbDigestion.py, fills the grap of most in-silico digest programs: to simulate digestion with type IIb restriction enzymes.

Reduced-representation sequencing technologies are often based on DNA digestion with restriction enzymes followed by size-selection of the digested DNA fragments. For RAD-seq and ddRAD-seq, programs like SimRAD (an R package) or RestrictionDigest (a PERL module) allow to simulate restriction digest in order to estimate the number of fragments and, thus, to plan the required sequencing throughput to reach a certain coverage.

However, methods like 2bRAD and MethylRAD rely on type IIb restriction enzymes that do not cut at the recognition sites but at a fixed number of bases upstream and downstream of the recognition sites. Thus, they cut out short double-stranded DNA of uniform length. The python script InSilicoTypeIIbDigestion.py estimates the number of recognition sites in a fasta file and saves the short fragments of the digested DNA to a separate fasta file.

For example, I use this script for in-silico digestion of the seagrass (Zostera marina) genome with the restriction enzyme FspeI (MethylRAD) in order to obtain a fasta file with potential restriction fragments. Mapping sequenced FspeI digests back to these fragments allows to estimate sample differences or changes in DNA methylation because FspeI only cuts when the recognition sites are fully methylated.

How to use it?

These are the arguments for the InSilicoTypeIIbDigestion.py script:

usage: InSilicoTypeIIbDigestion_corrected.py [-h] [-f FASTA] [-r RECOGNITION]
                                             [-d5 DISTANCE5PRIME] [-l LENGTH]

Counts recognition sites of restriction endonucleases and extracts the
fragments resulting from an in silico digest with type IIb enzymes

optional arguments:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        fasta file to extract from
  -r RECOGNITION, --recognition RECOGNITION
                        Comma separated list of recognition sites
  -d5 DISTANCE5PRIME, --distance5prime DISTANCE5PRIME
                        Comma separated list of distances (in bp, for each
                        recognition site) from the first base of the
                        recognition site to the cutting site in 5 prime
                        direction
  -l LENGTH, --length LENGTH
                        Comma separated list of total length (in bp) to cut
                        out for each of the recognition sites

For example, to find the most common recognition sites of the FspeI enzyme, CCGG and CCWGG (where W stands for A or T), we can use:

The first C of the recognition site

python InSilicoTypeIIbDigestion.py \
-f Fastafile.fasta \
-r CCGG,CC[AT]GG \
-d5 14,13 \
-l 32,31

-d5 is 14 and 13, and l is 32 and 31, for recognition site CCGG and CCWGG, respectively.

The restriction sites can be given in form of regular expressions, like CC[AT]GG. This will give a summary for the number of CCWGG sites. In order to get a summary for the two included restriction sites, CCAGG and CCTGG would have to be provided seaprately. Restriction sites for several type IIb restriction enzymes are listed in this publication.

For example, when applying this script on the seagrass (Zostera marina) genome, I get the output information:

Total bases in fasta file: 204058416
CCGG recognition sites: 315518, density in fasta file: 646
CC[AT]GG recognition sites: 203340, density in fasta file: 1003

Here, density means the average number of bases between two recognition sites. It is calculated by dividing the total bases in the fasta file by the number of recognition sites.

In addition, the script saves the fasta file Fastafile_digested.fasta, which contains all the in silico digested fragments.