The python script I present in this blog post,
InSilicoTypeIIbDigestion.py
, fills the grap of most in-silico digest
programs: to simulate digestion with type IIb restriction enzymes.
Reduced-representation sequencing technologies are often based on DNA digestion with restriction enzymes followed by size-selection of the digested DNA fragments. For RAD-seq and ddRAD-seq, programs like SimRAD (an R package) or RestrictionDigest (a PERL module) allow to simulate restriction digest in order to estimate the number of fragments and, thus, to plan the required sequencing throughput to reach a certain coverage.
However, methods like 2bRAD and MethylRAD rely on type IIb restriction
enzymes that do not cut at the recognition sites but at a fixed number
of bases upstream and downstream of the recognition sites. Thus, they
cut out short double-stranded DNA of uniform length. The python script
InSilicoTypeIIbDigestion.py
estimates the number of recognition
sites in a fasta file and saves the short fragments of the digested
DNA to a separate fasta file.
For example, I use this script for in-silico digestion of the seagrass (Zostera marina) genome with the restriction enzyme FspeI (MethylRAD) in order to obtain a fasta file with potential restriction fragments. Mapping sequenced FspeI digests back to these fragments allows to estimate sample differences or changes in DNA methylation because FspeI only cuts when the recognition sites are fully methylated.
You can download the script InSilicoTypeIIbDigestion.py
from
https://gist.github.com/alj1983/a1bcb950013d8ffac9a7fb9ceb70b1c0/archive/2bec1913e2454a58074deafd3f7687612f3b9048.zip
These are the arguments for the InSilicoTypeIIbDigestion.py
script:
usage: InSilicoTypeIIbDigestion_corrected.py [-h] [-f FASTA] [-r RECOGNITION] [-d5 DISTANCE5PRIME] [-l LENGTH] Counts recognition sites of restriction endonucleases and extracts the fragments resulting from an in silico digest with type IIb enzymes optional arguments: -h, --help show this help message and exit -f FASTA, --fasta FASTA fasta file to extract from -r RECOGNITION, --recognition RECOGNITION Comma separated list of recognition sites -d5 DISTANCE5PRIME, --distance5prime DISTANCE5PRIME Comma separated list of distances (in bp, for each recognition site) from the first base of the recognition site to the cutting site in 5 prime direction -l LENGTH, --length LENGTH Comma separated list of total length (in bp) to cut out for each of the recognition sites
For example, to find the most common recognition sites of the FspeI enzyme, CCGG and CCWGG (where W stands for A or T), we can use:
The first C of the recognition site
python InSilicoTypeIIbDigestion.py \ -f Fastafile.fasta \ -r CCGG,CC[AT]GG \ -d5 14,13 \ -l 32,31
-d5
is 14 and 13, and l
is 32 and 31, for recognition site CCGG
and CCWGG, respectively.
The restriction sites can be given in form of regular expressions,
like CC[AT]GG
. This will give a summary for the number of CCWGG
sites. In order to get a summary for the two included restriction
sites, CCAGG
and CCTGG
would have to be provided seaprately.
Restriction sites for several type IIb restriction enzymes are
listed in this publication.
For example, when applying this script on the seagrass (Zostera marina) genome, I get the output information:
Total bases in fasta file: 204058416 CCGG recognition sites: 315518, density in fasta file: 646 CC[AT]GG recognition sites: 203340, density in fasta file: 1003
Here, density means the average number of bases between two recognition sites. It is calculated by dividing the total bases in the fasta file by the number of recognition sites.
In addition, the script saves the fasta file
Fastafile_digested.fasta
, which contains all the in silico digested
fragments.