In silico digestion with type IIB restricrion enzymes
Published:
The python script I present in this blog post, programs: to simulate digestion with type IIb restriction enzymes.
Reduced-representation sequencing technologies are often based on DNA digestion with restriction enzymes followed by size-selection of the digested DNA fragments. For RAD-seq and ddRAD-seq, programs like SimRAD (an R package) or RestrictionDigest (a PERL module) allow to simulate restriction digest in order to estimate the number of fragments and, thus, to plan the required sequencing throughput to reach a certain coverage.
However, methods like 2bRAD and MethylRAD rely on type IIb restriction enzymes that do not cut at the recognition sites but at a fixed number of bases upstream and downstream of the recognition sites. Thus, they cut out short double-stranded DNA of uniform length. The python script in a fasta file and saves the short fragments of the digested DNA to a separate fasta file.
For example, I use this script for in-silico digestion of the seagrass (Zostera marina) genome with the restriction enzyme FspeI (MethylRAD) in order to obtain a fasta file with potential restriction fragments. Mapping sequenced FspeI digests back to these fragments allows to estimate sample differences or changes in DNA methylation because FspeI only cuts when the recognition sites are fully methylated.
Download
You can download the script InSilicoTypeIIbDigestion.py
from
https://gist.github.com/alj1983/a1bcb950013d8ffac9a7fb9ceb70b1c0/archive/2bec1913e2454a58074deafd3f7687612f3b9048.zip
How to use it?
These are the arguments for the InSilicoTypeIIbDigestion.py
script:
usage: InSilicoTypeIIbDigestion_corrected.py [-h] [-f FASTA] [-r RECOGNITION] [-d5 DISTANCE5PRIME] [-l LENGTH]
Counts recognition sites of restriction endonucleases and extracts the fragments resulting from an in silico digest with type IIb enzymes
optional arguments: -h, –help show this help message and exit -f FASTA, –fasta FASTA fasta file to extract from -r RECOGNITION, –recognition RECOGNITION Comma separated list of recognition sites -d5 DISTANCE5PRIME, –distance5prime DISTANCE5PRIME Comma separated list of distances (in bp, for each recognition site) from the first base of the recognition site to the cutting site in 5 prime direction -l LENGTH, –length LENGTH Comma separated list of total length (in bp) to cut out for each of the recognition sites
For example, to find the most common recognition sites of the FspeI enzyme, CCGG and CCWGG (where W stands for A or T), we can use:
The first C of the recognition site
python InSilicoTypeIIbDigestion.py
-f Fastafile.fasta
-r CCGG,CC[AT]GG
-d5 14,13
-l 32,31
CCWGG, respectively.
The restriction sites can be given in form of regular expressions, like
order to get a summary for the two included restriction sites, CCAGG
and CCTGG
would have to be provided seaprately. Restriction sites for
several type IIb restriction enzymes are listed in this
publication.
For example, when applying this script on the seagrass (Zostera marina) genome, I get the output information:
Total bases in fasta file: 204058416
CCGG recognition sites: 315518, density in fasta file: 646
CC[AT]GG recognition sites: 203340, density in fasta file: 1003
Here, density means the average number of bases between two recognition sites. It is calculated by dividing the total bases in the fasta file by the number of recognition sites.
In addition, the script saves the fasta file Fastafile_digested.fasta
,
which contains all the in silico digested fragments.