You found a set of interesting SNPs that are putatively under selection? Here, I provide two python scripts that will help you to extract windows of putatively adaptive regions around these SNPs. Our working group, for example, uses fasta files with enrich putatively adaptive regions in targeted re-sequencing approaches.
The first of the two python scripts, WindowsFromFasta.py
, extracts
sequence windows from a reference fasta file based on sequence IDs and
positions of putatively adaptive SNPs. The second script,
MergeSequenceWindows.py
, merges these windows if they are located on
the same contig. The second script is appropriate for non-model
species with fragmented contigs because the windows stretch likely
across linked genetic regions. For model species with complete
chromosomes or large coherent contigs (sequences with the same
sequence ID), instead, the merged windows may enclose very distant and
physically unlinked markers that should indeed not be combined in a
single sequence window.
You can download the script WindowsFromFasta.py
from
https://gist.github.com/alj1983/0d078e138a3ec0a2ee35/download and the
script MergeSequenceWindows.py
from
https://gist.github.com/alj1983/aa80dbc6006082a2d579/download
These are the arguments for the WindowsFromFasta.py
script:
WindowsFromFasta.py [-h] [-f FASTA] [-t TEXT] [-w WINDOWSIZE] [-p POSITIONSPECIFIER] Extracts sequences from a fasta file based on ID and sequence positions and writes them to a new fasta file optional arguments: -h, --help show this help message and exit -f FASTA, --fasta FASTA fasta file to extract from -t TEXT, --text TEXT tab-delimited text file containing the sequence-id in the first column and the position of the targeted window-centers in the second column. -w WINDOWSIZE, --windowsize WINDOWSIZE sum of the number of basepairs preceding and following the position that is specified in the text file. A window of 10 kb means that 5 kb shall precede and 5 kb shall follow the speciied position. -p POSITIONSPECIFIER, --positionspecifier POSITIONSPECIFIER string that specified the position which is provided in the second column of the text file. This could be for example 'SNP' or 'windowcenter' and will appear as comment following the sequence IDs in the output fasta files.
For example, to extract windows of 10 kbp around putatively adaptive SNPs from the genome of Guppy, we can use the following command:
python WindowsFromFasta.py \ -f GuppyGenome.fasta \ -t SNPsUnderPositiveSelection.txt \ -p 'SNP under positive selection' \ -w 10000 \ > Guppy_SNPsPositiveSelection.fasta
The text file SNPsUnderPositiveSelection.txt
contains
sequence-identifiers and locations of putatively adaptive SNPs in the
following tab-delimited format:
gi|612486994|gb|AZHG01000244.1| 38922 gi|612486994|gb|AZHG01000244.1| 38963 gi|612486902|gb|AZHG01000336.1| 1666 gi|612486902|gb|AZHG01000336.1| 1716 gi|612486902|gb|AZHG01000336.1| 1732 gi|612486902|gb|AZHG01000336.1| 1838
The sequence-identifiers must match the sequence
identifiers in the fasta file (here, GuppyGenome.fasta
)
The output fasta file, in this case
Guppy_SNPsPositiveSelection.fasta
, contains sequence windows around
the specified SNP locations. If the contig-sequence in the reference
fasta file ends before the target sequence window is complete, the
right-hand side of the window will be truncated to the last base of
the contig.
On the other hand, if a SNPs can not be preceded by 5000 nucleotide bases, the window starts with the first available base, like in this example:
>gi|612486902|gb|AZHG01000336.1| window: 1-6666, SNP under positive selection at 1666
Here, although the target window was 10 kbp, the actual window is 6666 bp long.
In addition, if different SNPs are surrounded by the exact same window, only one single window will be extracted. The sequence id informs about all the SNP locations in this window, like in this case:
>gi|612435299|gb|AZHG01032030.1| window: 1-4331, SNP under positive selection at 1173, SNP under positive selection at 1448
Here, both ends are truncated. The sequence window spans the entire length of the contig, which is 4331 bp long.
The next step is to merge windows from the same contig with the
MergeFasta.py
script. The arguments are:
usage: MergeFasta.py [-h] [-f FASTA [FASTA ...]] [-r REFERENCEFASTA]
Merges fasta files, removes duplicate sequences and merges the sequence-ids of
duplicate sequences
optional arguments:
-h, --help show this help message and exit
-f FASTA [FASTA ...], --fasta FASTA [FASTA ...]
fasta files to merge
-r REFERENCEFASTA, --referencefasta REFERENCEFASTA
fasta file of the genome/transcriptome that was
used to extract the windows in the single fasta
files.
The input fasta file(s) are the output files from the
WindowsFromFasta.py
script. If more than one file is provided, the
windows from all fasta files are merged if they were extracted from
the same contig. So, for example, the following command merges
windows that are putatively under positive selection with windows that
are putatively under negative selection.
python MergeFasta.py \ -f Guppy_SNPsPositiveSelection.fasta Guppy_SNPsNegativeSelection.fasta \ -r GuppyGenome.fasta \ > GuppyWindowsMerged.fasta
One merged window had, for example, the following fasta identifier:
> gi|612477484|gb|AZHG01009754.1| window with 10000 bp: 1-10000, SNP under negative selection at 5000, SNP under positive selection at 5000
THe window identifier informs about the sequence id; the length of the extracted and merged window, along with its start and end positions; and it informs about the location of putatively adaptive SNPs. To count the number of bases in the final fasta file with all merged windows, you can use the perl script CountBasesInFastaFile.pl.
#!/usr/bin/env python import re # to use regular expressions import argparse # To allow users setting flags in order to define what to filter for. from Bio import SeqIO from Bio.SeqRecord import SeqRecord parser = argparse.ArgumentParser(description="Extracts sequence windows from a fasta file based on sequence IDs and positions and writes them to a new fasta file") parser.add_argument("-f","--fasta", help="fasta file to extract from") parser.add_argument("-t","--text", help="Tab-delimited text file containing the sequence-id in the first column and the position of the targeted window-centers in the second column") parser.add_argument("-w","--windowsize", help="sum of the number of basepairs preceding and following the position that is specified in the text file. A window of 10 kb means that 5 kb shall precede and 5 kb shall follow the speciied position.") parser.add_argument("-p","--positionspecifier", help="string that specified the position which is provided in the second column of the text file. This could be for example 'SNP' or 'windowcenter' and will appear as comment following the sequence IDs in the output fasta files") args=parser.parse_args() def WindowsFromFasta(fasta,text,windowsize,positionspecifier): ''' Extracts sequence windows from a fasta file based on sequence IDs and positions and writes them to a new fasta file ''' windowsize = int(windowsize) iterator = SeqIO.parse(fasta, "fasta") # Open the fasta file fastadictionary = SeqIO.to_dict(iterator) # Create a dictionary from the fasta file # open the text file with the sequence IDs and the start and end positions textfile = open(text,'U') unique = {} # serves to connect Sequence IDs to fasta comments for line in textfile: line= line.strip().split('\t') # read from a tab-delimited text file # a combination of the identifier, and the sequence position (the center of the window that shall be extracted). p = int(line[1]) s = int(line[1])-windowsize/2# If this is negative, I want to set it to 1 if s<1: s=1 e = int(line[1])+windowsize/2 # An index that is too large doesn't give an error if it is used in slicing. So, if the string has no base at position 20000, it will print out the last base before that. However, I want to inform then what the last base is if e > int(len(fastadictionary[line[0]].seq)): e=int(len(fastadictionary[line[0]].seq)) outname = line[0] # Remove duplicate windows but keep all positionspecifiers u = str(outname + str(s) + '-' + str(e)) if u not in unique: unique[u] = 'window: ' + str(s) + '-' + str(e) + ', ' + positionspecifier + ' at ' + str(line[1]) # Create a comment for this record else: existingcomment = unique[u] unique[u] = existingcomment + ', ' +positionspecifier + ' at ' + str(line[1]) # update comment for this record textfile.close() textfile = open(text,'U') unique2=[] # serves to check for duplicate windows for l in textfile: l= l.strip().split('\t') # read from a tab-delimited text file # a combination of the identifier, and the sequence position (the center of the window that shall be extracted). p = int(l[1]) s = int(l[1])-windowsize/2# If this is negative, I want to set it to 1 if s<1: s=1 e = int(l[1])+windowsize/2 # An index that is too large doesn't give an error if it is used in slicing. So, if the string has no base at position 20000, it will print out the last base before that. However, I want to inform then what the last base is if e > int(len(fastadictionary[l[0]].seq)): e=int(len(fastadictionary[l[0]].seq)) outname = l[0] # Print windows only once but with all positionspecifiers u = str(outname + str(s) + '-' + str(e)) if u not in unique2: unique2.append(u) sequence_record=SeqRecord(fastadictionary[l[0]].seq[s-1:e],id = outname,description = unique[u]) print sequence_record.format("fasta") WindowsFromFasta(str(args.fasta),str(args.text),int(args.windowsize),str(args.positionspecifier))
#!/usr/bin/env python import re # to use regular expressions import argparse # To allow users setting flags in order to define what to filter for. from Bio import SeqIO from Bio.SeqRecord import SeqRecord WindowIdentificationPattern = re.compile('.*window: (\d+)-(\d+).*') # Extract the window start and end positions from the fasta descriptions parser = argparse.ArgumentParser(description="Merges windows that were extracted from the same contig") parser.add_argument("-f","--fasta", help="fasta files to merge",nargs='+') parser.add_argument("-r","--referencefasta", help="fasta file of the genome/transcriptome that was used to extract the windows in the single fasta files") args=parser.parse_args() def MergeSequenceWindows(fasta,referencefasta): ''' Merges windows that were extracted from the same contig ''' openreferencefasta = SeqIO.parse(referencefasta, "fasta") # Open the fasta file referencefastadictionary = SeqIO.to_dict(openreferencefasta) # Create a dictionary from the fasta file uniquewindows = {} # serves to connect contigs to different extracted windows uniquedescriptions = {} # serves to connect contigs to different descriptions (location of SNPs or windowcenters) for f in fasta: iterator = SeqIO.parse(f, "fasta") # Open the fasta file #fastadictionary = SeqIO.to_dict(iterator) # Create a dictionary from the fasta file records = list(iterator) for sequence in records: #print sequence.description windowidentification = WindowIdentificationPattern.search(str(sequence.description)) actualwindowstart = windowidentification.group(1) actualwindowend = windowidentification.group(2) # Get the single SNPs, or windowcenters in a list (will be added to the sequence description) l= str(sequence.description).strip().split(',') actualdescriptions = list(l[1:len(l)]) if sequence.id not in uniquewindows: uniquewindows[sequence.id] = [actualwindowstart,actualwindowend] # This saves the start and end of the window uniquedescriptions[sequence.id] = actualdescriptions else: existingwindowstart = uniquewindows[sequence.id][0] existingwindowend = uniquewindows[sequence.id][1] minimumwindowstart = min([int(existingwindowstart),int(actualwindowstart)]) maximumwindowend = max([int(existingwindowend),int(actualwindowend)]) uniquewindows[sequence.id] = [int(minimumwindowstart),int(maximumwindowend)] existingdescriptions = uniquedescriptions[sequence.id] existingdescriptions.extend(actualdescriptions) uniquedescriptions[sequence.id] = existingdescriptions for uniqueid in uniquewindows: actualsequence = referencefastadictionary[uniqueid].seq actualwindowstart = int(uniquewindows[uniqueid][0]) actualwindowend = int(uniquewindows[uniqueid][1]) actualsequencewindow=actualsequence[actualwindowstart-1:actualwindowend-1] windowlength = actualwindowend-actualwindowstart+1 actualdescriptions = str('window with ' + str(windowlength) + ' bp: ' + str(actualwindowstart) + '-' + str(actualwindowend) + ', ' + ','.join(uniquedescriptions[uniqueid])) sequence_record=SeqRecord(actualsequencewindow,id = str(uniqueid),description = actualdescriptions) print sequence_record.format("fasta") MergeSequenceWindows(args.fasta,str(args.referencefasta))