Extract target windows from a fasta file
Published:
You found a set of interesting SNPs that are putatively under selection? Here, I provide two python scripts that will help you to extract windows of putatively adaptive regions around these SNPs. Our working group, for example, uses fasta files with enrich putatively adaptive regions in targeted re-sequencing approaches.
The first of the two python scripts, WindowsFromFasta.py
, extracts
sequence windows from a reference fasta file based on sequence IDs and
positions of putatively adaptive SNPs. The second script,
the same contig. The second script is appropriate for non-model species
with fragmented contigs because the windows stretch likely across linked
genetic regions. For model species with complete chromosomes or large
coherent contigs (sequences with the same sequence ID), instead, the
merged windows may enclose very distant and physically unlinked markers
that should indeed not be combined in a single sequence window.
Download
You can download the script WindowsFromFasta.py
from
https://gist.github.com/alj1983/0d078e138a3ec0a2ee35/download and the
script MergeSequenceWindows.py
from
https://gist.github.com/alj1983/aa80dbc6006082a2d579/download
How to use it?
These are the arguments for the WindowsFromFasta.py
script:
WindowsFromFasta.py [-h] [-f FASTA] [-t TEXT] [-w WINDOWSIZE] [-p POSITIONSPECIFIER]
Extracts sequences from a fasta file based on ID and sequence positions and writes them to a new fasta file
optional arguments: -h, –help show this help message and exit -f FASTA, –fasta FASTA fasta file to extract from -t TEXT, –text TEXT tab-delimited text file containing the sequence-id in the first column and the position of the targeted window-centers in the second column. -w WINDOWSIZE, –windowsize WINDOWSIZE sum of the number of basepairs preceding and following the position that is specified in the text file. A window of 10 kb means that 5 kb shall precede and 5 kb shall follow the speciied position. -p POSITIONSPECIFIER, –positionspecifier POSITIONSPECIFIER string that specified the position which is provided in the second column of the text file. This could be for example ‘SNP’ or ‘windowcenter’ and will appear as comment following the sequence IDs in the output fasta files.
For example, to extract windows of 10 kbp around putatively adaptive SNPs from the genome of Guppy, we can use the following command:
python WindowsFromFasta.py
-f GuppyGenome.fasta
-t SNPsUnderPositiveSelection.txt
-p ‘SNP under positive selection’
-w 10000 \
Guppy_SNPsPositiveSelection.fasta
The text file SNPsUnderPositiveSelection.txt
contains
sequence-identifiers and locations of putatively adaptive SNPs in the
following tab-delimited format:
gi | 612486994 | gb | AZHG01000244.1 | 38922 |
gi | 612486994 | gb | AZHG01000244.1 | 38963 |
gi | 612486902 | gb | AZHG01000336.1 | 1666 |
gi | 612486902 | gb | AZHG01000336.1 | 1716 |
gi | 612486902 | gb | AZHG01000336.1 | 1732 |
gi | 612486902 | gb | AZHG01000336.1 | 1838 |
The sequence-identifiers must match the sequence identifiers in the
fasta file (here, GuppyGenome.fasta
)
The output fasta file, in this case Guppy_SNPsPositiveSelection.fasta
,
contains sequence windows around the specified SNP locations. If the
contig-sequence in the reference fasta file ends before the target
sequence window is complete, the right-hand side of the window will be
truncated to the last base of the contig.
On the other hand, if a SNPs can not be preceded by 5000 nucleotide bases, the window starts with the first available base, like in this example:
gi 612486902 gb AZHG01000336.1 window: 1-6666, SNP under positive selection at 1666
Here, although the target window was 10 kbp, the actual window is 6666 bp long.
In addition, if different SNPs are surrounded by the exact same window, only one single window will be extracted. The sequence id informs about all the SNP locations in this window, like in this case:
gi 612435299 gb AZHG01032030.1 window: 1-4331, SNP under positive selection at 1173, SNP under positive selection at 1448
Here, both ends are truncated. The sequence window spans the entire length of the contig, which is 4331 bp long.
The next step is to merge windows from the same contig with the
usage: MergeFasta.py [-h] [-f FASTA [FASTA …]] [-r REFERENCEFASTA]
Merges fasta files, removes duplicate sequences and merges the sequence-ids of duplicate sequences
optional arguments: -h, –help show this help message and exit -f FASTA [FASTA …], –fasta FASTA [FASTA …] fasta files to merge -r REFERENCEFASTA, –referencefasta REFERENCEFASTA fasta file of the genome/transcriptome that was used to extract the windows in the single fasta files.
The input fasta file(s) are the output files from the windows from all fasta files are merged if they were extracted from the same contig. So, for example, the following command merges windows that are putatively under positive selection with windows that are putatively under negative selection.
python MergeFasta.py
-f Guppy_SNPsPositiveSelection.fasta Guppy_SNPsNegativeSelection.fasta
-r GuppyGenome.fasta \
GuppyWindowsMerged.fasta
One merged window had, for example, the following fasta identifier:
gi 612477484 gb AZHG01009754.1 window with 10000 bp: 1-10000, SNP under negative selection at 5000, SNP under positive selection at 5000
THe window identifier informs about the sequence id; the length of the extracted and merged window, along with its start and end positions; and it informs about the location of putatively adaptive SNPs. To count the number of bases in the final fasta file with all merged windows, you can use the perl script CountBasesInFastaFile.pl.
Code
WindowsFromFasta.py
#!/usr/bin/env python
import re # to use regular expressions import argparse # To allow users setting flags in order to define what to filter for. from Bio import SeqIO from Bio.SeqRecord import SeqRecord
parser = argparse.ArgumentParser(description=”Extracts sequence windows from a fasta file based on sequence IDs and positions and writes them to a new fasta file”) parser.add_argument(“-f”,”–fasta”, help=”fasta file to extract from”) parser.add_argument(“-t”,”–text”, help=”Tab-delimited text file containing the sequence-id in the first column and the position of the targeted window-centers in the second column”) parser.add_argument(“-w”,”–windowsize”, help=”sum of the number of basepairs preceding and following the position that is specified in the text file. A window of 10 kb means that 5 kb shall precede and 5 kb shall follow the speciied position.”) parser.add_argument(“-p”,”–positionspecifier”, help=”string that specified the position which is provided in the second column of the text file. This could be for example ‘SNP’ or ‘windowcenter’ and will appear as comment following the sequence IDs in the output fasta files”) args=parser.parse_args()
def WindowsFromFasta(fasta,text,windowsize,positionspecifier): ‘’’ Extracts sequence windows from a fasta file based on sequence IDs and positions and writes them to a new fasta file ‘’’
windowsize = int(windowsize)
iterator = SeqIO.parse(fasta, "fasta") # Open the fasta file
fastadictionary = SeqIO.to_dict(iterator) # Create a dictionary from the fasta file
# open the text file with the sequence IDs and the start and end positions
textfile = open(text,'U')
unique = {} # serves to connect Sequence IDs to fasta comments
for line in textfile:
line= line.strip().split('\t') # read from a tab-delimited text file
# a combination of the identifier, and the sequence position (the center of the window that shall be extracted).
p = int(line[1])
s = int(line[1])-windowsize/2# If this is negative, I want to set it to 1
if s<1: s=1
e = int(line[1])+windowsize/2 # An index that is too large doesn't give an error if it is used in slicing. So, if the string has no base at position 20000, it will print out the last base before that. However, I want to inform then what the last base is
if e > int(len(fastadictionary[line[0]].seq)): e=int(len(fastadictionary[line[0]].seq))
outname = line[0]
# Remove duplicate windows but keep all positionspecifiers
u = str(outname + str(s) + '-' + str(e))
if u not in unique:
unique[u] = 'window: ' + str(s) + '-' + str(e) + ', ' + positionspecifier + ' at ' + str(line[1]) # Create a comment for this record
else:
existingcomment = unique[u]
unique[u] = existingcomment + ', ' +positionspecifier + ' at ' + str(line[1]) # update comment for this record
textfile.close()
textfile = open(text,'U')
unique2=[] # serves to check for duplicate windows
for l in textfile:
l= l.strip().split('\t') # read from a tab-delimited text file
# a combination of the identifier, and the sequence position (the center of the window that shall be extracted).
p = int(l[1])
s = int(l[1])-windowsize/2# If this is negative, I want to set it to 1
if s<1: s=1
e = int(l[1])+windowsize/2 # An index that is too large doesn't give an error if it is used in slicing. So, if the string has no base at position 20000, it will print out the last base before that. However, I want to inform then what the last base is
if e > int(len(fastadictionary[l[0]].seq)): e=int(len(fastadictionary[l[0]].seq))
outname = l[0]
# Print windows only once but with all positionspecifiers
u = str(outname + str(s) + '-' + str(e))
if u not in unique2:
unique2.append(u)
sequence_record=SeqRecord(fastadictionary[l[0]].seq[s-1:e],id = outname,description = unique[u])
print sequence_record.format("fasta")
WindowsFromFasta(str(args.fasta),str(args.text),int(args.windowsize),str(args.positionspecifier))
MergeSequenceWindows.py
#!/usr/bin/env python
import re # to use regular expressions import argparse # To allow users setting flags in order to define what to filter for. from Bio import SeqIO from Bio.SeqRecord import SeqRecord
WindowIdentificationPattern = re.compile(‘.window: (\d+)-(\d+).’) # Extract the window start and end positions from the fasta descriptions parser = argparse.ArgumentParser(description=”Merges windows that were extracted from the same contig”) parser.add_argument(“-f”,”–fasta”, help=”fasta files to merge”,nargs=’+’) parser.add_argument(“-r”,”–referencefasta”, help=”fasta file of the genome/transcriptome that was used to extract the windows in the single fasta files”) args=parser.parse_args()
def MergeSequenceWindows(fasta,referencefasta): ‘’’ Merges windows that were extracted from the same contig ‘’’
openreferencefasta = SeqIO.parse(referencefasta, "fasta") # Open the fasta file
referencefastadictionary = SeqIO.to_dict(openreferencefasta) # Create a dictionary from the fasta file
uniquewindows = {} # serves to connect contigs to different extracted windows
uniquedescriptions = {} # serves to connect contigs to different descriptions (location of SNPs or windowcenters)
for f in fasta:
iterator = SeqIO.parse(f, "fasta") # Open the fasta file
#fastadictionary = SeqIO.to_dict(iterator) # Create a dictionary from the fasta file
records = list(iterator)
for sequence in records:
#print sequence.description
windowidentification = WindowIdentificationPattern.search(str(sequence.description))
actualwindowstart = windowidentification.group(1)
actualwindowend = windowidentification.group(2)
# Get the single SNPs, or windowcenters in a list (will be added to the sequence description)
l= str(sequence.description).strip().split(',')
actualdescriptions = list(l[1:len(l)])
if sequence.id not in uniquewindows:
uniquewindows[sequence.id] = [actualwindowstart,actualwindowend] # This saves the start and end of the window
uniquedescriptions[sequence.id] = actualdescriptions
else:
existingwindowstart = uniquewindows[sequence.id][0]
existingwindowend = uniquewindows[sequence.id][1]
minimumwindowstart = min([int(existingwindowstart),int(actualwindowstart)])
maximumwindowend = max([int(existingwindowend),int(actualwindowend)])
uniquewindows[sequence.id] = [int(minimumwindowstart),int(maximumwindowend)]
existingdescriptions = uniquedescriptions[sequence.id]
existingdescriptions.extend(actualdescriptions)
uniquedescriptions[sequence.id] = existingdescriptions
for uniqueid in uniquewindows:
actualsequence = referencefastadictionary[uniqueid].seq
actualwindowstart = int(uniquewindows[uniqueid][0])
actualwindowend = int(uniquewindows[uniqueid][1])
actualsequencewindow=actualsequence[actualwindowstart-1:actualwindowend-1]
windowlength = actualwindowend-actualwindowstart+1
actualdescriptions = str('window with ' + str(windowlength) + ' bp: ' + str(actualwindowstart) + '-' + str(actualwindowend) + ', ' + ','.join(uniquedescriptions[uniqueid]))
sequence_record=SeqRecord(actualsequencewindow,id = str(uniqueid),description = actualdescriptions)
print sequence_record.format("fasta")
MergeSequenceWindows(args.fasta,str(args.referencefasta))