marinetics blog


Bioinformatics Solutions in Ecology and Evolution

19 Jul 2015

Extract target windows from a fasta file

Tagged: python NGS Fasta






Related Posts

You found a set of interesting SNPs that are putatively under selection? Here, I provide two python scripts that will help you to extract windows of putatively adaptive regions around these SNPs. Our working group, for example, uses fasta files with enrich putatively adaptive regions in targeted re-sequencing approaches.

The first of the two python scripts, WindowsFromFasta.py, extracts sequence windows from a reference fasta file based on sequence IDs and positions of putatively adaptive SNPs. The second script, MergeSequenceWindows.py, merges these windows if they are located on the same contig. The second script is appropriate for non-model species with fragmented contigs because the windows stretch likely across linked genetic regions. For model species with complete chromosomes or large coherent contigs (sequences with the same sequence ID), instead, the merged windows may enclose very distant and physically unlinked markers that should indeed not be combined in a single sequence window.

Download

You can download the script WindowsFromFasta.py from https://gist.github.com/alj1983/0d078e138a3ec0a2ee35/download and the script MergeSequenceWindows.py from https://gist.github.com/alj1983/aa80dbc6006082a2d579/download

How to use it?

These are the arguments for the WindowsFromFasta.py script:

WindowsFromFasta.py [-h] [-f FASTA] [-t TEXT] [-w WINDOWSIZE]
                           [-p POSITIONSPECIFIER]

Extracts sequences from a fasta file based on ID and sequence positions and
writes them to a new fasta file

optional arguments:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        fasta file to extract from
  -t TEXT, --text TEXT  tab-delimited text file containing the 
                        sequence-id in the first column and the
                        position of the targeted window-centers 
                        in the second column.
  -w WINDOWSIZE, --windowsize WINDOWSIZE
                        sum of the number of basepairs preceding and 
                        following the position that is specified in 
                        the text file. A window of 10 kb means that 
                        5 kb shall precede and 5 kb shall follow the 
                        speciied position.
  -p POSITIONSPECIFIER, --positionspecifier POSITIONSPECIFIER
                        string that specified the position which is 
                        provided in the second column of the text 
                        file. This could be for example 'SNP' or 
                        'windowcenter' and will appear as comment
                        following the sequence IDs in the output 
                        fasta files.

For example, to extract windows of 10 kbp around putatively adaptive SNPs from the genome of Guppy, we can use the following command:

python WindowsFromFasta.py \
-f GuppyGenome.fasta \
-t SNPsUnderPositiveSelection.txt \
-p 'SNP under positive selection' \
-w 10000 \
> Guppy_SNPsPositiveSelection.fasta

The text file SNPsUnderPositiveSelection.txt contains sequence-identifiers and locations of putatively adaptive SNPs in the following tab-delimited format:

gi|612486994|gb|AZHG01000244.1| 38922
gi|612486994|gb|AZHG01000244.1| 38963
gi|612486902|gb|AZHG01000336.1| 1666
gi|612486902|gb|AZHG01000336.1| 1716
gi|612486902|gb|AZHG01000336.1| 1732
gi|612486902|gb|AZHG01000336.1| 1838

The sequence-identifiers must match the sequence identifiers in the fasta file (here, GuppyGenome.fasta)

The output fasta file, in this case Guppy_SNPsPositiveSelection.fasta, contains sequence windows around the specified SNP locations. If the contig-sequence in the reference fasta file ends before the target sequence window is complete, the right-hand side of the window will be truncated to the last base of the contig.

On the other hand, if a SNPs can not be preceded by 5000 nucleotide bases, the window starts with the first available base, like in this example:

>gi|612486902|gb|AZHG01000336.1| window: 1-6666, SNP under positive selection at 1666

Here, although the target window was 10 kbp, the actual window is 6666 bp long.

In addition, if different SNPs are surrounded by the exact same window, only one single window will be extracted. The sequence id informs about all the SNP locations in this window, like in this case:

>gi|612435299|gb|AZHG01032030.1| window: 1-4331, SNP under positive selection at 1173, SNP under positive selection at 1448

Here, both ends are truncated. The sequence window spans the entire length of the contig, which is 4331 bp long.

The next step is to merge windows from the same contig with the MergeFasta.py script. The arguments are:

usage: MergeFasta.py [-h] [-f FASTA [FASTA ...]] [-r REFERENCEFASTA]

Merges fasta files, removes duplicate sequences and merges the sequence-ids of
duplicate sequences

optional arguments:
  -h, --help            show this help message and exit
  -f FASTA [FASTA ...], --fasta FASTA [FASTA ...]
                        fasta files to merge
  -r REFERENCEFASTA, --referencefasta REFERENCEFASTA
                        fasta file of the genome/transcriptome that was 
                        used to extract the windows in the single fasta 
                        files.

The input fasta file(s) are the output files from the WindowsFromFasta.py script. If more than one file is provided, the windows from all fasta files are merged if they were extracted from the same contig. So, for example, the following command merges windows that are putatively under positive selection with windows that are putatively under negative selection.

python MergeFasta.py \
-f Guppy_SNPsPositiveSelection.fasta Guppy_SNPsNegativeSelection.fasta \
-r GuppyGenome.fasta \
 > GuppyWindowsMerged.fasta

One merged window had, for example, the following fasta identifier:

> gi|612477484|gb|AZHG01009754.1| window with 10000 bp: 1-10000, SNP under negative selection at 5000, SNP under positive selection at 5000

THe window identifier informs about the sequence id; the length of the extracted and merged window, along with its start and end positions; and it informs about the location of putatively adaptive SNPs. To count the number of bases in the final fasta file with all merged windows, you can use the perl script CountBasesInFastaFile.pl.

Code

WindowsFromFasta.py

#!/usr/bin/env python 

import re # to use regular expressions
import argparse # To allow users setting flags in order to define what to filter for.
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord


parser = argparse.ArgumentParser(description="Extracts sequence windows from a fasta file based on sequence IDs and positions and writes them to a new fasta file")
parser.add_argument("-f","--fasta", help="fasta file to extract from")
parser.add_argument("-t","--text", help="Tab-delimited text file containing the sequence-id in the first column and the
                        position of the targeted window-centers in the second column")
parser.add_argument("-w","--windowsize", help="sum of the number of basepairs preceding and following the position that is specified in the text file. A window of 10 kb means that 5 kb shall precede and 5 kb shall follow the speciied position.")
parser.add_argument("-p","--positionspecifier", help="string that specified the position which is provided in the second column of the text file. This could be for example 'SNP' or 'windowcenter' and will appear as comment following the sequence IDs in the output fasta files")
args=parser.parse_args()


def WindowsFromFasta(fasta,text,windowsize,positionspecifier):
        '''
        Extracts sequence windows from a fasta file based on sequence IDs and positions and writes them to a new fasta file
        '''

        windowsize = int(windowsize)

        iterator = SeqIO.parse(fasta, "fasta") # Open the fasta file 
        fastadictionary = SeqIO.to_dict(iterator) # Create a dictionary from the fasta file

        # open the text file with the sequence IDs and the start and end positions
        textfile = open(text,'U')
        unique = {} # serves to connect Sequence IDs to fasta comments 

        for line in textfile:
            line= line.strip().split('\t') # read from a tab-delimited text file
            # a combination of the identifier, and the sequence position (the center of the window that shall be extracted).

            p = int(line[1])
            s = int(line[1])-windowsize/2# If this is negative, I want to set it to 1
            if s<1: s=1
            e = int(line[1])+windowsize/2 # An index that is too large doesn't give an error if it is used in slicing. So, if the string has no base at position 20000, it will print out the last base before that. However, I want to inform then what the last base is
            if e > int(len(fastadictionary[line[0]].seq)): e=int(len(fastadictionary[line[0]].seq))

            outname = line[0]

            # Remove duplicate windows but keep all positionspecifiers

            u = str(outname + str(s) + '-' + str(e))
            if u not in unique:
                unique[u] =  'window: ' + str(s) + '-' + str(e) + ', ' + positionspecifier + ' at ' + str(line[1]) # Create a comment for this record      
            else:
                existingcomment = unique[u]
                unique[u] =  existingcomment +  ', ' +positionspecifier + ' at ' + str(line[1]) # update comment for this record

        textfile.close()

        textfile = open(text,'U')
        unique2=[] # serves to check for duplicate windows

        for l in textfile:
            l= l.strip().split('\t') # read from a tab-delimited text file
            # a combination of the identifier, and the sequence position (the center of the window that shall be extracted).

            p = int(l[1])
            s = int(l[1])-windowsize/2# If this is negative, I want to set it to 1
            if s<1: s=1
            e = int(l[1])+windowsize/2 # An index that is too large doesn't give an error if it is used in slicing. So, if the string has no base at position 20000, it will print out the last base before that. However, I want to inform then what the last base is
            if e > int(len(fastadictionary[l[0]].seq)): e=int(len(fastadictionary[l[0]].seq))

            outname = l[0]

            # Print windows only once but with all positionspecifiers
            u = str(outname + str(s) + '-' + str(e))
            if u not in unique2:
                unique2.append(u)
                sequence_record=SeqRecord(fastadictionary[l[0]].seq[s-1:e],id = outname,description = unique[u])
                print sequence_record.format("fasta")




WindowsFromFasta(str(args.fasta),str(args.text),int(args.windowsize),str(args.positionspecifier))

MergeSequenceWindows.py

#!/usr/bin/env python 


import re # to use regular expressions
import argparse # To allow users setting flags in order to define what to filter for.
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

WindowIdentificationPattern = re.compile('.*window: (\d+)-(\d+).*') # Extract the window start and end positions from the fasta descriptions
parser = argparse.ArgumentParser(description="Merges windows that were extracted from the same contig")
parser.add_argument("-f","--fasta", help="fasta files to merge",nargs='+')
parser.add_argument("-r","--referencefasta", help="fasta file of the genome/transcriptome that was used to extract the windows in the single fasta files")
args=parser.parse_args()


def MergeSequenceWindows(fasta,referencefasta):
        '''
        Merges windows that were extracted from the same contig
        '''

        openreferencefasta = SeqIO.parse(referencefasta, "fasta") # Open the fasta file 
        referencefastadictionary = SeqIO.to_dict(openreferencefasta) # Create a dictionary from the fasta file

        uniquewindows = {} # serves to connect contigs to different extracted windows
        uniquedescriptions = {} # serves to connect contigs to different descriptions (location of SNPs or windowcenters)
        for f in fasta:
            iterator = SeqIO.parse(f, "fasta") # Open the fasta file 
            #fastadictionary = SeqIO.to_dict(iterator) # Create a dictionary from the fasta file
            records = list(iterator)
            for sequence in records:
                #print sequence.description
                windowidentification = WindowIdentificationPattern.search(str(sequence.description))
                actualwindowstart = windowidentification.group(1)
                actualwindowend = windowidentification.group(2)

                # Get the single SNPs, or windowcenters in a list (will be added to the sequence description)
                l= str(sequence.description).strip().split(',')
                actualdescriptions = list(l[1:len(l)])
                if sequence.id not in uniquewindows:
                    uniquewindows[sequence.id] = [actualwindowstart,actualwindowend] # This saves the start and end of the window
                    uniquedescriptions[sequence.id] = actualdescriptions
                else:
                    existingwindowstart = uniquewindows[sequence.id][0]
                    existingwindowend = uniquewindows[sequence.id][1]
                    minimumwindowstart = min([int(existingwindowstart),int(actualwindowstart)])
                    maximumwindowend = max([int(existingwindowend),int(actualwindowend)])
                    uniquewindows[sequence.id] = [int(minimumwindowstart),int(maximumwindowend)]

                    existingdescriptions = uniquedescriptions[sequence.id]
                    existingdescriptions.extend(actualdescriptions)
                    uniquedescriptions[sequence.id] = existingdescriptions


        for uniqueid in uniquewindows:
            actualsequence = referencefastadictionary[uniqueid].seq
            actualwindowstart = int(uniquewindows[uniqueid][0])
            actualwindowend = int(uniquewindows[uniqueid][1])
            actualsequencewindow=actualsequence[actualwindowstart-1:actualwindowend-1]
            windowlength = actualwindowend-actualwindowstart+1
            actualdescriptions = str('window with ' + str(windowlength) + ' bp: ' + str(actualwindowstart) + '-' + str(actualwindowend) + ', ' + ','.join(uniquedescriptions[uniqueid]))
            sequence_record=SeqRecord(actualsequencewindow,id = str(uniqueid),description = actualdescriptions)                
            print sequence_record.format("fasta")               

MergeSequenceWindows(args.fasta,str(args.referencefasta))