Extract target windows from a fasta file

10 minute read

Published:

You found a set of interesting SNPs that are putatively under selection? Here, I provide two python scripts that will help you to extract windows of putatively adaptive regions around these SNPs. Our working group, for example, uses fasta files with enrich putatively adaptive regions in targeted re-sequencing approaches.

The first of the two python scripts, WindowsFromFasta.py, extracts sequence windows from a reference fasta file based on sequence IDs and positions of putatively adaptive SNPs. The second script, the same contig. The second script is appropriate for non-model species with fragmented contigs because the windows stretch likely across linked genetic regions. For model species with complete chromosomes or large coherent contigs (sequences with the same sequence ID), instead, the merged windows may enclose very distant and physically unlinked markers that should indeed not be combined in a single sequence window.

Download

You can download the script WindowsFromFasta.py from https://gist.github.com/alj1983/0d078e138a3ec0a2ee35/download and the script MergeSequenceWindows.py from https://gist.github.com/alj1983/aa80dbc6006082a2d579/download

How to use it?

These are the arguments for the WindowsFromFasta.py script:

WindowsFromFasta.py [-h] [-f FASTA] [-t TEXT] [-w WINDOWSIZE] [-p POSITIONSPECIFIER]

Extracts sequences from a fasta file based on ID and sequence positions and writes them to a new fasta file

optional arguments: -h, –help show this help message and exit -f FASTA, –fasta FASTA fasta file to extract from -t TEXT, –text TEXT tab-delimited text file containing the sequence-id in the first column and the position of the targeted window-centers in the second column. -w WINDOWSIZE, –windowsize WINDOWSIZE sum of the number of basepairs preceding and following the position that is specified in the text file. A window of 10 kb means that 5 kb shall precede and 5 kb shall follow the speciied position. -p POSITIONSPECIFIER, –positionspecifier POSITIONSPECIFIER string that specified the position which is provided in the second column of the text file. This could be for example ‘SNP’ or ‘windowcenter’ and will appear as comment following the sequence IDs in the output fasta files.

For example, to extract windows of 10 kbp around putatively adaptive SNPs from the genome of Guppy, we can use the following command:

python WindowsFromFasta.py
-f GuppyGenome.fasta
-t SNPsUnderPositiveSelection.txt
-p ‘SNP under positive selection’
-w 10000 \

Guppy_SNPsPositiveSelection.fasta

The text file SNPsUnderPositiveSelection.txt contains sequence-identifiers and locations of putatively adaptive SNPs in the following tab-delimited format:

gi 612486994 gb AZHG01000244.1 38922
gi 612486994 gb AZHG01000244.1 38963
gi 612486902 gb AZHG01000336.1 1666
gi 612486902 gb AZHG01000336.1 1716
gi 612486902 gb AZHG01000336.1 1732
gi 612486902 gb AZHG01000336.1 1838

The sequence-identifiers must match the sequence identifiers in the fasta file (here, GuppyGenome.fasta)

The output fasta file, in this case Guppy_SNPsPositiveSelection.fasta, contains sequence windows around the specified SNP locations. If the contig-sequence in the reference fasta file ends before the target sequence window is complete, the right-hand side of the window will be truncated to the last base of the contig.

On the other hand, if a SNPs can not be preceded by 5000 nucleotide bases, the window starts with the first available base, like in this example:

gi 612486902 gb AZHG01000336.1 window: 1-6666, SNP under positive selection at 1666

Here, although the target window was 10 kbp, the actual window is 6666 bp long.

In addition, if different SNPs are surrounded by the exact same window, only one single window will be extracted. The sequence id informs about all the SNP locations in this window, like in this case:

gi 612435299 gb AZHG01032030.1 window: 1-4331, SNP under positive selection at 1173, SNP under positive selection at 1448

Here, both ends are truncated. The sequence window spans the entire length of the contig, which is 4331 bp long.

The next step is to merge windows from the same contig with the

usage: MergeFasta.py [-h] [-f FASTA [FASTA …]] [-r REFERENCEFASTA]

Merges fasta files, removes duplicate sequences and merges the sequence-ids of duplicate sequences

optional arguments: -h, –help show this help message and exit -f FASTA [FASTA …], –fasta FASTA [FASTA …] fasta files to merge -r REFERENCEFASTA, –referencefasta REFERENCEFASTA fasta file of the genome/transcriptome that was used to extract the windows in the single fasta files.

The input fasta file(s) are the output files from the windows from all fasta files are merged if they were extracted from the same contig. So, for example, the following command merges windows that are putatively under positive selection with windows that are putatively under negative selection.

python MergeFasta.py
-f Guppy_SNPsPositiveSelection.fasta Guppy_SNPsNegativeSelection.fasta
-r GuppyGenome.fasta \

GuppyWindowsMerged.fasta

One merged window had, for example, the following fasta identifier:

gi 612477484 gb AZHG01009754.1 window with 10000 bp: 1-10000, SNP under negative selection at 5000, SNP under positive selection at 5000

THe window identifier informs about the sequence id; the length of the extracted and merged window, along with its start and end positions; and it informs about the location of putatively adaptive SNPs. To count the number of bases in the final fasta file with all merged windows, you can use the perl script CountBasesInFastaFile.pl.

Code

WindowsFromFasta.py

#!/usr/bin/env python

import re # to use regular expressions import argparse # To allow users setting flags in order to define what to filter for. from Bio import SeqIO from Bio.SeqRecord import SeqRecord

parser = argparse.ArgumentParser(description=”Extracts sequence windows from a fasta file based on sequence IDs and positions and writes them to a new fasta file”) parser.add_argument(“-f”,”–fasta”, help=”fasta file to extract from”) parser.add_argument(“-t”,”–text”, help=”Tab-delimited text file containing the sequence-id in the first column and the position of the targeted window-centers in the second column”) parser.add_argument(“-w”,”–windowsize”, help=”sum of the number of basepairs preceding and following the position that is specified in the text file. A window of 10 kb means that 5 kb shall precede and 5 kb shall follow the speciied position.”) parser.add_argument(“-p”,”–positionspecifier”, help=”string that specified the position which is provided in the second column of the text file. This could be for example ‘SNP’ or ‘windowcenter’ and will appear as comment following the sequence IDs in the output fasta files”) args=parser.parse_args()

def WindowsFromFasta(fasta,text,windowsize,positionspecifier): ‘’’ Extracts sequence windows from a fasta file based on sequence IDs and positions and writes them to a new fasta file ‘’’

    windowsize = int(windowsize)

    iterator = SeqIO.parse(fasta, "fasta") # Open the fasta file 
    fastadictionary = SeqIO.to_dict(iterator) # Create a dictionary from the fasta file

    # open the text file with the sequence IDs and the start and end positions
    textfile = open(text,'U')
    unique = {} # serves to connect Sequence IDs to fasta comments 

    for line in textfile:
        line= line.strip().split('\t') # read from a tab-delimited text file
        # a combination of the identifier, and the sequence position (the center of the window that shall be extracted).

        p = int(line[1])
        s = int(line[1])-windowsize/2# If this is negative, I want to set it to 1
        if s<1: s=1
        e = int(line[1])+windowsize/2 # An index that is too large doesn't give an error if it is used in slicing. So, if the string has no base at position 20000, it will print out the last base before that. However, I want to inform then what the last base is
        if e > int(len(fastadictionary[line[0]].seq)): e=int(len(fastadictionary[line[0]].seq))

        outname = line[0]

        # Remove duplicate windows but keep all positionspecifiers

        u = str(outname + str(s) + '-' + str(e))
        if u not in unique:
            unique[u] =  'window: ' + str(s) + '-' + str(e) + ', ' + positionspecifier + ' at ' + str(line[1]) # Create a comment for this record      
        else:
            existingcomment = unique[u]
            unique[u] =  existingcomment +  ', ' +positionspecifier + ' at ' + str(line[1]) # update comment for this record

    textfile.close()

    textfile = open(text,'U')
    unique2=[] # serves to check for duplicate windows

    for l in textfile:
        l= l.strip().split('\t') # read from a tab-delimited text file
        # a combination of the identifier, and the sequence position (the center of the window that shall be extracted).

        p = int(l[1])
        s = int(l[1])-windowsize/2# If this is negative, I want to set it to 1
        if s<1: s=1
        e = int(l[1])+windowsize/2 # An index that is too large doesn't give an error if it is used in slicing. So, if the string has no base at position 20000, it will print out the last base before that. However, I want to inform then what the last base is
        if e > int(len(fastadictionary[l[0]].seq)): e=int(len(fastadictionary[l[0]].seq))

        outname = l[0]

        # Print windows only once but with all positionspecifiers
        u = str(outname + str(s) + '-' + str(e))
        if u not in unique2:
            unique2.append(u)
            sequence_record=SeqRecord(fastadictionary[l[0]].seq[s-1:e],id = outname,description = unique[u])
            print sequence_record.format("fasta")

WindowsFromFasta(str(args.fasta),str(args.text),int(args.windowsize),str(args.positionspecifier))

MergeSequenceWindows.py

#!/usr/bin/env python

import re # to use regular expressions import argparse # To allow users setting flags in order to define what to filter for. from Bio import SeqIO from Bio.SeqRecord import SeqRecord

WindowIdentificationPattern = re.compile(‘.window: (\d+)-(\d+).’) # Extract the window start and end positions from the fasta descriptions parser = argparse.ArgumentParser(description=”Merges windows that were extracted from the same contig”) parser.add_argument(“-f”,”–fasta”, help=”fasta files to merge”,nargs=’+’) parser.add_argument(“-r”,”–referencefasta”, help=”fasta file of the genome/transcriptome that was used to extract the windows in the single fasta files”) args=parser.parse_args()

def MergeSequenceWindows(fasta,referencefasta): ‘’’ Merges windows that were extracted from the same contig ‘’’

    openreferencefasta = SeqIO.parse(referencefasta, "fasta") # Open the fasta file 
    referencefastadictionary = SeqIO.to_dict(openreferencefasta) # Create a dictionary from the fasta file

    uniquewindows = {} # serves to connect contigs to different extracted windows
    uniquedescriptions = {} # serves to connect contigs to different descriptions (location of SNPs or windowcenters)
    for f in fasta:
        iterator = SeqIO.parse(f, "fasta") # Open the fasta file 
        #fastadictionary = SeqIO.to_dict(iterator) # Create a dictionary from the fasta file
        records = list(iterator)
        for sequence in records:
            #print sequence.description
            windowidentification = WindowIdentificationPattern.search(str(sequence.description))
            actualwindowstart = windowidentification.group(1)
            actualwindowend = windowidentification.group(2)

            # Get the single SNPs, or windowcenters in a list (will be added to the sequence description)
            l= str(sequence.description).strip().split(',')
            actualdescriptions = list(l[1:len(l)])
            if sequence.id not in uniquewindows:
                uniquewindows[sequence.id] = [actualwindowstart,actualwindowend] # This saves the start and end of the window
                uniquedescriptions[sequence.id] = actualdescriptions
            else:
                existingwindowstart = uniquewindows[sequence.id][0]
                existingwindowend = uniquewindows[sequence.id][1]
                minimumwindowstart = min([int(existingwindowstart),int(actualwindowstart)])
                maximumwindowend = max([int(existingwindowend),int(actualwindowend)])
                uniquewindows[sequence.id] = [int(minimumwindowstart),int(maximumwindowend)]

                existingdescriptions = uniquedescriptions[sequence.id]
                existingdescriptions.extend(actualdescriptions)
                uniquedescriptions[sequence.id] = existingdescriptions


    for uniqueid in uniquewindows:
        actualsequence = referencefastadictionary[uniqueid].seq
        actualwindowstart = int(uniquewindows[uniqueid][0])
        actualwindowend = int(uniquewindows[uniqueid][1])
        actualsequencewindow=actualsequence[actualwindowstart-1:actualwindowend-1]
        windowlength = actualwindowend-actualwindowstart+1
        actualdescriptions = str('window with ' + str(windowlength) + ' bp: ' + str(actualwindowstart) + '-' + str(actualwindowend) + ', ' + ','.join(uniquedescriptions[uniqueid]))
        sequence_record=SeqRecord(actualsequencewindow,id = str(uniqueid),description = actualdescriptions)                
        print sequence_record.format("fasta")               

MergeSequenceWindows(args.fasta,str(args.referencefasta))