Friday 8 July 2016

Open Reading Frames

This time we are asked to find all the open reading frames of a DNA-sequence and to translate these into the different proteins they encode. Only unique proteins should be printed and we need to remember that there are in total six different reading frames ( 3 for each DNA-strand).

My first thought was to use Biopython for this, because surely there is a built in function for this sort of problem. This turned out not to be the case, but in the Biopython documentation there is a description on how to identify open reading frames. I tried this, and managed to print some protein sequences with it, but they were completely wrong compared to the ones shown in the Rosalind example. I tried to fix the code, but finally I gave up and decided to write my own code instead, using regular expressions.

The first thing the program needs to do after reading the given FASTA-file is to save the forward strand and then also pick out the reverse complement of it. Then we need to define the pattern that should be searched for using regular expression. In this case, we want to find the parts of the sequence that start with a start codon and ends with a stop codon. For regex this can be written as:

(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))

The program should then look for this pattern in both the forward strand and the reverse complement. If an ORF is found it should be translated into the corresponding protein sequence and if that one is unique it should be printed. The following code is what I ended up with. Again, not the prettiest program ever written, but I managed to write it within an hour.

import re                                                 
from Bio import SeqIO                                     
from Bio.Seq import Seq                                   
from Bio.Alphabet import generic_dna                      

record = SeqIO.read('sampledata.fasta', 'fasta')          
pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')
frw_seq = record.seq                                      
rev_seq = frw_seq.reverse_complement()                    
sequences = []                                            

for m in re.findall(pattern, str(frw_seq)):               
    dna_seq = Seq(m, generic_dna)                         
    prot_seq = dna_seq.translate()                        
    if prot_seq not in sequences:                         
        sequences.append(prot_seq)                        
for n in re.findall(pattern, str(rev_seq)):               
    rev_dna_seq = Seq(n, generic_dna)                     
    rev_prot_seq = rev_dna_seq.translate()                
    if rev_prot_seq not in sequences:                     
        sequences.append(rev_prot_seq)                    

for i, s in enumerate(sequences):                         
    print(s)                                              

No comments:

Post a Comment