My first thought was to use Biopython for this, because surely there is a built in function for this sort of problem. This turned out not to be the case, but in the Biopython documentation there is a description on how to identify open reading frames. I tried this, and managed to print some protein sequences with it, but they were completely wrong compared to the ones shown in the Rosalind example. I tried to fix the code, but finally I gave up and decided to write my own code instead, using regular expressions.
The first thing the program needs to do after reading the given FASTA-file is to save the forward strand and then also pick out the reverse complement of it. Then we need to define the pattern that should be searched for using regular expression. In this case, we want to find the parts of the sequence that start with a start codon and ends with a stop codon. For regex this can be written as:
(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))
The program should then look for this pattern in both the forward strand and the reverse complement. If an ORF is found it should be translated into the corresponding protein sequence and if that one is unique it should be printed. The following code is what I ended up with. Again, not the prettiest program ever written, but I managed to write it within an hour.
import re
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
record = SeqIO.read('sampledata.fasta', 'fasta')
pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')
frw_seq = record.seq
rev_seq = frw_seq.reverse_complement()
sequences = []
for m in re.findall(pattern, str(frw_seq)):
dna_seq = Seq(m, generic_dna)
prot_seq = dna_seq.translate()
if prot_seq not in sequences:
sequences.append(prot_seq)
for n in re.findall(pattern, str(rev_seq)):
rev_dna_seq = Seq(n, generic_dna)
rev_prot_seq = rev_dna_seq.translate()
if rev_prot_seq not in sequences:
sequences.append(rev_prot_seq)
for i, s in enumerate(sequences):
print(s)
No comments:
Post a Comment