To solve this problem, my program needed to collect the sequences from uniprot, write them to a FASTA file that could then be search for the motif. Because the motif can look slightly different in different proteins, I opted to use regular expression to search for it. The motif can be written as:
N{P}[ST]{P}
This means that position 1 always is N, position 2 and 4 are any amino acid except P, and position 3 is either A or T. Writing this as a pattern for searching using regular expression we get:
(N[^P][ST][^P])
However, regular expression does not automatically include overlapping patterns, so to include these as well we need to write the pattern like this:
(?=(N[^P][ST][^P]))
The final code can be seen below.
from urllib.request import urlopen
from Bio import SeqIO
import re
ID = []
with open('sampledata.txt') as f:
for line in f:
ID.append(line.strip())
for i in range(len(ID)):
URL = 'http://www.uniprot.org/uniprot/' + ID[i] + '.fasta'
data = urlopen(URL)
fasta = data.read().decode('utf-8', 'ignore')
with open('seq_file.fasta', 'a') as text_file:
text_file.write(fasta)
handle = open('seq_file.fasta', 'r')
motifs = re.compile(r'(?=(N[^P][ST][^P]))')
count = 0
for record in SeqIO.parse(handle, 'fasta'):
sequence = record.seq
positions = []
for m in re.finditer(motifs, str(sequence)):
positions.append(m.start() + 1)
if len(positions) > 0:
print(ID[count])
print(' '.join(map(str, positions)))
count += 1
Hi Sara, your code looks straightforward and understandable. However, it only works for the sample dataset on Rosalind...for the downloaded dataset, it gives a wrong answer.
ReplyDeleteWould you figure out what is wrong with the code?
Because the function finditer only finds non-overlapping mottifs.
DeleteActually, the code is working for overlapping motifs, due to the lookahead assertionr{?=(N[^P][ST][^P]))}, but not working for downloaded dataset.I also do not know why.
DeleteIt worked today for me.
DeleteIt has worked both times I have tried it! Thanks
ReplyDelete