Sara does Bioinformatics: Finding a Protein Motif

Wednesday, 6 July 2016

Finding a Protein Motif

In this problem we are given a set of UniProt Protein Database access IDs and asked to find the positions of the N-glycosylation motif (if any) in the corresponding protein sequences. Each sequence can be accessed in FASTA format at http://www.uniprot.org/uniprot/uniprot_id.fasta, with uniprot_id replaced by the given ID.
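As a small sketch of the substitution described above, the helper below builds the URL for one ID. The function name fasta_url and the example IDs are only for illustration. Some datasets list IDs with a name suffix (for example P07204_TRBM_HUMAN); the sketch assumes that the part before the first underscore is the accession the URL expects, which may not hold for every ID format.

```python
def fasta_url(uniprot_id):
    """Build the FASTA URL for one ID, using the URL pattern quoted in the post."""
    # Assumption: for IDs such as 'P07204_TRBM_HUMAN', the text before the
    # first underscore is the accession. IDs without an underscore are unchanged.
    accession = uniprot_id.split('_')[0]
    return 'http://www.uniprot.org/uniprot/' + accession + '.fasta'


# Illustrative IDs only
print(fasta_url('B5ZC00'))
print(fasta_url('P07204_TRBM_HUMAN'))
```

The original ID should still be kept for printing the answer, since only the URL needs the shortened form.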

To solve this problem, my program needed to collect the sequences from UniProt and write them to a FASTA file that could then be searched for the motif. Because the motif can look slightly different in different proteins, I opted to use a regular expression to search for it. The motif can be written as:

N{P}[ST]{P}

This means that position 1 is always N, positions 2 and 4 are any amino acid except P, and position 3 is either S or T. Written as a regular expression pattern, this becomes:

(N[^P][ST][^P])
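The translation from the motif shorthand to a regular expression is mechanical for this motif: {P} becomes [^P] and [ST] stays as it is. A minimal sketch, assuming the motif only uses these two kinds of brackets (the function name motif_to_regex is made up for this example):

```python
def motif_to_regex(motif):
    """Translate the shorthand: {X} (anything except X) becomes [^X]; [XY] is kept."""
    return motif.replace('{', '[^').replace('}', ']')


print(motif_to_regex('N{P}[ST]{P}'))  # N[^P][ST][^P]
```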

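However, a regular expression search only returns non-overlapping matches: after one match is found, the search resumes after its last character, so a second motif that starts inside the first is missed. To include overlapping motifs as well, we need to wrap the pattern in a lookahead, like this: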

(?=(N[^P][ST][^P]))
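The difference between the two patterns is easiest to see on a short sequence. The sketch below uses a made-up sequence, NNTSA, in which two motifs overlap (NNTS at position 1 and NTSA at position 2), and compares what each pattern finds:

```python
import re

plain = re.compile(r'N[^P][ST][^P]')
lookahead = re.compile(r'(?=(N[^P][ST][^P]))')

sequence = 'NNTSA'  # made-up sequence with two overlapping motifs

# 1-based start positions found by each pattern
plain_positions = [m.start() + 1 for m in plain.finditer(sequence)]
lookahead_positions = [m.start() + 1 for m in lookahead.finditer(sequence)]

print(plain_positions)      # [1]
print(lookahead_positions)  # [1, 2]
```

The lookahead itself matches zero characters, so the search advances one position at a time and every starting position gets tested.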

The final code can be seen below.

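from urllib.request import urlopen
from Bio import SeqIO
import re

# Read the UniProt IDs, one per line, skipping blank lines
ID = []
with open('sampledata.txt') as f:
    for line in f:
        if line.strip():
            ID.append(line.strip())

# Download each sequence and collect them all in a single FASTA file.
# Mode 'w' overwrites the file, so records from earlier runs are not kept.
with open('seq_file.fasta', 'w') as text_file:
    for i in range(len(ID)):
        URL = 'http://www.uniprot.org/uniprot/' + ID[i] + '.fasta'
        data = urlopen(URL)
        fasta = data.read().decode('utf-8', 'ignore')
        text_file.write(fasta)

# The lookahead makes overlapping motifs count as separate matches
motifs = re.compile(r'(?=(N[^P][ST][^P]))')

# Records are read back in download order, so count indexes the matching ID
count = 0
with open('seq_file.fasta', 'r') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        sequence = record.seq
        positions = []
        for m in re.finditer(motifs, str(sequence)):
            positions.append(m.start() + 1)  # report 1-based positions
        if len(positions) > 0:
            print(ID[count])
            print(' '.join(map(str, positions)))
        count += 1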
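The program above pairs each record with its ID by counting records in the FASTA file, which only works if the file holds exactly one record per ID, in the same order. As an alternative sketch (not the method used above), each ID can be kept together with its own FASTA text so no counter is needed. The function names and the example records are made up for illustration, and the FASTA text is parsed by hand here rather than with Biopython.

```python
import re

MOTIF = re.compile(r'(?=(N[^P][ST][^P]))')


def sequence_from_fasta(fasta_text):
    """Join the sequence lines of a single-record FASTA string, dropping the '>' header."""
    lines = fasta_text.strip().splitlines()
    return ''.join(line.strip() for line in lines if not line.startswith('>'))


def motif_report(fasta_by_id):
    """Map each ID to its 1-based motif positions, omitting IDs with no match."""
    report = {}
    for uniprot_id, fasta_text in fasta_by_id.items():
        seq = sequence_from_fasta(fasta_text)
        positions = [m.start() + 1 for m in MOTIF.finditer(seq)]
        if positions:
            report[uniprot_id] = positions
    return report


# Made-up records for illustration, not real UniProt entries
example = {
    'ID_ONE': '>sp|ID_ONE|made up header\nMKNNTSA\nGG\n',
    'ID_TWO': '>sp|ID_TWO|made up header\nMKPLLA\n',
}
for uniprot_id, positions in motif_report(example).items():
    print(uniprot_id)
    print(' '.join(map(str, positions)))
```

In a full program, the dictionary would be filled with the text downloaded for each ID instead of the made-up records.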

5 comments:

Unknown, 3 March 2019 at 20:15
Hi Sara, your code looks straightforward and understandable. However, it only works for the sample dataset on Rosalind...for the downloaded dataset, it gives a wrong answer.

Would you figure out what is wrong with the code?
Ana, 24 January 2022 at 08:34
It has worked both times I have tried it! Thanks