Wednesday 6 July 2016

Finding a Protein Motif

In this problem we are given a set of UniProt Protein Database access IDs and asked to find the positions of the N-glycosylation motif (if any) in the corresponding protein sequences. The sequences can be accessed in FASTA format through the following link http://www.uniprot.org/uniprot/uniprot_id.fasta, by substituting uniprot_id with the given IDs.

To solve this problem, my program needed to collect the sequences from uniprot, write them to a FASTA file that could then be search for the motif. Because the motif can look slightly different in different proteins, I opted to use regular expression to search for it. The motif can be written as:

N{P}[ST]{P}

This means that position 1 always is N, position 2 and 4 are any amino acid except P, and position 3 is either A or T. Writing this as a pattern for searching using regular expression we get:

(N[^P][ST][^P])

However, regular expression does not automatically include overlapping patterns, so to include these as well we need to write the pattern like this:

(?=(N[^P][ST][^P]))

The final code can be seen below.


from urllib.request import urlopen                            
from Bio import SeqIO                                         
import re                                                     
ID = []                                                       
with open('sampledata.txt') as f:                             
    for line in f:                                            
        ID.append(line.strip())                               
                                                              
for i in range(len(ID)):                                      
    URL = 'http://www.uniprot.org/uniprot/' + ID[i] + '.fasta'
    data = urlopen(URL)                                       
    fasta = data.read().decode('utf-8', 'ignore')             
    with open('seq_file.fasta', 'a') as text_file:            
        text_file.write(fasta)                                
                                                              
handle = open('seq_file.fasta', 'r')                          
motifs = re.compile(r'(?=(N[^P][ST][^P]))')                   
count = 0                                                     
for record in SeqIO.parse(handle, 'fasta'):                   
    sequence = record.seq                                     
    positions = []                                            
    for m in re.finditer(motifs, str(sequence)):              
                positions.append(m.start() + 1)                               
    if len(positions) > 0:                                    
        print(ID[count])                                      
        print(' '.join(map(str, positions)))                  
    count += 1                                                

5 comments:

  1. Hi Sara, your code looks straightforward and understandable. However, it only works for the sample dataset on Rosalind...for the downloaded dataset, it gives a wrong answer.

    Would you figure out what is wrong with the code?

    ReplyDelete
    Replies
    1. Because the function finditer only finds non-overlapping mottifs.

      Delete
    2. Actually, the code is working for overlapping motifs, due to the lookahead assertionr{?=(N[^P][ST][^P]))}, but not working for downloaded dataset.I also do not know why.

      Delete
  2. It has worked both times I have tried it! Thanks

    ReplyDelete