Thursday, 7 July 2016

Inferring mRNA from Protein

This problem was fairly straightforward, so I thought I'd try my hand at making my code a bit more readable by dividing it into functions. I have done this many times before during courses, but it has been a while since then.

For this problem we are asked to find all possible mRNA-sequences that a given protein sequence may have come from. Because an amino acid can be encoded by several different codons, the number of sequences that the protein may have come from rapidly gets very large as the protein sequence gets longer, so we are asked for the answer modulo 1000000.

For every possible sequence (s) that encodes the given protein, there are three possible stop codons. This means that we get three variants (s-stop1, s-stop2, s-stop3) out of each possible sequence. So, the total amount of possible sequences becomes three times the number of codons that each amino acid in the protein could be encoded by (lets call it the frequency of the amino acid):

3*Freq(aa1)*Freq(aa2)... and so on

The program I wrote looks like this:


codons = {                                                     
    'UUU': 'F',     'CUU': 'L',     'AUU': 'I',     'GUU': 'V',
    'UUC': 'F',     'CUC': 'L',     'AUC': 'I',     'GUC': 'V',
    'UUA': 'L',     'CUA': 'L',     'AUA': 'I',     'GUA': 'V',
    'UUG': 'L',     'CUG': 'L',     'AUG': 'M',     'GUG': 'V',
    'UCU': 'S',     'CCU': 'P',     'ACU': 'T',     'GCU': 'A',
    'UCC': 'S',     'CCC': 'P',     'ACC': 'T',     'GCC': 'A',
    'UCA': 'S',     'CCA': 'P',     'ACA': 'T',     'GCA': 'A',
    'UCG': 'S',     'CCG': 'P',     'ACG': 'T',     'GCG': 'A',
    'UAU': 'Y',     'CAU': 'H',     'AAU': 'N',     'GAU': 'D',
    'UAC': 'Y',     'CAC': 'H',     'AAC': 'N',     'GAC': 'D',
    'UAA': 'Stop',  'CAA': 'Q',     'AAA': 'K',     'GAA': 'E',
    'UAG': 'Stop',  'CAG': 'Q',     'AAG': 'K',     'GAG': 'E',
    'UGU': 'C',     'CGU': 'R',     'AGU': 'S',     'GGU': 'G',
    'UGC': 'C',     'CGC': 'R',     'AGC': 'S',     'GGC': 'G',
    'UGA': 'Stop',  'CGA': 'R',     'AGA': 'R',     'GGA': 'G',
    'UGG': 'W',     'CGG': 'R',     'AGG': 'R',     'GGG': 'G' 
}                                                              

def codon_frequence():                                         
    frequence = {}                                             
    for k, v in codons.items():                                
        if v not in frequence:                                 
            frequence[v] = 0                                   
        frequence[v] += 1                                      
    return (frequence)                                         


def possible_sequences(sequence):                              
    f = codon_frequence()                                      
    n = f['Stop']                                              
    for aa in sequence:                                        
        n *= f[aa]                                             
    return (n % 1000000)                                       


with open('sampledata.txt', 'r') as f:                         
    prot_seq = f.read().strip()                                

print(possible_sequences(prot_seq))                            

No comments:

Post a Comment