Tuesday 16 August 2016

Maximum Matchings and RNA Secondary Structures

This problem is very similar to the problem presented in "Perfect Matchings and RNA Secondary Structures". The difference is that in this problem we don't have the same number of A as U and G as C, so we don't get any perfect matchings. Instead we are given the task to find the total possible number of maximum matchings for a given RNA string.

Sample Dataset
>Rosalind_92
AUGCUUC

Expected Output
6

As in "Perfect Matchings and RNA Secondary Structures" the programming is very simple and it all comes down to the maths. We can simplify the problem by looking at AU and GC separately, and can describe the number of maximum matches for each part as max(A,U)!/(max(A,U)-min(A,U))! and max(G,C)!/(max(G,C)-min(G,C))!, respectively. To get the total number of maximum matches, simply multiply the answers.

For the first dataset I tried, I ran the program using Python 3.5 and got the result 434412400153932848464251158517855537428692992. This was deemed incorrect by Rosalind. After seeing a tip online, I ran the program using Python 2.7, and got 434412400153932864602133769790674698240000000 instead. So I tried my program with a new dataset and ran it with Python 2.7 and this time my answer was accepted.

Here follows the code I wrote to solve this problem (works in both 2.7 and 3.5):

from Bio import SeqIO
from math import factorial
sequence = ''
with open('sampledata.fasta', 'r') as f:
    for record in SeqIO.parse(f, 'fasta'):
        sequence = str(record.seq)

A, U, G, C = 0, 0, 0, 0
for nt in sequence:
    if nt == 'A':
        A += 1
    elif nt == 'U':
        U += 1
    elif nt == 'G':
        G += 1
    elif nt == 'C':
        C += 1

AU = factorial(max(A, U)) / factorial(max(A, U) - min(A, U))
GC = factorial(max(G, C)) / factorial(max(G, C) - min(G, C))
print(int(AU * GC))

1 comment:

  1. Hi Sara I'm working to this problem currently and I understand your code but I don't understand the theory behind this. Do you have something that explains the formula you used or could you explain to me the reasoning behind? Do you find this solution in some book?

    ReplyDelete