Sara does Bioinformatics: Expected Number of Restriction Sites

This problem is very similar to "Introduction to Random Strings". As in that problem, we are given a string, s, and an array, A, containing some GC contents. In this case, however, we are also given an integer, n, representing the length of a second string, t. For each GC content in A, t is a random string of length n generated with that GC content, and we are asked to find the expected number of times s appears as a substring of t. What we need to realise to solve this problem is that there are n-len(s)+1 positions in t at which s could start. The probability of randomly forming s is the same at each of these positions, so by linearity of expectation the expected number of occurrences is simply the sum of these individual probabilities, i.e. that probability multiplied by n-len(s)+1. Note that the result is an expected count rather than a probability, so it can be greater than 1.
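To convince myself that adding the per-position probabilities really gives the expected count (even when occurrences of s can overlap), here is a small sketch that compares the formula against a brute-force enumeration of every possible t. The function names are my own for illustration, and n is kept small because the enumeration visits 4**n strings.

```python
from itertools import product

def symbol_probs(gc):
    """Per-nucleotide probabilities for a given GC content."""
    return {'A': (1 - gc) / 2, 'T': (1 - gc) / 2, 'G': gc / 2, 'C': gc / 2}

def expected_by_formula(n, s, gc):
    """(n - len(s) + 1) start positions, each matching s with the same probability."""
    probs = symbol_probs(gc)
    p_match = 1.0
    for nt in s:
        p_match *= probs[nt]
    return (n - len(s) + 1) * p_match

def expected_by_enumeration(n, s, gc):
    """Sum over all 4**n strings t of P(t) * (occurrences of s in t, overlaps included)."""
    probs = symbol_probs(gc)
    total = 0.0
    for letters in product('ACGT', repeat=n):
        p_t = 1.0
        for nt in letters:
            p_t *= probs[nt]
        t = ''.join(letters)
        # Count every start position where s matches, so overlapping hits are included
        count = sum(t.startswith(s, i) for i in range(n - len(s) + 1))
        total += p_t * count
    return total

# The two values should agree up to floating-point error
print(expected_by_formula(5, 'AG', 0.25))
print(expected_by_enumeration(5, 'AG', 0.25))
```

The enumeration is only a sanity check. For real datasets the formula is all that is needed.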

Sample Dataset
10
AG
0.25 0.5 0.75

Expected Output
0.422 0.563 0.422
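As a worked check of the sample, the arithmetic can be written out step by step. This is just a sketch with the sample values hard-coded. The variable names are my own.

```python
n, s = 10, 'AG'
positions = n - len(s) + 1   # number of places in t where s could start

results = []
for gc in (0.25, 0.5, 0.75):
    p_a = (1 - gc) / 2       # probability of drawing an A (same as for T)
    p_g = gc / 2             # probability of drawing a G (same as for C)
    results.append(positions * p_a * p_g)

print(positions)  # 9
print(results)    # [0.421875, 0.5625, 0.421875]
```

Rounded to three decimal places these values correspond to the expected output above. Note that the middle value, 0.5625, sits exactly halfway between 0.562 and 0.563.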

The following is the code I wrote to solve the problem. When I wrote it I became aware that there is a difference in how Python 3 and Python 2 handle rounding of 0.5. In Python 2, round() rounds a half away from zero (so 0.5 becomes 1), but in Python 3 it rounds a half to the nearest even number (so 0.5 becomes 0, while 1.5 becomes 2). On an earlier problem I had to run my program using Python 2 because the answer I got from Python 3 wasn't accepted by Rosalind. In this case, however, the data set I got yielded the same answer regardless of which version I used (suggesting that no cases of having to round a 5 occurred in this data set). (Note: if the following code is to be run with Python 2, the formatting of the output needs to be rewritten for it to work, since the final print call uses the Python 3 print function.)

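# Read the dataset: n on the first line, s on the second, the GC contents on the third
data = []
with open('rosalind_eval.txt', 'r') as f:
    for line in f:
        data.append(line.strip('\n'))
n = int(data[0])
s = data[1]
A = [float(x) for x in data[2].split()]

# Count how many nucleotides in s are A/T and how many are G/C
AT, GC = 0, 0
for nt in s:
    if nt == 'A' or nt == 'T':
        AT += 1
    elif nt == 'G' or nt == 'C':
        GC += 1

# For a GC content j, each of A and T has probability (1 - j)/2 and each of G and C has j/2.
# The probability of forming s is multiplied by the number of start positions, n - len(s) + 1.
B = [None]*len(A)
for i, j in enumerate(A):
    P = (((1 - j)/2)**AT)*((j/2)**GC)*(n - len(s)+1)
    B[i] = '%0.3f' % P
print(*B, sep=' ')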
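
The rounding behaviour mentioned above can be seen directly in Python 3. The sketch below also shows one possible way to force halves to round up using the standard decimal module. The helper format_half_up is hypothetical (it is not part of my solution), and I have not confirmed which rounding Rosalind actually expects.

```python
from decimal import Decimal, ROUND_HALF_UP

# Python 3's built-in round() sends exact halves to the nearest even integer
print(round(0.5), round(1.5), round(2.5))  # 0 2 2

# '%0.3f' formatting in Python 3 also rounds an exact tie to the even digit
print('%0.3f' % 0.5625)  # 0.562

def format_half_up(value, places=3):
    """Hypothetical helper: format value with halves always rounded up."""
    quantum = Decimal(10) ** -places   # Decimal('0.001') when places=3
    # Going through str(value) uses the short decimal repr of the float
    return str(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))

print(format_half_up(0.5625))    # 0.563
print(format_half_up(0.421875))  # 0.422
```

With the sample dataset the middle value is exactly 0.5625, so the '%0.3f' formatting prints 0.562 in Python 3 while the half-up helper gives the 0.563 shown in the expected output.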


Thursday, 25 August 2016
