Sample dataset:
ACGATACAA
0.129 0.287 0.423 0.476 0.641 0.742 0.783
Expected output:
-5.737 -5.217 -5.263 -5.360 -5.958 -6.628 -7.009
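The probability, P, of the subsequence occurring in a sequence with GC content x can be simplified and written as follows, where AT is the total number of A and T in the subsequence and GC is the total number of G and C:
P = ((1 - x) / 2)^AT * (x / 2)^GC
The expected output is the base-10 logarithm of P for each GC content in the array.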
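As a sanity check, here is a worked calculation for the first value of the sample dataset. It assumes the formula implied by the code in this post: each A or T has probability (1 - x)/2, each G or C has probability x/2, and the answer is the base-10 logarithm of their product. The sample string ACGATACAA contains 6 A/T bases and 3 G/C bases.

```latex
\begin{align*}
\log_{10} P &= AT \cdot \log_{10}\!\left(\frac{1 - x}{2}\right)
             + GC \cdot \log_{10}\!\left(\frac{x}{2}\right) \\
% Sample values: AT = 6, GC = 3, x = 0.129
            &= 6 \log_{10}(0.4355) + 3 \log_{10}(0.0645) \\
            &\approx 6(-0.3610) + 3(-1.1904) \\
            &\approx -5.737
\end{align*}
```

This agrees with the first number in the expected output.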
To solve this problem, we need a program that reads the array of GC contents from the input and counts AT and GC in the subsequence. Then it is a simple matter of iterating over the GC contents and calculating the log probability for each one with the equation above. Here is my final version:
import math

AT = 0  # number of A and T bases in the DNA string
GC = 0  # number of G and C bases in the DNA string
GC_contents = []

with open('sampledata.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line[0] not in 'ATGC':
            # The line of numbers holds the array of GC contents.
            GC_contents = [float(x) for x in line.split()]
        else:
            # The DNA string: count its A/T and G/C bases.
            for base in line:
                if base in 'AT':
                    AT += 1
                elif base in 'GC':
                    GC += 1

probabilities = []
for x in GC_contents:
    # log10 of P = ((1 - x) / 2)**AT * (x / 2)**GC
    prob = math.log10(((1 - x) / 2)**AT * (x / 2)**GC)
    probabilities.append('%0.3f' % prob)
print(*probabilities, sep=' ')
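One possible refinement, offered as a sketch rather than a required change: the script above computes P first and takes its logarithm afterwards. For a long string and a very small or very large GC content, P can underflow to 0.0, and math.log10 then raises a ValueError. Summing the logarithms avoids this. The function name log_probabilities is my own choice for illustration and does not appear in the original script.

```python
import math


def log_probabilities(dna, gc_contents):
    """Return log10 of the probability of `dna` for each GC content."""
    at = sum(dna.count(base) for base in "AT")
    gc = sum(dna.count(base) for base in "GC")
    # log10(P) = AT * log10((1 - x) / 2) + GC * log10(x / 2)
    # Working in log space avoids forming the tiny product P itself.
    return [
        at * math.log10((1 - x) / 2) + gc * math.log10(x / 2)
        for x in gc_contents
    ]


if __name__ == "__main__":
    sample = [0.129, 0.287, 0.423, 0.476, 0.641, 0.742, 0.783]
    result = log_probabilities("ACGATACAA", sample)
    print(*("%0.3f" % value for value in result))
```

Note that a GC content of exactly 0 or 1 still fails in either version, because one of the per-base probabilities is then zero and its logarithm is undefined.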
Thank you for the information. I've been struggling to solve this for weeks, maybe because the problem is not explained well.
Many thanks