Friday 22 July 2016

Introduction to Random Strings

We know that genomes are not just a random collection of A,T,C and G, and that across a genome there are loads of different motifs, some of which are similar in many different species. But since the human genome is huge, we need to account for the possibility that the subsequence we are looking at was created at random. In this problem we look at a simplified way to calculate the probability that a certain subsequence occurs randomly. We are given a substring of at most 80 bp and an array containing the GC-content of up to 20 random strings, and we are asked to find log10 of the probability that the substring will match the random strings exactly, and return the result in an array of equal length of the given array.

Sample dataset:
ACGATACAA
0.129 0.287 0.423 0.476 0.641 0.742 0.783

Expected output:
-5.737 -5.217 -5.263 -5.360 -5.958 -6.628 -7.009

The probability, P, of the subsequence occuring in a sequence of a GC content x, can be simplified and written as follows, where AC is the total nr of A and C in the subsequence and GC is the total number of G and C:

To solve this problem, we need to write a program that extracts the GC contents of the array and counts AC and GC of the subsequence. Then it's just a simple matter of iterating over the list of GC contents to calculate the probabilities using the above equation. Here is my final version:

import math
AT = 0
GC = 0
with open('sampledata.txt', 'r') as f:
    for line in f:
        if line[0] != 'A' and line[0] != 'T' and line[0] != 'G' and line[
                0] != 'C':
            numbers = line.split()
            GC_contents = [float(x) for x in numbers]
        for i in line:
            if i == 'A' or i == 'T':
                AT += 1
            elif i == 'G' or i == 'C':
                GC += 1

probabilities = []
for j in range(len(GC_contents)):
    prob = math.log10((((1 - GC_contents[j]) / 2)**AT) * (GC_contents[j] / 2)
                      **GC)
    probabilities.append('%0.3f' % prob)
print(*probabilities, sep=' ')

2 comments:

  1. Thank you for your information. I've been confused to solve it for weeks. Maybe because the problem is not explained well.

    ReplyDelete