Tuesday, 9 August 2016

Finding a Shared Spliced Motif

In this problem we are asked to find the longest common subsequence of two DNA-sequences, s and t. As opposed to a substring, a subsequence does not need to occur contiguously in s and t. To solve this problem it might be a good idea to have a look at the Wikipedia page on the longest common subsequence problem.

Sample Dataset
>Rosalind_23
AACCTTGG
>Rosalind_64
ACACTGTGA

Expected Output
AACTGG (or any other if there are multiple subsequences of the same length)

The following is a solution to the problem using dynamic programming:

from Bio import SeqIO
sequences = []
handle = open('sampledata.fasta', 'r')
for record in SeqIO.parse(handle, 'fasta'):
    sequences.append(str(record.seq))
s = sequences[0]
t = sequences[1]

lengths = [[0 for j in range(len(t) + 1)] for i in range(len(s) + 1)]
#creates array of len(s) containing arrays of len(t) filled with 0
for i, x in enumerate(s):
    for j, y in enumerate(t):
        if x == y:
            lengths[i + 1][j + 1] = lengths[i][j] + 1
        else:
            lengths[i + 1][j + 1] = max(lengths[i + 1][j], lengths[i][j + 1])

spliced_motif = ''
x, y = len(s), len(t)
while x * y != 0:
    if lengths[x][y] == lengths[x - 1][y]:
        x -= 1
    elif lengths[x][y] == lengths[x][y - 1]:
        y -= 1
    else:
        spliced_motif = s[x - 1] + spliced_motif
        x -= 1
        y -= 1
print(spliced_motif)

No comments:

Post a Comment