Monday, 8 August 2016

Speeding Up Motif Finding

In this problem we are looking at how to speed up motif finding using the Knuth-Morris-Pratt algorithm. A good explanation of the algorithm can be found here. We are given a DNA-string and are expected to return the failure array for the sting.

Sample Dataset
>Rosalind_87
CAGCATGGTATCACAGCAGAG

Expected Output
0 0 0 1 2 0 0 0 0 0 0 1 2 1 2 3 4 5 3 0 0

The following code is what I ended up with:

from Bio import SeqIO
record = SeqIO.read('sampledata.fasta', 'fasta')
sequence = list(record.seq)

F_array = [0] * len(sequence)
k = 0
for i in range(2, len(sequence) + 1):
    while k > 0 and sequence[k] != sequence[i - 1]:
        k = F_array[k - 1]
    if sequence[k] == sequence[i - 1]:
        k += 1
    F_array[i - 1] = k

with open('array.txt', 'w') as answer:
    answer.write(' '.join(map(str, F_array)))

No comments:

Post a Comment