Sara does Bioinformatics: Genome Assembly as Shortest Superstring

Before reading any further in this post, I feel that I must warn you: my solution to this problem is very inefficient, not to say very ugly and probably hard to follow for anyone who isn't me. I know, great programming practice right here! I am still a bit proud of the solution though, because even though it might be the most hideous thing you have seen this month (year?) I managed to come up with it almost entirely without looking things up! And, hey, at least it managed to finish within the five minute limit! So here it goes... Please try not to laugh too hard at me...

Problem Description:
For a set of sequences in FASTA format, we are asked to find the shortest superstring that they make up. The criteria on the sequences is that there exists a unique way to reconstruct the superstring from them by putting together pairs of reads that overlap by more than half their length.

Sample Dataset:
>Rosalind_56
ATTAGACCTG
>Rosalind_57
CCTGCCGGAA
>Rosalind_58
AGACCTGCCG
>Rosalind_59

GCCGGAATAC

Expected output:
ATTAGACCTGCCGGAATAC

Final code:
from Bio import SeqIO
reads = []
handle = open('sampledata.fasta', 'r')
for record in SeqIO.parse(handle, 'fasta'):
reads.append(str(record.seq))
handle.close()

overlaps = []
overlapping = []
for i in range(len(reads)):
curr_read = reads[i]
for j in range(len(curr_read) // 2, len(curr_read)):
curr_suffix = curr_read[-(j + 1):]
for k in range(len(reads)):
curr_comp = reads[k]
for l in range(len(curr_comp) // 2, len(curr_comp)):
curr_prefix = curr_comp[:l]
if curr_suffix == curr_prefix:
overlaps.append(k)
overlapping.append([len(curr_suffix), i, k])

s = set(overlaps)
first_read = ''
count = len(overlapping)
for m in range(len(overlapping)):
suf_index = overlapping[m][1]
if suf_index not in s: #find first read and initialise new str
first_read = suf_index
new_str = reads[overlapping[m][1]] + reads[overlapping[m][2]][
overlapping[m][0]:]
count -= 1
pref_index = overlapping[m][2]
while count > 0: #when the first read is found, add
for n in range(len(overlapping)): #the remaining in the correct order
if pref_index == overlapping[n][1]:
new_str += reads[overlapping[n][2]][overlapping[n][0]:]
pref_index = overlapping[n][2]
count -= 1

print(new_str)

Yeah, so... Sorry about that.

Sara does Bioinformatics

Wednesday, 20 July 2016

Genome Assembly as Shortest Superstring

No comments:

Post a Comment