For a given collection of sequences in FASTA-format, we are given the task to print the adjacency list of the overlap graph of the sequences with the overlap length of 3 bp. To do this, we need to compare the suffixes of all the sequences to the prefixes. When a match is found the ID of the two sequences should be printed (the ID of the sequence containing the suffix should be printed on the left and the other one should be printed on the right). To avoid directed loops in the overlap graph, we should not print any sequences that have a suffix that matches its own prefix.
Below follows my somewhat messy but working code:
from Bio import SeqIO
prefixes = []
suffixes = []
handle = open('samplefile.fasta', 'r')
for record in SeqIO.parse(handle, 'fasta'):
count1 = 0
count2 = 0
prefix = [record.id]
suffix = [record.id]
pre = ''
suf = ''
for nt in record.seq:
if count1 < 3:
pre += nt
count1 += 1
prefix.append(pre)
for tn in reversed(record.seq):
if count2 < 3:
suf += tn
count2 += 1
suffix.append(''.join(reversed(suf)))
prefixes.append(prefix)
suffixes.append(suffix)
handle.close()
for i, k in enumerate(suffixes):
currentsf = suffixes[i][1]
currentid = suffixes[i][0]
for j, l in enumerate(prefixes):
if currentsf == prefixes[j][1] and currentid != prefixes[j][0]:
print(currentid, prefixes[j][0])
In the first part of this program I load the prefixes and suffixes into two separate lists together with their corresponding ID. At first I tried using dictionaries for this because I thought it would be easy to extract the corresponding ID's when a match was found. Unfortunately this wasn't as straightforward as I initially thought because many of the values (the suffixes and prefixes, respectively) were identical. I wanted to use the value in one dictionary to find the corresponding key to that value in the other dictionary. I spent quite a bit of time trying to get this to work, trying several different approaches, but finally I had to give up and decided to use lists instead. However, the attempt to use dictionaries was not completely in vain, because now I know a lot more about dictionaries!
No comments:
Post a Comment