The first step, as usual, was to parse the FASTA file and get the data into a suitable format. I chose to store the sequences as strings in a list. The following piece of code does that:
from Bio import SeqIO

sequences = []
# Read every record in the FASTA file and store its sequence as a plain string
with open('sampledata.fasta', 'r') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        sequences.append(str(record.seq))
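If Biopython is not installed, the same list of strings can be built with a few lines of plain Python. The following is only a minimal sketch: `read_fasta` is a hypothetical helper name, and it assumes a simple FASTA layout in which every record starts with a `>` header line followed by one or more sequence lines. Records with a header but no sequence are skipped.

```python
def read_fasta(lines):
    """Collect the sequence of each FASTA record as one string."""
    sequences = []
    current = []  # sequence lines of the record being read
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # A header starts a new record: store the previous one, if any
            if current:
                sequences.append(''.join(current))
            current = []
        else:
            current.append(line)
    # Store the last record in the input
    if current:
        sequences.append(''.join(current))
    return sequences


# In-memory example; an open file handle can be passed in the same way
example = ['>seq1', 'GATT', 'ACA', '>seq2', 'TAGACCA']
parsed = read_fasta(example)
print(parsed)
```

Because the function only iterates over lines, it accepts an open file handle just as well as a list of strings.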
What I then wanted to do was to compare every possible motif (that is, every substring) of the shortest sequence with the remaining sequences. A motif shared by all sequences can never be longer than the shortest sequence, so that is the natural place to look. To do this, I first sorted the list of sequences by length and picked out the shortest one. I could then iterate over all the possible motifs in that sequence and test whether each one was also present in all of the other sequences. The longest motif present in all of the sequences is then saved and printed. This was the easiest solution to the problem that I could think of, but I'll admit it's probably not the neatest or the most efficient one.
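# Sort by length so that the shortest sequence comes first
srt_seq = sorted(sequences, key=len)
short_seq = srt_seq[0]
comp_seq = srt_seq[1:]
motif = ''
# Try every substring short_seq[i:j + 1] of the shortest sequence
for i in range(len(short_seq)):
    for j in range(i, len(short_seq)):
        m = short_seq[i:j + 1]
        found = False
        # The candidate must occur in every one of the other sequences
        for sequ in comp_seq:
            if m in sequ:
                found = True
            else:
                found = False
                break
        # Keep the candidate only if it beats the longest motif so far
        if found and len(m) > len(motif):
            motif = m
print(motif)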
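One possible way to cut down the work is to try the longest candidates first and stop at the first one that occurs in every sequence. The sketch below is a variation on the approach above, not the code used in this post: `longest_common_motif` is a hypothetical function name, and the three example strings are made up for illustration.

```python
def longest_common_motif(sequences):
    """Return a longest substring shared by every sequence ('' if none)."""
    if not sequences:
        return ''
    # A shared motif cannot be longer than the shortest sequence
    shortest = min(sequences, key=len)
    # Try candidate lengths from longest to shortest
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            # Stop at the first candidate found in every sequence
            if all(candidate in s for s in sequences):
                return candidate
    return ''


# Example data, assumed for illustration only
example = ['GATTACA', 'TAGACCA', 'ATACA']
result = longest_common_motif(example)
print(result)
```

Because the search starts with the longest candidates, it can return as soon as a match is found instead of testing every substring. The worst case is unchanged, though, so this is still a brute-force method.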
Thanks for sharing! I am using Rosalind to get in some practice this summer and this was just what I was looking for!
Thank you so much for sharing! This was very helpful!