Sara does Bioinformatics: Finding a Shared Motif

When I first looked at this problem I thought that solving it would be pretty straightforward and that I could somehow reuse some of the code I wrote for Finding a Motif in DNA, but I was wrong. In this problem we are asked to find the longest common substring of a set of DNA-sequences in a FASTA-file. The problem sounds similar to the one presented in Finding a Motif in DNA, but this time we don't know anything about the motif, so using the same approach as last time won't work. Time to rethink.

The first thing however, was as usual to parse the FASTA file and get the data into a suitable format. I chose to put the sequences as strings in a list. The following piece of code does that:

from Bio import SeqIO

sequences = []

handle = open('sampledata.fasta', 'r')

for record in SeqIO.parse(handle, 'fasta'):

sequence = []

seq = ''

for nt in record.seq:

seq += nt

sequences.append(seq)

handle.close()

What I then wanted to do was to compare all the possible motifs in the shortest sequence to the remaining sequences. To do this I first sorted the list containing the sequences and picked out the shortest one. I could then iterate over all the possible motifs in that sequence and test if they were also present in all of the other sequences. The longest of the motifs that are present in all of the sequences is then saved and printed. This was the easiest solution to the problem that I could think of, but I'll admit it's probably not the neatest or the most efficient solution.

srt_seq = sorted(sequences, key=len)

short_seq = srt_seq[0]

comp_seq = srt_seq[1:]

motif = ''

for i in range(len(short_seq)):

for j in range(i, len(short_seq)):

m = short_seq[i:j + 1]

found = False

for sequ in comp_seq:

if m in sequ:

found = True

else:

found = False

break

if found and len(m) > len(motif):

motif = m

print(motif)

Sara does Bioinformatics

Tuesday, 5 July 2016

Finding a Shared Motif

2 comments: