Tuesday 5 July 2016

Finding a Shared Motif

When I first looked at this problem I thought that solving it would be pretty straightforward and that I could somehow reuse some of the code I wrote for Finding a Motif in DNA, but I was wrong. In this problem we are asked to find the longest common substring of a set of DNA-sequences in a FASTA-file. The problem sounds similar to the one presented in Finding a Motif in DNA, but this time we don't know anything about the motif, so using the same approach as last time won't work. Time to rethink.

The first thing however, was as usual to parse the FASTA file and get the data into a suitable format. I chose to put the sequences as strings in a list. The following piece of code does that:

from Bio import SeqIO                      
sequences = []                             
handle = open('sampledata.fasta', 'r')     
for record in SeqIO.parse(handle, 'fasta'):
    sequence = []                          
    seq = ''                               
    for nt in record.seq:                  
        seq += nt                          
    sequences.append(seq)                  
handle.close()                             

What I then wanted to do was to compare all the possible motifs in the shortest sequence to the remaining sequences. To do this I first sorted the list containing the sequences and picked out the shortest one. I could then iterate over all the possible motifs in that sequence and test if they were also present in all of the other sequences. The longest of the motifs that are present in all of the sequences is then saved and printed. This was the easiest solution to the problem that I could think of, but I'll admit it's probably not the neatest or the most efficient solution.

srt_seq = sorted(sequences, key=len)     
short_seq = srt_seq[0]                   
comp_seq = srt_seq[1:]                   
motif = ''                               
for i in range(len(short_seq)):          
    for j in range(i, len(short_seq)):   
        m = short_seq[i:j + 1]           
        found = False                    
        for sequ in comp_seq:            
            if m in sequ:                
                found = True             
            else:                        
                found = False            
                break                    
        if found and len(m) > len(motif):
            motif = m                    
print(motif)                             

2 comments:

  1. Thanks for sharing! I am using Rosalind to get in some practice this summer and this was just what I was looking for!

    ReplyDelete
  2. Thank you so much for sharing! This was very helpful!

    ReplyDelete