Friday 1 July 2016

Overlap Graphs

In this Rosalind problem we are entering (or at least peeking in on) the world of graph theory. Graph theory has a huge number of applications in a wide variety of fields. In biology it can be used for tracking diseases, to look for breeding patterns or, as in this case, finding an overlap graph for a set of sequences.

For a given collection of sequences in FASTA-format, we are given the task to print the adjacency list of the overlap graph of the sequences with the overlap length of 3 bp. To do this, we need to compare the suffixes of all the sequences to the prefixes. When a match is found the ID of the two sequences should be printed (the ID of the sequence containing the suffix should be printed on the left and the other one should be printed on the right). To avoid directed loops in the overlap graph, we should not print any sequences that have a suffix that matches its own prefix.

Below follows my somewhat messy but working code:

from Bio import SeqIO                                                  
prefixes = []                                                          
suffixes = []                                                          
handle = open('samplefile.fasta', 'r')                                 
for record in SeqIO.parse(handle, 'fasta'):                            
    count1 = 0                                                         
    count2 = 0                                                         
    prefix = [record.id]                                               
    suffix = [record.id]                                               
    pre = ''                                                           
    suf = ''                                                           
    for nt in record.seq:                                              
        if count1 < 3:                                                 
            pre += nt                                                  
            count1 += 1                                                
    prefix.append(pre)                                                 
    for tn in reversed(record.seq):                                    
        if count2 < 3:                                                 
            suf += tn                                                  
            count2 += 1                                                
    suffix.append(''.join(reversed(suf)))                              
    prefixes.append(prefix)                                            
    suffixes.append(suffix)                                            
handle.close()                                                         
                                                                       
for i, k in enumerate(suffixes):                                       
    currentsf = suffixes[i][1]                                         
    currentid = suffixes[i][0]                                         
    for j, l in enumerate(prefixes):                                   
        if currentsf == prefixes[j][1] and currentid != prefixes[j][0]:
            print(currentid, prefixes[j][0])                           

In the first part of this program I load the prefixes and suffixes into two separate lists together with their corresponding ID. At first I tried using dictionaries for this because I thought it would be easy to extract the corresponding ID's when a match was found. Unfortunately this wasn't as straightforward as I initially thought because many of the values (the suffixes and prefixes, respectively) were identical. I wanted to use the value in one dictionary to find the corresponding key to that value in the other dictionary. I spent quite a bit of time trying to get this to work, trying several different approaches, but finally I had to give up and decided to use lists instead. However, the attempt to use dictionaries was not completely in vain, because now I know a lot more about dictionaries!

No comments:

Post a Comment