Wednesday, 3 August 2016

Error Correction in Reads

In this problem we are looking at the errors that occur during genome sequencing. We are given a set of reads of equal length, some of which contain an error (one nt is exchanged). The reads are either of the following:
  1. The read was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement)
  2. The read is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement)
From this dataset we are expected to find the incorrect reads and correct them.

Sample Dataset
>Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCC

Expected Output
TTCAT->TTGAT
GAGGA->GATGA
TTTCC->TTTCA

I really enjoyed working on this problem and I managed to solve it pretty quickly. A problem I had though was that I would get the correct result for the sample dataset, but when I tried with a 'real' dataset I got way too many reads in my answer, something like 4 times as many as the input reads. I looked at my output and could conclude that a lot of the reads occurred multiple times. I figured something was wrong in the part of my program where I build the list of correct reads, and sure enough, I managed to find the error and fix it (highlighted below). The following code is what I finally ended up with:

from Bio import SeqIO                                                 
from Bio.Seq import Seq                                               
from Bio.Alphabet import generic_dna                                  
reads = []                                                            
handle = open('sampledata.fasta', 'r')                                
for record in SeqIO.parse(handle, 'fasta'):                           
    reads.append(str(record.seq))                                     
handle.close()                                                        

right = []                                                            
wrong = []                                                            
for i, j in enumerate(reads):                                         
    read = Seq(j, generic_dna)                                        
    rev_read = read.reverse_complement()                              
    for k in range(i + 1, len(reads)):                                
        if read == reads[k] or rev_read == reads[k]:                  
            if read not in right and rev_read not in right:           
                right.append(str(read))                               
                right.append(str(rev_read))                           

for l in reads:                                                       
    if l not in right:                                                
        wrong.append(l)                                               

for incorrect in wrong:                                               
    for correct in right:                                             
        hamming = 0                                                   
        for nt1, nt2 in zip(incorrect, correct):                      
            if nt1 != nt2:                                            
                hamming += 1                                          
                if hamming > 2:                                       
                    break                                             
        if hamming == 1:                                              
            with open('answer.txt', 'a') as textfile:                 
                print(incorrect, '->', correct, sep='', file=textfile)





No comments:

Post a Comment