Sara does Bioinformatics: Error Correction in Reads

In this problem we are looking at the errors that occur during genome sequencing. We are given a set of reads of equal length, some of which contain an error (one nt is exchanged). The reads are either of the following:

The read was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement)
The read is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement)

From this dataset we are expected to find the incorrect reads and correct them.

Sample Dataset

>Rosalind_52

TCATC

>Rosalind_44

TTCAT

>Rosalind_68

TCATC

>Rosalind_28

TGAAA

>Rosalind_95

GAGGA

>Rosalind_66

TTTCA

>Rosalind_33

ATCAA

>Rosalind_21

TTGAT

>Rosalind_18

TTTCC

Expected Output

TTCAT->TTGAT

GAGGA->GATGA

TTTCC->TTTCA

I really enjoyed working on this problem and I managed to solve it pretty quickly. A problem I had though was that I would get the correct result for the sample dataset, but when I tried with a 'real' dataset I got way too many reads in my answer, something like 4 times as many as the input reads. I looked at my output and could conclude that a lot of the reads occurred multiple times. I figured something was wrong in the part of my program where I build the list of correct reads, and sure enough, I managed to find the error and fix it (highlighted below). The following code is what I finally ended up with:

from Bio import SeqIO

from Bio.Seq import Seq

from Bio.Alphabet import generic_dna

reads = []

handle = open('sampledata.fasta', 'r')

for record in SeqIO.parse(handle, 'fasta'):

reads.append(str(record.seq))

handle.close()

right = []

wrong = []

for i, j in enumerate(reads):

read = Seq(j, generic_dna)

rev_read = read.reverse_complement()

for k in range(i + 1, len(reads)):

if read == reads[k] or rev_read == reads[k]:

if read not in right and rev_read not in right:

right.append(str(read))

right.append(str(rev_read))

for l in reads:

if l not in right:

wrong.append(l)

for incorrect in wrong:

for correct in right:

hamming = 0

for nt1, nt2 in zip(incorrect, correct):

if nt1 != nt2:

hamming += 1

if hamming > 2:

break

if hamming == 1:

with open('answer.txt', 'a') as textfile:

print(incorrect, '->', correct, sep='', file=textfile)

Sara does Bioinformatics

Wednesday, 3 August 2016

Error Correction in Reads

No comments:

Post a Comment