- The read was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement)
- The read is incorrect: it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement)
From this dataset we are expected to find the incorrect reads and correct them. The two operations this hinges on, reverse complement and Hamming distance, are sketched below.
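To make those conditions concrete, here is a minimal sketch of the two operations in plain Python (the helper names are mine and are not part of the solution further down):

COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def reverse_complement(read):
    # Complement each base, then read the result backwards.
    return ''.join(COMPLEMENT[nt] for nt in reversed(read))

def hamming_distance(a, b):
    # Number of positions at which two equal-length reads differ.
    return sum(nt1 != nt2 for nt1, nt2 in zip(a, b))

# From the sample dataset below: GATGA is the reverse complement of TCATC,
# and TTCAT is at Hamming distance 1 from TTGAT.
assert reverse_complement('TCATC') == 'GATGA'
assert hamming_distance('TTCAT', 'TTGAT') == 1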
Sample Dataset
>Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCC
Expected Output
TTCAT->TTGAT
GAGGA->GATGA
TTTCC->TTTCA
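Note that the corrected read GATGA never appears in the dataset itself: it is the reverse complement of TCATC, which occurs twice and therefore counts as a correct read.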
I really enjoyed working on this problem and managed to solve it pretty quickly. One problem I did have was that I got the correct result for the sample dataset, but when I tried a 'real' dataset I got far too many reads in my answer, something like four times as many as the input reads. Looking at my output, I could see that a lot of the reads occurred multiple times, so I figured something was wrong in the part of my program where I build the list of correct reads. Sure enough, I managed to find the error and fix it (marked with a comment in the code below). The following code is what I finally ended up with:
from Bio import SeqIO
from Bio.Seq import Seq

# Read all sequences from the FASTA file into a plain list of strings.
# (Bio.Alphabet has been removed from recent Biopython versions, and
# Seq no longer needs an alphabet argument.)
reads = []
with open('sampledata.fasta', 'r') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        reads.append(str(record.seq))

# A read is correct if it, or its reverse complement, appears in the
# dataset at least twice.
right = []
wrong = []
for i, j in enumerate(reads):
    read = Seq(j)
    rev_read = read.reverse_complement()
    for k in range(i + 1, len(reads)):
        if read == reads[k] or rev_read == reads[k]:
            # This membership check was the fix: without it, a read that
            # matches several later reads gets appended again and again.
            if read not in right and rev_read not in right:
                right.append(str(read))
                right.append(str(rev_read))

# Anything not in the list of correct reads must be an incorrect read.
for r in reads:
    if r not in right:
        wrong.append(r)

# Pair each incorrect read with the single correct read at Hamming
# distance 1 and write the correction to the answer file.
with open('answer.txt', 'a') as textfile:
    for incorrect in wrong:
        for correct in right:
            hamming = 0
            for nt1, nt2 in zip(incorrect, correct):
                if nt1 != nt2:
                    hamming += 1
                if hamming > 1:
                    break  # distance already exceeds 1, stop counting
            if hamming == 1:
                print(incorrect, '->', correct, sep='', file=textfile)
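As a side note, the quadratic pass that classifies the reads could be replaced by a single counting pass. Here is a rough sketch of that idea (my own variation, not part of the solution above), using collections.Counter from the standard library and reusing the reads list built earlier:

from collections import Counter
from Bio.Seq import Seq

# Count every read together with its reverse complement in one pass.
counts = Counter()
for r in reads:
    counts[r] += 1
    counts[str(Seq(r).reverse_complement())] += 1

# A read (or reverse complement) seen at least twice is correct; a read
# seen exactly once must be the erroneous one.
right = [r for r in counts if counts[r] >= 2]
wrong = [r for r in reads if counts[r] < 2]

The pairing step at the end stays the same; only the classification goes from comparing every pair of reads to a single linear scan.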