We are asked to compare two DNA sequences of equal length and classify each point mutation as either a transition (a purine replaced by the other purine, A and G, or a pyrimidine replaced by the other pyrimidine, C and T) or a transversion (a purine replaced by a pyrimidine, or vice versa). We should then return the transition/transversion ratio for the two sequences.
Sample dataset:
>Rosalind_0209
GCAACGCACAACGAAAACCCTTAGGGACTGGATTATTTCGTGATCGTTGTAGTTATTGGA
AGTACGGGCATCAACCCAGTT
>Rosalind_2200
TTATCTGACAAAGAAAGCCGTCAACGGCTGGATAATTTCGCGATCGTGCTGGTTACTGGC
GGTACGAGTGTTCCTTTGGGT
Expected output:
1.21428571429
The problem is very similar to Counting Point Mutations, in which we calculated the Hamming distance. I used my code from that problem as a starting point, and this is the altered code:
from Bio import SeqIO

# Read both sequences from the FASTA file
sequences = []
with open('sampledata.fasta', 'r') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        sequences.append(str(record.seq))

s1 = sequences[0]
s2 = sequences[1]

transition = 0
transversion = 0
AG = ['A', 'G']  # purines
CT = ['C', 'T']  # pyrimidines

# Walk through both sequences in parallel and classify each mismatch
for nt1, nt2 in zip(s1, s2):
    if nt1 != nt2:
        if nt1 in AG and nt2 in AG:
            transition += 1
        elif nt1 in CT and nt2 in CT:
            transition += 1
        else:
            transversion += 1

# Relies on Python 3 true division; fails if there are no transversions
print('%0.11f' % (transition / transversion))
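For anyone who wants to try the counting logic without Biopython or a FASTA file, here is a minimal sketch of the same idea wrapped in functions. The names count_substitutions and ts_tv_ratio are my own for illustration and are not part of the original solution. It assumes the sequences contain only A, C, G and T (any other character in a mismatch would be counted as a transversion), and it raises an error instead of dividing by zero when there are no transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_substitutions(s1, s2):
    """Return (transitions, transversions) for two equal-length DNA strings.

    Assumes only the characters A, C, G and T are present.
    """
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")

    transitions = 0
    transversions = 0
    for nt1, nt2 in zip(s1, s2):
        if nt1 == nt2:
            continue  # not a mutation
        pair = {nt1, nt2}
        # Both bases in the same chemical class -> transition
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


def ts_tv_ratio(s1, s2):
    """Return the transition/transversion ratio as a float."""
    transitions, transversions = count_substitutions(s1, s2)
    if transversions == 0:
        raise ValueError("no transversions; the ratio is undefined")
    return transitions / transversions
```

On the sample dataset above this counts 17 transitions and 14 transversions, and 17 / 14 formatted with '%0.11f' gives the expected 1.21428571429.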