Wednesday, 17 August 2016

Creating a Distance Matrix

This problem was fairly similar to "Counting Point Mutations" and "Error Correction in Reads". We are given a set of DNA strings of equal length and are asked to return their distance matrix. The distance between two strings is the Hamming distance (i.e. the number of positions at which the nucleotides differ) divided by the length of the sequences. These values should then be printed in matrix form with five decimal places.
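As a quick illustration of the distance described above, here is a minimal sketch for a single pair of strings. The function name `p_distance` is just an illustrative choice and is not part of the code further down; the two strings are the first two sequences from the sample dataset.

```python
def p_distance(s1, s2):
    """Proportion of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    # Hamming distance: count the positions where the characters differ
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches / len(s1)


# 4 of the 10 positions differ, so this prints 0.40000
print(f"{p_distance('TTTCCATTTA', 'GATTCATTTC'):.5f}")
```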

Sample Dataset
>Rosalind_9499
TTTCCATTTA
>Rosalind_0942
GATTCATTTC
>Rosalind_6568
TTTCCATTTT
>Rosalind_1833
GTTCCATTTA

Expected Output
0.00000 0.40000 0.10000 0.10000
0.40000 0.00000 0.40000 0.30000
0.10000 0.40000 0.00000 0.20000
0.10000 0.30000 0.20000 0.00000

To write this program I reused parts of the code from "Counting Point Mutations" and "Error Correction in Reads" and modified it to suit this problem. The following is the final code, which took me only 20 minutes to write. Hurray!

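from Bio import SeqIO

# Read all sequences from the FASTA file into a list of plain strings
reads = []
with open('sampledata.fasta', 'r') as f:
    for record in SeqIO.parse(f, 'fasta'):
        reads.append(str(record.seq))

# All sequences are assumed to have the same length
read_len = len(reads[0])

# Each sequence is compared against every sequence (itself included),
# which gives one row of the distance matrix
for curr_read in reads:
    distance = []
    for comp_read in reads:
        # Hamming distance: number of positions where the nucleotides differ
        hamming = 0
        for nt1, nt2 in zip(curr_read, comp_read):
            if nt1 != nt2:
                hamming += 1
        # Normalise by the sequence length and format with 5 decimals
        distance.append('{0:.5f}'.format(hamming / read_len))
    print(*distance, sep=' ')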
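The matrix is symmetric and its diagonal is always zero, so each pair only needs to be compared once. The following is a rough sketch of that variant, not the code I submitted: the function name `distance_matrix` is made up for this example, and the sample sequences are hard-coded so that it runs without Biopython or an input file.

```python
def distance_matrix(seqs):
    """Return the matrix of pairwise distances as a list of lists."""
    n = len(seqs)
    length = len(seqs[0])  # assumes all sequences have equal length
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only the upper triangle is computed; the value is mirrored
        for j in range(i + 1, n):
            mismatches = sum(a != b for a, b in zip(seqs[i], seqs[j]))
            matrix[i][j] = matrix[j][i] = mismatches / length
    return matrix


sample = ["TTTCCATTTA", "GATTCATTTC", "TTTCCATTTT", "GTTCCATTTA"]
for row in distance_matrix(sample):
    print(" ".join(f"{d:.5f}" for d in row))
```

For the sample dataset this prints the same four rows as the expected output above. For the small inputs in this problem the saving is negligible, so the straightforward double loop is just as good.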
