Wednesday 14 September 2016

Distances in Trees

In this problem we are looking at the Newick format and how to find the distance between two nodes in a phylogenetic tree. We are given a file containing trees in Newick format, with two nodes named for each tree, and are asked to find the distance between those nodes.

Sample Dataset
(cat)dog;
dog cat

(dog,cat);
dog cat

Expected Output
1 2

I remember working with the Newick format before, in one of the bioinformatics courses I took, so when I started on this problem I recalled that Biopython has functions for the format. I had a look at the documentation, and sure enough, there is a function called distance that is suitable. However, as always, there was a slight problem. The trees given by Rosalind do not contain any branch lengths, which is what the distance function sums to calculate the distance between two nodes. To be able to use this function I therefore had to assign every branch a length of 1 (done in the inner loop over the clades). The following code (also available on GitHub) yielded a result accepted by Rosalind:

import sys
from Bio import Phylo
import io

# Open the file and split it into [tree, node pair] entries
with open('rosalind_nwck.txt', 'r') as f:
    pairs = [i.split('\n') for i in f.read().strip().split('\n\n')]

# For each pair:
# - parse the tree with Biopython
# - set the branch length of every branch to 1
# - use Biopython's Phylo distance function to get the distance
# - print the result in the requested format

for i, line in pairs:
    x,y = line.split()
    tree = Phylo.read(io.StringIO(i),'newick')
    clades = tree.find_clades()
    for clade in clades:
        clade.branch_length = 1
    sys.stdout.write('%s' % tree.distance(x,y) + ' ')
sys.stdout.write('\n')
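
For comparison, the same distances can be worked out without Biopython by counting edges directly. The sketch below is not the code submitted to Rosalind. It is a minimal illustration that assumes the tree has no whitespace between tokens and that the two requested labels are unique; the helper names parse_newick, edge_distance and newick_distance are made up for this example.

```python
def parse_newick(newick):
    """Parse a Newick string into (parent, label_to_id).

    parent maps each non-root node id to its parent's id;
    label_to_id maps each node label to its node id.
    Assumes no whitespace between tokens and unique labels.
    """
    text = newick.strip().rstrip(';')
    parent = {}
    label_to_id = {}
    pos = 0
    n_nodes = 0

    def read_node(parent_id):
        nonlocal pos, n_nodes
        node_id = n_nodes
        n_nodes += 1
        if parent_id is not None:
            parent[node_id] = parent_id
        if pos < len(text) and text[pos] == '(':
            pos += 1                      # consume '('
            read_node(node_id)            # first child
            while pos < len(text) and text[pos] == ',':
                pos += 1                  # consume ','
                read_node(node_id)        # next sibling
            pos += 1                      # consume ')'
        # Optional label; anything after ':' (a branch length) is ignored
        start = pos
        while pos < len(text) and text[pos] not in '(),':
            pos += 1
        label = text[start:pos].split(':')[0].strip()
        if label:
            label_to_id[label] = node_id

    read_node(None)
    return parent, label_to_id


def edge_distance(parent, a, b):
    """Count the edges on the path between node ids a and b."""
    # Record how many edges lie between a and each of its ancestors
    steps_from_a = {}
    node, steps = a, 0
    while True:
        steps_from_a[node] = steps
        if node not in parent:
            break
        node, steps = parent[node], steps + 1
    # Walk up from b until the path from a is reached (the common ancestor)
    node, steps = b, 0
    while node not in steps_from_a:
        node, steps = parent[node], steps + 1
    return steps + steps_from_a[node]


def newick_distance(newick, x, y):
    """Distance in edges between the nodes labelled x and y."""
    parent, label_to_id = parse_newick(newick)
    return edge_distance(parent, label_to_id[x], label_to_id[y])


print(newick_distance('(cat)dog;', 'dog', 'cat'))   # 1
print(newick_distance('(dog,cat);', 'dog', 'cat'))  # 2
```

Every edge counts as 1 here, which is the same effect as setting each branch length to 1 before calling the Biopython distance function. The parser is recursive, so a very deeply nested tree could hit Python's recursion limit.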
