Thursday, 13 October 2016

Creating a Character Table (Finished!)

The approach I wrote about in my last post turned out to be a lot harder to implement than I initially thought, and I actually never made it past the step of splitting the tree into the possible sub-trees. After bashing my head against the problem a bit too long, I decided to try a different approach, using Biopython. The Phylo module of Biopython has a method for parsing trees in Newick format, and by using this I was also able to extract internal and external nodes of the tree in a much easier way than in my initial approach. The final version of my code can be found here on my GitHub, along with the problem description and some sample data files.

Wednesday, 12 October 2016

Creating a Character Table

Last week I finished going through the interactive edition of the book "How to Think Like a Computer Scientist". Even though I knew quite a bit of what was mentioned in the book, I learned quite a bit about Python programming and computer science in general. I would definitely recommend people with a similar background as me (coming from the biological sciences) to have a go at this book, and I especially found the parts on classes and objects useful.

When I was finished with the book I went back to working through the remaining Rosalind problems. I am currently working on "Creating a Character Table". It took me a while just to understand what they are after in this problem, but I finally figured it out and I have been able to make some progress on my program.

So, what they are after is the so called character table of an unrooted binary tree in Newick format. This table is formed of rows representing the character arrays of the tree. A character array is formed as follows:


  1. Remove one edge of the tree so that two trees are formed (the new trees can't consist of only one taxon, i.e. the split must be nontrivial).
  2. Assign one tree as the 0 tree (representing its taxons not having a specific trait), and the other tree as the 1 tree (representing its taxons having a specific trait). This is done arbitrarily. 
  3. Sort the taxons lexicographically and print their corresponding number (0 or 1).
The following code is the beginning of my solution to the problem. So far it opens and reads the data file. It then iterates over the tree (in string format) and finds where the splits are, i.e. where all '),' are located in the tree. I have also started writing code for a class that can couple the name of a taxon with its assigned value and then sort these lexicograpically.

class tree_index:
    def __init__(self, n, v):
        self.name = n
        self.value = v

def read_tree(s):
    for i in range(len(s)-1):
        if  s[i] + s[i+1] == '),':
            count = 0
            for j in range(len(s)):
                if s[j] == '(':
                    count += 1
                if s[j] == ')':
                    count -= 1
                print(count)


with open('sampledata.txt', 'r') as f:
    tree = f.read().strip()

read_tree(tree)

What I want to do next is to find a way to split the tree up into two trees each time '),' appears. I then want to couple the names of the taxons to the number of the tree they belong to by using the class tree_index. If I can then add a sort functionality to the class I think the problem would be solved. I am currently working on this and will return here when I manage to find a solution.

Until then!