Wednesday 30 November 2016

Matching a Spectrum to a Protein

In this problem we are given a set of n protein strings and a multiset of weights corresponding to the complete spectrum of some unknown protein string. We are then asked to find the maximum multiplicity and the protein string it corresponds to. My solution to the problem can be found on my GitHub. For each protein string, it calculates the weight of each prefix and suffix and checks whether they correspond to any of the given weights of the protein spectrum. Each time they do, a counter is increased for that specific protein string. Finally, the highest count is printed together with the corresponding string.

Sample Dataset
4
GSDMQS
VWICN
IASWMQS
PVSMGAD
445.17838
115.02694
186.07931
314.13789
317.1198
215.09061

Sample Output
3
IASWMQS

I actually didn't manage to replicate the result above, which is the example given in the problem description. However, when I tried my program with a larger dataset and uploaded my answer to Rosalind, I got the correct result, so I must have done something right!
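
For reference, the prefix/suffix counting described above can be sketched roughly like this. This is a simplified stand-in rather than the exact code on GitHub, and the input file name, the parsing and the three-decimal rounding are assumptions:

#a rough sketch of the approach; file name, parsing and rounding are assumptions
mass_table = {'A':71.03711,'C':103.00919,'D':115.02694,'E':129.04259,'F':147.06841,'G':57.02146,'H':137.05891,'I':113.08406,'K':128.09496,'L':113.08406,'M':131.04049,'N':114.04293,'P':97.05276,'Q':128.05858,'R':156.10111,'S':87.03203,'T':101.04768,'V':99.06841,'W':186.07931,'Y':163.06333}

def spectrum_matches(protein, spectrum):
    #count how many prefix and suffix weights of protein occur in the spectrum
    weights = set(round(w, 3) for w in spectrum)
    matches = 0
    for i in range(1, len(protein)):
        prefix = sum(mass_table[aa] for aa in protein[:i])
        suffix = sum(mass_table[aa] for aa in protein[i:])
        if round(prefix, 3) in weights:
            matches += 1
        if round(suffix, 3) in weights:
            matches += 1
    return matches

with open('rosalind_prsm.txt', 'r') as f:
    tokens = f.read().split()
n = int(tokens[0])
proteins = tokens[1:n + 1]
spectrum = [float(x) for x in tokens[n + 1:]]

best = max(proteins, key=lambda p: spectrum_matches(p, spectrum))
print(spectrum_matches(best, spectrum))
print(best)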

Thursday 24 November 2016

Global Alignment with Scoring Matrix

In this problem we are asked to find the alignment score of two protein sequences using the scoring matrix BLOSUM62 and a linear gap penalty of 5. I decided to use Biopython's pairwise2 for this, as you can see in the code below and on my GitHub.

from Bio import pairwise2
from Bio import SeqIO
from Bio.SubsMat.MatrixInfo import blosum62

seqs = []
with open('rosalind_glob.fasta', 'r') as f:
    for record in SeqIO.parse(f, 'fasta'):
        seqs.append(record.seq)
s = seqs[0]
t = seqs[1]
alignments = pairwise2.align.globalds(s, t, blosum62, -5, -5)
print(pairwise2.format_alignment(*alignments[0]))

Note that the gap open penalty and the gap extension penalty are both set to -5, as we are told in the problem description to use a linear gap penalty of 5. 

Thursday 17 November 2016

Creating a Character Table from Genetic Strings

In this problem we are given a set of DNA strings and are asked to create a character table. My solution to the problem can be found here. I got a bit hung up on the problem to start with because in the description they give the following sample data and output:

Sample Dataset
ATGCTACC
CGTTTACC
ATTCGACC
AGTCTCCC
CGTCTATC

Sample Output
10110
10100

It took me a while to realise that the output is only partial, and what they are really after is:

10110
10100
10000
10111
11011
11101

This is a character array for each nontrivial position in the sequences. Once I had finally figured this out, the programming itself didn't take too long.
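
Roughly, the idea can be sketched like this (not my exact GitHub code): for each position, write a 1 for every string that carries the same symbol as the first string and a 0 for the rest, keeping only the positions where more than one symbol occurs. Exactly which positions count as nontrivial was the part that tripped me up, so treat the filter below as an assumption rather than the submitted solution.

#a rough sketch; the file name and the nontriviality filter are assumptions
with open('rosalind_cstr.txt', 'r') as f:
    dna = [line.strip() for line in f if line.strip()]

rows = []
for pos in range(len(dna[0])):
    column = [seq[pos] for seq in dna]
    if len(set(column)) > 1:    #skip positions where all strings agree
        rows.append(''.join('1' if nt == column[0] else '0' for nt in column))

print('\n'.join(rows))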

Wednesday 16 November 2016

Counting Disease Carriers

Once again we are looking at probability, this time regarding the occurrence of alleles in a population following the Hardy-Weinberg principle. For a population in genetic equilibrium for some given alleles, we are given the frequency of homozygous recessive individuals for each allele and are asked to return, for each one, the probability that a randomly selected individual carries at least one copy of the recessive allele.

We can divide the population into three groups:

A - Homozygous recessive
B - Heterozygous
C - Homozygous dominant

We know that A + B + C = 1. The probability of picking individuals who are not homozygous dominants is P = 1 - C = A + B, which is what we are after.

If we denote the probability of a chromosome carrying the recessive allele by q, and the probability of it carrying the dominant allele by p, then since each individual carries two chromosomes, a homozygous recessive can be described as A = q^2, a heterozygous individual as B = 2pq, and a homozygous dominant as C = p^2.

If we put this together we get P = q^2 + 2pq. We also know that p + q = 1, so p = 1 - q, which gives P = q^2 + 2(1-q)q = 2q - q^2. Since A = q^2, we have q = A^0.5, and therefore P = 2A^0.5 - A.

Now we have our formula, so all we need to do is write a program that makes the calculation for all the given alleles. You can find my version here or below.

A = []
with open('rosalind_afrq.txt','r') as f:
    for nr in f.readline().split(' '):
        A.append(float(nr))

for i in A:
    print(round(2*i**0.5-i,3),end=' ')

Tuesday 15 November 2016

Newick Format with Edge Weights

If you have already solved "Distances in Trees", then this problem will probably not take that long to complete. When I solved distances in trees, I used the Biopython module Phylo. The problem I had then was that the given trees did not contain any branch lengths, so I had to add this in the program. However in this problem the branch lengths are already included in the trees, so the new program is in fact even shorter than the original. You can find my new version of the code below and on my GitHub.

import sys
from Bio import Phylo
import io

f = open('rosalind_nkew.txt', 'r')
pairs = [i.split('\n') for i in f.read().strip().split('\n\n')]

for i, line in pairs:
    x, y = line.split()
    tree = Phylo.read(io.StringIO(i), 'newick')
    sys.stdout.write('%s' % round(tree.distance(x,y)) + ' ')
sys.stdout.write('\n')

Tuesday 8 November 2016

Edit Distance Alignment

This problem builds on a previous one called Edit Distance, for which you can find a solution here. In that problem we were asked to find the minimum number of changes needed to transform a string s into a string t. In this problem, we are asked once again to find this number (the edit distance), but this time we should also print the two augmented strings representing an optimal alignment of s and t.

To do this I went back to my solution to the previous problem and added some code to pick out one of the possible optimal alignments from the matrix built by the existing program. The code is a bit messy, but for those who are interested, it is available on my GitHub.
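
The gist of the addition is a traceback over the finished distance matrix: start in the bottom-right corner and step back towards the top-left, emitting an aligned pair when the value came from the diagonal and a gap character when it came from the cell above or to the left. The sketch below is not the code on GitHub, just the idea, and it assumes a matrix M filled in exactly as in the Edit Distance post further down the page.

def traceback(M, s, t):
    #walk back from M[len(s)][len(t)] to M[0][0], building one optimal alignment
    aligned_s, aligned_t = '', ''
    i, j = len(s), len(t)
    while i > 0 or j > 0:
        #diagonal move: match (cost 0) or substitution (cost 1)
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            aligned_s = s[i - 1] + aligned_s
            aligned_t = t[j - 1] + aligned_t
            i, j = i - 1, j - 1
        #move up: a gap in t
        elif i > 0 and M[i][j] == M[i - 1][j] + 1:
            aligned_s = s[i - 1] + aligned_s
            aligned_t = '-' + aligned_t
            i -= 1
        #move left: a gap in s
        else:
            aligned_s = '-' + aligned_s
            aligned_t = t[j - 1] + aligned_t
            j -= 1
    return aligned_s, aligned_t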

Monday 7 November 2016

Constructing a De Bruijn Graph

In this problem we are asked to find the adjacency list corresponding to the De Bruijn graph constructed from a set of short reads and their reverse complements. This was actually a lot easier than I initially thought. After a bit of research on De Bruijn graphs and a bit of doodling, I figured out the following three steps to complete the task:


  1. Make the reverse complements and combine with the initial reads.
  2. Add all unique reads (whether they are reverse complements or the original reads) to a set.
  3. For each of the reads in the set, print the prefix followed by the suffix.
I set about writing a program that performs these three steps, and the following is what I came up with:


from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

def rev_comp(read):
    seq = Seq(read, IUPAC.unambiguous_dna)
    rev_seq = seq.reverse_complement()
    return(str(rev_seq))

def mk_set(seqs):
    S = set()
    for i in seqs:
        if i not in S:
            S.add(i)
        C = rev_comp(i)
        if C not in S:
            S.add(C)
    return(S)

data = []
with open('rosalind_dbru.txt','r') as f:
    for line in f:
        data.append(line.strip())

s = mk_set(data)
for j in s:
    print("(" + j[:-1] + ", " + j[1:] + ")")

The code can also be found on my GitHub along with the sample data and problem description.

Thursday 13 October 2016

Creating a Character Table (Finished!)

The approach I wrote about in my last post turned out to be a lot harder to implement than I initially thought, and I actually never made it past the step of splitting the tree into the possible sub-trees. After bashing my head against the problem a bit too long, I decided to try a different approach, using Biopython. The Phylo module of Biopython has a method for parsing trees in Newick format, and by using this I was also able to extract internal and external nodes of the tree in a much easier way than in my initial approach. The final version of my code can be found here on my GitHub, along with the problem description and some sample data files.
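
In broad strokes, the Phylo-based approach looks something like the sketch below: parse the tree, take the taxa in sorted order, and for every internal clade mark the taxa below it with a 1 and everything else with a 0. This is a simplification rather than the exact GitHub code, and the input file name and the nontrivial-split filter are assumptions on my part.

import io
from Bio import Phylo

def character_table(newick):
    #one row per internal clade: 1 if the taxon sits below the clade, 0 otherwise
    tree = Phylo.read(io.StringIO(newick), 'newick')
    taxa = sorted(leaf.name for leaf in tree.get_terminals())
    rows = []
    for clade in tree.get_nonterminals():
        below = set(leaf.name for leaf in clade.get_terminals())
        #keep only nontrivial splits: at least two taxa on each side of the edge
        if 2 <= len(below) <= len(taxa) - 2:
            rows.append(''.join('1' if t in below else '0' for t in taxa))
    return rows

with open('rosalind_ctbl.txt', 'r') as f:
    print('\n'.join(character_table(f.read().strip())))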

Wednesday 12 October 2016

Creating a Character Table

Last week I finished going through the interactive edition of the book "How to Think Like a Computer Scientist". Even though I already knew some of what was covered, I still learned quite a bit about Python programming and computer science in general. I would definitely recommend that people with a similar background to mine (coming from the biological sciences) have a go at this book, and I found the parts on classes and objects especially useful.

When I was finished with the book I went back to working through the remaining Rosalind problems. I am currently working on "Creating a Character Table". It took me a while just to understand what they are after in this problem, but I finally figured it out and I have been able to make some progress on my program.

So, what they are after is the so-called character table of an unrooted binary tree in Newick format. This table is formed of rows representing the character arrays of the tree. A character array is formed as follows:


  1. Remove one edge of the tree so that two trees are formed (the new trees can't consist of only one taxon, i.e. the split must be nontrivial).
  2. Assign one tree as the 0 tree (representing its taxons not having a specific trait), and the other tree as the 1 tree (representing its taxons having a specific trait). This is done arbitrarily. 
  3. Sort the taxons lexicographically and print their corresponding number (0 or 1).
The following code is the beginning of my solution to the problem. So far it opens and reads the data file. It then iterates over the tree (in string format) and finds where the splits are, i.e. where all '),' are located in the tree. I have also started writing code for a class that can couple the name of a taxon with its assigned value and then sort these lexicographically.

class tree_index:
    def __init__(self, n, v):
        self.name = n
        self.value = v

def read_tree(s):
    for i in range(len(s)-1):
        if  s[i] + s[i+1] == '),':
            count = 0
            for j in range(len(s)):
                if s[j] == '(':
                    count += 1
                if s[j] == ')':
                    count -= 1
                print(count)


with open('sampledata.txt', 'r') as f:
    tree = f.read().strip()

read_tree(tree)

What I want to do next is to find a way to split the tree up into two trees each time '),' appears. I then want to couple the names of the taxons to the number of the tree they belong to by using the class tree_index. If I can then add a sort functionality to the class I think the problem would be solved. I am currently working on this and will return here when I manage to find a solution.

Until then!

Friday 23 September 2016

Learning computer science

So far I have managed to solve 52 problems on Rosalind, and I feel like I have become a lot better at programming since I started. But recently the problems on Rosalind have become more and more complex, and I have sometimes had a hard time solving them or even understanding the solutions others have come up with. The main reason for this is my lack of a background in computer science. So I have decided to make up for this, starting with studying the basics of computer science.

I wanted to learn the basics of computer science, but I also wanted the focus to be on Python, which led me to the interactive book "How to Think Like a Computer Scientist". Yesterday I completed chapter 1, and today I will continue with chapter 2. So far it's pretty basic, but even so I have already learned quite a bit! When I'm finished I will try to find a more formal online course to take.

Monday 19 September 2016

Comparing Spectra with the Spectral Convolution

In this task we are given two multisets, S1 and S2, each representing a simplified spectrum taken from one of two peptides. We are asked to find the largest multiplicity of the Minkowski difference S1⊖S2 and the absolute value of the number x maximizing (S1⊖S2)(x). It took me a while to figure out what I was meant to find, but ultimately it is the number that occurs most frequently in the multiset formed by S1⊖S2, together with its frequency.

Sample Dataset
186.07931 287.12699 548.20532 580.18077 681.22845 706.27446 782.27613 968.35544 968.35544
101.04768 158.06914 202.09536 318.09979 419.14747 463.17369

Expected Output
3
85.03163

Once I had realized what I was after, the programming itself was very easy. All I had to do was generate the multiset formed by S1⊖S2 and pick out the most frequently occurring number. As it turns out, there is a really useful built-in module for this called collections. Using its Counter class and the most_common method, I got precisely what I wanted. Have a look at the code below or on my GitHub.

from collections import Counter

data = []
with open('rosalind_conv.txt','r') as f:
    for line in f:
        data.append(line.strip())
S1 = [float(x) for x in data[0].split()]
S2 = [float(x) for x in data[1].split()]

#calculate Minkowski difference of S1 and S2
min_diff = []
for i in S1:
    for j in S2:
        diff = round(i - j, 5)
        min_diff.append(diff)

#find most common value and its frequency
x = Counter(min_diff).most_common(1)

print(x[0][1])
print(abs(x[0][0]))    #the problem asks for the absolute value of x

Thursday 15 September 2016

Inferring Protein from Spectrum

In this problem we are given a prefix spectrum and are asked to infer a protein sequence from it using the monoisotopic mass table.

Sample Dataset
3524.8542
3710.9335
3841.974
3970.0326
4057.0646

Expected Output
WMQS

This was a fairly simple problem to solve, and I guess the tricky part was getting the rounding right. In order to compare the differences between consecutive masses in the dataset with the values in the mass table, we need to round all numbers to four decimals, otherwise they will differ slightly. The code below can also be found on my GitHub.

L = [float(line) for line in open('rosalind_spec.txt','r')]

mass_table = {'A':71.03711,'C':103.00919,'D':115.02694,'E':129.04259,'F':147.06841,'G':57.02146,'H':137.05891,'I':113.08406,'K':128.09496,'L':113.08406,'M':131.04049,'N':114.04293,'P':97.05276,'Q':128.05858,'R':156.10111,'S':87.03203,'T':101.04768,'V':99.06841,'W':186.07931,'Y':163.06333}

aa_masses = []
for i in range(len(L) - 1):
    aa_mass = round(L[i + 1] - L[i], 4)
    aa_masses.append(aa_mass)

rnd_mass_table = {}
for k, v in mass_table.items():
    rnd_mass_table[round(v, 4)] = k

prot = ''
for aa in aa_masses:
    prot += rnd_mass_table[aa]

print(prot)

Interleaving Two Motifs

This time we are asked to find the shortest common supersequence of two given sequences. The shortest common supersequence is the shortest sequence that contains both of the given sequences as subsequences.

Sample Dataset
ATCTGAT
TGCATA

Expected Output
ATGCATGAT

If multiple solutions exist we are told to return any of them. To solve this problem, the easiest approach is to pick out the longest common subsequence (lcs) of the two given sequences and then weave in the nucleotides that are missing from each sequence. To find the lcs, I reused most of the code I wrote for the problem Finding a Shared Spliced Motif. The resulting code can be found on my GitHub as well as below:

def lcs(s,t):
    lengths = [[0 for j in range(len(t) + 1)] for i in range(len(s) + 1)]
    for i, x in enumerate(s):
        for j, y in enumerate(t):
            if x == y:
                lengths[i + 1][j + 1] = lengths[i][j] + 1
            else:
                lengths[i + 1][j + 1] = max(lengths[i + 1][j], lengths[i][j + 1])
    spliced_motif = ''
    x, y = len(s), len(t)
    while x * y != 0:
        if lengths[x][y] == lengths[x - 1][y]:
            x -= 1
        elif lengths[x][y] == lengths[x][y - 1]:
            y -= 1
        else:
            spliced_motif = s[x - 1] + spliced_motif
            x -= 1
            y -= 1
    return(spliced_motif)


def scs(s,t):
    subseq = lcs(s,t)
    superseq = ''
    i, j = 0, 0
    for nt in subseq:
        if i < len(s):
            while s[i] != nt:
                superseq += s[i]
                i += 1
            i += 1
        if j < len(t):
            while t[j] != nt:
                superseq += t[j]
                j += 1
            j += 1
        superseq += nt
    if i < len(s):
        superseq += s[i:]
    if j < len(t):
        superseq += t[j:]
    return(superseq)

s, t = [line.strip() for line in open('rosalind_scsp.txt','r')]
print(scs(s,t))

Wednesday 14 September 2016

Distances in Trees

In this problem we are looking at the Newick format and how to find the distance between two nodes in a phylogenetic tree. We are given a file containing trees in Newick format and two nodes for each tree, and are asked to find the distance between those nodes.

Sample Dataset
(cat)dog;
dog cat

(dog,cat);
dog cat

Expected Output
1 2

I remember working with the Newick format before, in one of the bioinformatics courses I took, so when I started working on this problem I recalled that there were functions for the format available in Biopython. So I had a look at the documentation, and sure enough, there is a function called distance that would be suitable. However, as always, there was a slight problem. The trees given by Rosalind did not contain any branch lengths, which is what the distance function uses to calculate the distance between two nodes. To be able to use this function I therefore had to assign every branch a length of 1 (done in the loop over the clades below). The following code (also available on GitHub here) yielded a result accepted by Rosalind:

import sys
from Bio import Phylo
import io

#open file and parse data
f = open('rosalind_nwck.txt','r')
pairs = [i.split('\n') for i in f.read().strip().split('\n\n')]

#for each pair:
#-parse data further with Biopython
#-add branch length 1 to all branches
#-use Biopython's Phylo distance function to get the distances
#-print result in the requested format

for i, line in pairs:
    x,y = line.split()
    tree = Phylo.read(io.StringIO(i),'newick')
    clades = tree.find_clades()
    for clade in clades:
        clade.branch_length = 1
    sys.stdout.write('%s' % tree.distance(x,y) + ' ')
sys.stdout.write('\n')

Thursday 8 September 2016

Introduction to Set Operations

In this problem we are introduced to set operations. After finishing a course in algebra and geometry almost six years ago, I thought that I'd never have to work with set theory again. Apparently I was wrong. Fortunately, at least in this problem, it turned out to be much easier than I remembered.

For two sets A and B, both subsets of U (where U is the set {1, 2, ..., n} for a given n), we are asked to find the following six sets: A∪B, A∩B, A−B, B−A, the complement of A and the complement of B (both with respect to U).

Sample Dataset
10
{1, 2, 3, 4, 5}
{2, 8, 5, 10}

Expected Output
{1, 2, 3, 4, 5, 8, 10}
{2, 5}
{1, 3, 4}
{8, 10}
{8, 9, 10, 6, 7}
{1, 3, 4, 6, 7, 9}

As it turns out, Python has built in set operators, which made this problem a lot easier to solve. The following code is what I came up with. You can also find it on my GitHub.

#read file and format data into sets of integers
data = []
with open('rosalind_seto.txt','r') as f:
    for line in f:
        data.append(line.strip('\n'))
U = set(range(1, int(data[0])+1))
A = set([int(x) for x in data[1].strip('{}').split(',')])
B = set([int(x) for x in data[2].strip('{}').split(',')])

#perform set operations
union = A.union(B)
intersection = A.intersection(B)
diff_AB = A-B
diff_BA = B-A
A_comp = U-A
B_comp = U-B

#print answers
print(union)
print(intersection)
print(diff_AB)
print(diff_BA)
print(A_comp)
print(B_comp)

Wednesday 7 September 2016

Learning Git

Learning Git is something that has been on my to-do-list for quite a while now, and I have finally gotten around to it. I have been following this tutorial, which includes some basic commands and also some more advanced stuff towards the end. I also created an account on GitHub where I will start adding my solutions to the Rosalind problems.

To try it out, I made a repository and pushed my solution to "Introduction to Alternative Splicing" into it. And immediately ran into a problem. I got the following error message:

! [rejected]        master -> master (fetch first)
error: failed to push some refs to 'https://github.com/ssjunnebo/Rosalind.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details

A quick bit of Googling told me that the problem was that when I created the repository on GitHub, I added a README, and since this wasn't available locally, I first needed to merge the GitHub repository with the folder on my computer. I did this using the following command:

> git pull origin master

 I was then able to push my first piece of code onto GitHub!



Thursday 25 August 2016

Expected Number of Restriction Sites

This problem is very similar to "Introduction to Random Strings". As in that problem, we are given a string, s, and an array, A, containing some GC contents. However, in this case we are also given an integer, n, representing the length of a second string, t, which is a random string formed with each given GC content, and we are asked to find the expected number of times s appears as a substring of t. What we need to realise to solve this problem is that the number of positions at which s could start in t is n-len(s)+1. By linearity of expectation, the expected number of occurrences is the probability of randomly forming s at a single position (with the specified GC content), added up over all of those positions, i.e. multiplied by n-len(s)+1.

Sample Dataset
10
AG
0.25 0.5 0.75

Expected Output
0.422 0.563 0.422

The following is the code I wrote to solve the problem. While writing it I became aware that there is a difference in how Python 3 and Python 2 handle rounding of 0.5: Python 2 rounds it up (away from zero), while Python 3 rounds to the nearest even value, so 0.5 becomes 0. On an earlier problem I had to run my program using Python 2 because the answer I got from Python 3 wasn't accepted by Rosalind. In this case, however, the dataset I got yielded the same answer regardless of which version I used (suggesting that no 0.5 roundings occurred in this dataset). (Note: if the following code is to be run with Python 2, the formatting of the output needs to be rewritten for it to work.)

data = []
with open('rosalind_eval.txt', 'r') as f:
    for line in f:
        data.append(line.strip('\n'))
n = int(data[0])
s = data[1]
A = [float(x) for x in data[2].split()]

AT, GC = 0, 0
for nt in s:
    if nt == 'A' or nt == 'T':
        AT += 1
    elif nt == 'G' or nt == 'C':
        GC += 1

B = [None]*len(A)
for i, j in enumerate(A):
    P = (((1 - j)/2)**AT)*((j/2)**GC)*(n - len(s)+1)
    B[i] = '%0.3f' % P
print(*B, sep=' ')

Wednesday 24 August 2016

Edit Distance

In this problem we are looking for the Levenshtein distance between two protein sequences. The distance is calculated as the number of single character changes (insertion, deletion or substitution) needed to turn one sequence into the other. One way to do this is described in the Wikipedia page and involves building a matrix of distances between the prefixes of the two strings. The last value to be calculated (in the bottom right position of the matrix) then equals the distance between the two strings.

Sample Dataset
>Rosalind_39
PLEASANTLY
>Rosalind_11
MEANLY

Expected Output
5

The following Python 3 code initialises a matrix M with len(s)+1 rows and len(t)+1 columns. It then fills in the matrix based on whether a change is needed between the two prefixes and, if so, what type. The value in the bottom-right corner is then printed as the answer.

from Bio import SeqIO
seqs = []
with open('sampledata.fasta', 'r') as f:
    for record in SeqIO.parse(f, 'fasta'):
        seqs.append(str(record.seq))
s = seqs[0]
t = seqs[1]
M = [[0 for j in range(len(t) + 1)] for i in range(len(s) + 1)]
for i in range(1, len(s) + 1):
    M[i][0] = i
for i in range(1, len(t) + 1):
    M[0][i] = i

for i in range(1, len(s) + 1):
    for j in range(1, len(t) + 1):
        if s[i - 1] == t[j - 1]:
            M[i][j] = M[i - 1][j - 1]
        else:
            M[i][j] = min(M[i - 1][j] + 1, M[i][j - 1] + 1,
                          M[i - 1][j - 1] + 1)
print(M[len(s)][len(t)])

Introduction to Alternative Splicing

In this problem we continue looking at subsets. This time we imagine the elements as exons and want to know how many different combinations we can form from subsets of the set. We are given two integers n and m, which are the size of the set ([1,2,3,4,5,6] in the sample case) and the minimum size of the subsets to be formed. The order of the exons cannot be altered, so for n=3 and m=2 we would get the four subsets [1,2], [1,3], [2,3] and [1,2,3]. The answer should be given modulo 1000000.

Sample Dataset
6 3

Expected Output
42

The problem can quite easily be solved by counting all the different combinations of each size from m to n, and then summing them to form the final answer. The number of combinations can be calculated as n!/(k!(n-k)!), where k is the size of the subset. In Python 3 code this can look like:

from math import factorial
n, m = 6, 3
subsets = 0
for k in range(m, n + 1):
    subsets += (factorial(n) // (factorial(k) * factorial(n - k)))
print(subsets % 1000000)

At first I used / instead of //, which worked fine for the sample dataset, but when I downloaded another dataset the numbers were bigger, too big for float division, and I got the error message "OverflowError: integer division result too large for a float". A quick Google search took me to this thread on Stack Overflow, which solved my problem.

Tuesday 23 August 2016

Counting Subsets

In this problem we are introduced to set theory (causing flashbacks from the linear algebra course...). For a set {1,2,3,...,n}, we are asked to return the number of subsets of the set. What we need to remember is that the empty set {} and the whole set {1,2,3,...,n} are also subsets (aptly explained here). So, to find the total number of subsets, all we need to do is calculate 2^n. Since this number quickly gets very large, the problem states that we should return the answer modulo 1 000 000.

Sample Dataset
3

Expected Output
8

The following program is probably the shortest one yet:

n = 3                      
print(2**n % 1000000)      

Or, if you like, a fancier version where the user can input n in the terminal:

n = int(input('Enter n: '))
print(2**n % 1000000)      

Monday 22 August 2016

Matching Random Motifs

This problem is very similar to “Introduction to Random Strings”. In this case, however, we are given a string s, and are asked to calculate the probability of randomly forming s at least once when generating N strings of the same length as s with a GC content of x.

Sample Dataset
90000 0.6
ATAGCCGA

Sample Output
0.689

To calculate the probability of randomly getting a string that equals s when constructing it with GC content x, we can use the same equation as in “Introduction to Random Strings”. To go from this probability (let's call it s_prob) to the probability of getting at least one string equal to s, we use the following equation:

P(at least 1 match of s) = 1 − P(no matches out of N strings) = 1 − [1 - s_prob]^N

The following program makes the above calculations and outputs the answer with three significant figures:

N = 90000
x = 0.6
s = 'ATAGCCGA'
AT = 0
GC = 0
for nt in s:
    if nt == 'A' or nt == 'T':
        AT += 1
    elif nt == 'G' or nt == 'C':
        GC += 1

s_prob = (((1 - x) / 2)**AT) * (((x) / 2)**GC)
prob = 1 - (1 - s_prob)**N
print('%0.3f' % prob)

Wednesday 17 August 2016

Creating a Distance Matrix

This problem was fairly similar to "Counting Point Mutations" and "Error Correction in Reads". We are given a set of DNA strings and are asked to return the distance matrix of the strings. The distance between two given strings can be calculated by dividing the Hamming distance (i.e. the number of positions at which the two strings differ) by the length of the sequences (given that all sequences are of equal length). These values should then be printed in matrix form with five decimal places.

Sample Dataset
>Rosalind_9499
TTTCCATTTA
>Rosalind_0942
GATTCATTTC
>Rosalind_6568
TTTCCATTTT
>Rosalind_1833
GTTCCATTTA

Expected Output
0.00000 0.40000 0.10000 0.10000
0.40000 0.00000 0.40000 0.30000
0.10000 0.40000 0.00000 0.20000
0.10000 0.30000 0.20000 0.00000

To write this program I reused parts of the code from "Counting Point Mutations" and "Error Correction in Reads" and modified it to suit this problem. The following is the final code, which took me only 20 minutes to write. Hurray!

from Bio import SeqIO
reads = []
with open('sampledata.fasta', 'r') as f:
    for record in SeqIO.parse(f, 'fasta'):
        reads.append(str(record.seq))

read_len = len(reads[0])
for curr_read in reads:
    distance = []
    for comp_read in reads:
        hamming = 0
        for nt1, nt2 in zip(curr_read, comp_read):
            if nt1 != nt2:
                hamming += 1
        distance.append(str.format('{0:.5f}', hamming / read_len))
    print(*distance, sep=' ')

Tuesday 16 August 2016

Maximum Matchings and RNA Secondary Structures

This problem is very similar to the problem presented in "Perfect Matchings and RNA Secondary Structures". The difference is that in this problem we don't have the same number of A as U and G as C, so we don't get any perfect matchings. Instead we are given the task to find the total possible number of maximum matchings for a given RNA string.

Sample Dataset
>Rosalind_92
AUGCUUC

Expected Output
6

As in "Perfect Matchings and RNA Secondary Structures" the programming is very simple and it all comes down to the maths. We can simplify the problem by looking at AU and GC separately, and can describe the number of maximum matches for each part as max(A,U)!/(max(A,U)-min(A,U))! and max(G,C)!/(max(G,C)-min(G,C))!, respectively. To get the total number of maximum matches, simply multiply the answers.

For the first dataset I tried, I ran the program using Python 3.5 and got the result 434412400153932848464251158517855537428692992, which Rosalind deemed incorrect. After seeing a tip online, I ran the program using Python 2.7 and got 434412400153932864602133769790674698240000000 instead. So I tried my program with a new dataset, ran it with Python 2.7, and this time my answer was accepted. The culprit is the division in the factorial ratio: in Python 3, / is float division, which loses precision for integers this large, while Python 2's / on integers stays exact. Using integer division (//) avoids the problem in both versions.

Here follows the code I wrote to solve this problem (with // for the division, so it gives an exact result in both 2.7 and 3.5):

from Bio import SeqIO
from math import factorial
sequence = ''
with open('sampledata.fasta', 'r') as f:
    for record in SeqIO.parse(f, 'fasta'):
        sequence = str(record.seq)

A, U, G, C = 0, 0, 0, 0
for nt in sequence:
    if nt == 'A':
        A += 1
    elif nt == 'U':
        U += 1
    elif nt == 'G':
        G += 1
    elif nt == 'C':
        C += 1

AU = factorial(max(A, U)) // factorial(max(A, U) - min(A, U))
GC = factorial(max(G, C)) // factorial(max(G, C) - min(G, C))
print(AU * GC)

Monday 15 August 2016

Ordering Strings of Varying Length Lexicographically

In this problem we are given a string representing an alphabet. We are also given an integer n ≤ 4. We are to return a list of all strings of length at most n that can be formed from the alphabet string. Repeats should be included and the strings should be sorted lexicographically according to the given alphabet (and this time they really should, as opposed to in “Enumerating k-mers Lexicographically”).

Sample Dataset
D N A
3

Expected Output
D
DD
DDD
DDN
DDA
DN
DND
DNN
DNA
DA
DAD
DAN
DAA
N
ND
NDD
NDN
NDA
NN
NND
NNN
NNA
NA
NAD
NAN
NAA
A
AD
ADD
ADN
ADA
AN
AND
ANN
ANA
AA
AAD
AAN
AAA

To begin with, this problem is fairly similar to “Enumerating k-mers Lexicographically”, so I started by having a look at what I wrote there. As in that problem, I decided to use itertools, but this time, since the answer should include repeats, I used product instead. I also put the creation of the permutations into a for loop in order to get all the different lengths, as you can see below. This generated a nested list, which I then flattened into a single list using the itertools function chain(). Then all that was left to do was to sort the permutations and print them to a file. In order to sort them, I used sorted() with a lambda function as the key. Here is an explanation of lambda functions in Python. The final code can be seen below:

import itertools
n = 3
alphabet = 'DNA'
perm = []
for i in range(1, n + 1):
    perm.append(list(map(''.join, (itertools.product(alphabet, repeat=i)))))
permutations = list(itertools.chain(*perm))
srt_perm = sorted(permutations,
                  key=lambda word: [alphabet.index(c) for c in word])
with open('answer.txt', 'a') as f:
    for j in srt_perm:
        f.write('%s\n' % j)

Tuesday 9 August 2016

Finding a Shared Spliced Motif

In this problem we are asked to find the longest common subsequence of two DNA-sequences, s and t. As opposed to a substring, a subsequence does not need to occur contiguously in s and t. To solve this problem it might be a good idea to have a look at the Wikipedia page on the longest common subsequence problem.

Sample Dataset
>Rosalind_23
AACCTTGG
>Rosalind_64
ACACTGTGA

Expected Output
AACTGG (or any other if there are multiple subsequences of the same length)

The following is a solution to the problem using dynamic programming:

from Bio import SeqIO
sequences = []
handle = open('sampledata.fasta', 'r')
for record in SeqIO.parse(handle, 'fasta'):
    sequences.append(str(record.seq))
s = sequences[0]
t = sequences[1]

lengths = [[0 for j in range(len(t) + 1)] for i in range(len(s) + 1)]
#creates a (len(s)+1) x (len(t)+1) array filled with 0
for i, x in enumerate(s):
    for j, y in enumerate(t):
        if x == y:
            lengths[i + 1][j + 1] = lengths[i][j] + 1
        else:
            lengths[i + 1][j + 1] = max(lengths[i + 1][j], lengths[i][j + 1])

spliced_motif = ''
x, y = len(s), len(t)
while x * y != 0:
    if lengths[x][y] == lengths[x - 1][y]:
        x -= 1
    elif lengths[x][y] == lengths[x][y - 1]:
        y -= 1
    else:
        spliced_motif = s[x - 1] + spliced_motif
        x -= 1
        y -= 1
print(spliced_motif)

Monday 8 August 2016

Speeding Up Motif Finding

In this problem we are looking at how to speed up motif finding using the Knuth-Morris-Pratt algorithm. A good explanation of the algorithm can be found here. We are given a DNA string and are expected to return the failure array for the string.

Sample Dataset
>Rosalind_87
CAGCATGGTATCACAGCAGAG

Expected Output
0 0 0 1 2 0 0 0 0 0 0 1 2 1 2 3 4 5 3 0 0

The following code is what I ended up with:

from Bio import SeqIO
record = SeqIO.read('sampledata.fasta', 'fasta')
sequence = list(record.seq)

F_array = [0] * len(sequence)
k = 0
for i in range(2, len(sequence) + 1):
    while k > 0 and sequence[k] != sequence[i - 1]:
        k = F_array[k - 1]
    if sequence[k] == sequence[i - 1]:
        k += 1
    F_array[i - 1] = k

with open('array.txt', 'w') as answer:
    answer.write(' '.join(map(str, F_array)))

Friday 5 August 2016

Code School - Try R

I first encountered R during one of my master's courses at KTH. The course was in bioinformatics and biostatistics and contained a practical part during which we got to do some statistical analysis and data visualization using R. It has been a while since I last used R, so I felt that a recap of the language would be a good idea. 

The top hit I got on Google was the course Try R from Code School. The course is very basic and would suit someone who is a complete beginner, or, like me, just wants a quick recap. The commands are entered in the browser and there is no need to install R on your computer for this course. It goes through the basic commands in R, different data structures like vectors and matrices, visualisation and some statistics. 

I worked through this tutorial pretty quickly and a lot of what I had learned previously came back to me. I installed the latest version of R on my computer with a little help from this site and will continue with some more advanced tutorials now.

Thursday 4 August 2016

k-Mer Composition

In this problem we are given a DNA string and are asked to return an array consisting of the counts of every possible 4-mer in the string, ordered alphabetically.

Sample Dataset
>Rosalind_6431
CTTCGAAAGTTTGGGCCGAGTCTTACAGTCGGTCTTGAAGCAAAGTAACGAACTCCACGGCCCTGACTACCGAACCAGTTGTGAGTACTCAACTGGGTGAGAGTGCAGTCCCTATTGAGTTTCCGAGACTCACCGGGATTTTCGATCCAGCCTCAGTCCAGTCTTGTGGCCAACTCACCAAATGACGTTGGAATATCCCTGTCTAGCTCACGCAGTACTTAGTAAGAGGTCGCTGCAGCGGGGCAAGGAGATCGGAAAATGTGCTCTATATGCGACTAAAGCTCCTAACTTACACGTAGACTTGCCCGTGTTAAAAACTCGGCTCACATGCTGTCTGCGGCTGGCTGTATACAGTATCTA
CCTAATACCCTTCAGTTCGCCGCACAAAAGCTGGGAGTTACCGCGGAAATCACAG

Expected Output
4 1 4 3 0 1 1 5 1 3 1 2 2 1 2 0 1 1 3 1 2 1 3 1 1 1 1 2 2 5 1 3 0 2 2 1 1 1 1 3 1 0 0 1 5 5 1 5 0 2 0 2 1 2 1 1 1 2 0 1 0 0 1 1 3 2 1 0 3 2 3 0 0 2 0 8 0 0 1 0 2 1 3 0 0 0 1 4 3 2 1 1 3 1 2 1 3 1 2 1 2 1 1 1 2 3 2 1 1 0 1 1 3 2 1 2 6 2 1 1 1 2 3 3 3 2 3 0 3 2 1 1 0 0 1 4 3 0 1 5 0 2 0 1 2 1 3 0 1 2 2 1 1 0 3 0 0 4 5 0 3 0 2 1 1 3 0 3 2 2 1 1 0 2 1 0 2 2 1 2 0 2 2 5 2 2 1 1 2 1 2 2 2 2 1 1 3 4 0 2 1 1 0 1 2 2 1 1 1 5 2 0 3 2 1 1 2 2 3 0 3 0 1 3 1 2 3 0 2 1 2 2 1 2 3 0 1 2 3 1 1 3 1 0 1 1 3 0 2 1 2 2 0 2 1 1

Regardless of the given sequence, the possible 4-mers are always the same and we can generate them with the following code:

import itertools
nt = 'ACGT'        #Use this order of nt to get correct order later without sorting
permutations = itertools.product(nt, repeat=4)

kmers = []
for p in permutations:
    kmers.append(''.join(p))

This gives us a list of all the possible 4-mers in alphabetical order (note that itertools.product generates the tuples in the order the letters are listed in nt, so if nt is not entered alphabetically you will need to sort kmers).

Now we can extract the sequence from the FASTA file and use regex to find all the occurrences of the k-mers. Remember to use ?= in the pattern to include overlapping k-mers:

import re
from Bio import SeqIO

record = SeqIO.read('sampledata.fasta', 'fasta')
sequence = record.seq

A = []
for k in kmers:
    pattern = re.compile(r'(?=(' + k + '))')
    matches = re.findall(pattern, str(sequence))
    A.append(len(matches))
print(*A, sep=' ')