Sara does Bioinformatics: September 2016

Friday 23 September 2016

Learning computer science

So far I have managed to solve 52 problems on Rosalind, and I feel like I have become a lot better at programming since I started. But recently the problems on Rosalind have become more and more complex, and I have sometimes had a hard time solving them or even understanding the solutions others have come up with. The main reason for this is my lack of a background in computer science. So I have decided to make up for this, starting with studying the basics of computer science.

I wanted to learn the basics of computer science, but I also wanted the focus to be on Python, which lead me to the interactive book "How to Think Like a Computer Scientist". Yesterday I completed chapter 1, and today I will continue with chapter 2. So far it's pretty basic, but even so I have already learned quite a bit! When I'm finished I will try to find a more formal online course to take.

Monday 19 September 2016

Comparing Spectra with the Spectral Convolution

In this task we are given two multisets, S1 and S2, each representing simplified spectra taken from two peptides. We are asked to find the largest multiplicity of the Minkowski difference (S1⊖S2) and the absolute value of the number x maximizing (S1⊖S2)(x). It took me a while to figure out what I was meant to find, but ultimately I figured that it was the number that occurs most frequently in the multiset formed by S1⊖S2, and its frequency.

Sample Dataset
186.07931 287.12699 548.20532 580.18077 681.22845 706.27446 782.27613 968.35544 968.35544
101.04768 158.06914 202.09536 318.09979 419.14747 463.17369

Expected Output
3
85.03163

Once I had realized what I was after, The programming itself became very easy. All I had to do was generate the multiset formed by S1⊖S2 and pick out the most frequently occurring number. As it turns out, there is a really useful built in library for this called collections. Using the function called most_common, I got precisely what I wanted. Have a look at the code below or on my GitHub.

from collections import Counter

data = []
with open('rosalind_conv.txt','r') as f:
for line in f:
data.append(line.strip())
S1 = [float(x) for x in data[0].split()]
S2 = [float(x) for x in data[1].split()]

#calculate Minkowski difference of S1 and S2
min_diff = []
for i in S1:
for j in S2:
diff = round(i - j, 5)
min_diff.append(diff)

#find most common value and its frequency
x = Counter(min_diff).most_common(1)

print(x[0][1])
print(x[0][0])

Thursday 15 September 2016

Inferring Protein from Spectrum

In this problem we are given a prefix spectrum and are asked to infer a protein sequence from it using the monoisotopic mass table.

Sample Dataset
3524.8542
3710.9335
3841.974
3970.0326
4057.0646

Expected Output
WMQS

This was a fairly simple problem to solve, and I guess the tricky part was to get the rounding right. In order to compare the results from the numbers in the dataset with the numbers in the mass table, we need to round all numbers to four decimals, otherwise they will differ slightly. The code below can also be found on my Github.

L = [float(line) for line in open('rosalind_spec.txt','r')]

mass_table = {'A':71.03711,'C':103.00919,'D':115.02694,'E':129.04259,'F':147.06841,'G':57.02146,'H':137.05891,'I':113.08406,'K':128.09496,'L':113.08406,'M':131.04049,'N':114.04293,'P':97.05276,'Q':128.05858,'R':156.10111,'S':87.03203,'T':101.04768,'V':99.06841,'W':186.07931,'Y':163.06333}

aa_masses = []
for i in range(len(L) - 1):
aa_mass = round(L[i + 1] - L[i], 4)
aa_masses.append(aa_mass)

rnd_mass_table = {}
for k, v in mass_table.items():
rnd_mass_table[round(v, 4)] = k

prot = ''
for aa in aa_masses:
prot += rnd_mass_table[aa]

print(prot)

Interleaving Two Motifs

This time we are asked to find the shortest common supersequence of two given sequences. The shortest common supersequence is the shortest sequence that contains both of the given sequences as subsequences.

Sample Dataset
ATCTGAT
TGCATA

Expected Output
ATGCATGAT

If multiple solutions exist we are told to return any of them. To solve this problem, the easiest thing to do is to pick out the longest common subsequence (lcs) of the two given sequences and then add in the nucleotides that are missing from each sequence. To find the lcs, I reused most of the code I wrote for the problem Finding a Shared Spliced Motif. The resulting code can be found on my Github as well as below:

def lcs(s,t):
lengths = [[0 for j in range(len(t) + 1)] for i in range(len(s) + 1)]
for i, x in enumerate(s):
for j, y in enumerate(t):
if x == y:
lengths[i + 1][j + 1] = lengths[i][j] + 1
else:
lengths[i + 1][j + 1] = max(lengths[i + 1][j], lengths[i][j + 1])
spliced_motif = ''
x, y = len(s), len(t)
while x * y != 0:
if lengths[x][y] == lengths[x - 1][y]:
x -= 1
elif lengths[x][y] == lengths[x][y - 1]:
y -= 1
else:
spliced_motif = s[x - 1] + spliced_motif
x -= 1
y -= 1
return(spliced_motif)

def scs(s,t):
subseq = lcs(s,t)
superseq = ''
i, j = 0, 0
for nt in subseq:
if i < len(s):
while s[i] != nt:
superseq += s[i]
i += 1
i += 1
if j < len(t):
while t[j] != nt:
superseq += t[j]
j += 1
j += 1
superseq += nt
if i < len(s):
superseq += s[i:]
if j < len(t):
superseq += t[j:]
return(superseq)

s, t = [line.strip() for line in open('rosalind_scsp.txt','r')]
print(scs(s,t))

Wednesday 14 September 2016

Distances in Trees

In this problem we are looking att the Newick format and how to find the distance between two nodes in a phylogenetic tree. We are given a file containing trees in Newick format and two nodes for each tree, and are asked to find the distance between those nodes.

Sample Dataset
(cat)dog;
dog cat

(dog,cat);
dog cat

Expected Output
1 2

I remember working with the Newick format before, in one of the bioinformatics courses I took, so when I started working on this problem I recalled that there were functions for the format available in Biopython. So I had a look at the documentation, and sure enough, there is a function called distance that would be suitable. However, as always, there was a slight problem. The trees given by Rosalind did not contain any branch lengths, which is what the distance function uses to calculate the distance between two nodes. To enable using this function I therefor had to assign the branches a length of 1 (done on rows 18-21). The following code (also available on Github here) yielded a result accepted by Rosalind:

import sys
from Bio import Phylo
import io

#open file and parse data
f = open('rosalind_nwck.txt','r')
pairs = [i.split('\n') for i in f.read().strip().split('\n\n')]

#for each pair:
#-parse data further with biopython
#-add branch length 1 to all branches
#-use bioputhons Phylo distance funktion to get distances
#-print result on requested format

for i, line in pairs:
x,y = line.split()
tree = Phylo.read(io.StringIO(i),'newick')
clades = tree.find_clades()
for clade in clades:
clade.branch_length = 1
sys.stdout.write('%s' % tree.distance(x,y) + ' ')
sys.stdout.write('\n')

Thursday 8 September 2016

Introduction to Set Operations

In this problem we are introduced to set operations. After finishing a course in algebra and geometry almost six years ago, I thought that I'd never have to work with set theory again. Apparently I was wrong. Fortunately, at least in this problem, it turned out to be much easier than I remembered.

For a set A and B, that are both subsets of U (where U is the set {1, 2, ..., n} for a given n), we are asked to find the following six sets: A∪B, A∩B, A−B, B−A, the set complement of A and the set complement of B (with respect to U).

Sample Dataset
10
{1, 2, 3, 4, 5}
{2, 8, 5, 10}

Expected Output
{1, 2, 3, 4, 5, 8, 10}
{2, 5}
{1, 3, 4}
{8, 10}
{8, 9, 10, 6, 7}
{1, 3, 4, 6, 7, 9}

As it turns out, Python has built in set operators, which made this problem a lot easier to solve. The following code is what I came up with. You can also find it on my GitHub.

#read file and format data into sets of integers
data = []
with open('rosalind_seto.txt','r') as f:
for line in f:
data.append(line.strip('\n'))
U = set(range(1, int(data[0])+1))
A = set([int(x) for x in data[1].strip('{}').split(',')])
B = set([int(x) for x in data[2].strip('{}').split(',')])

#perform set operations
union = A.union(B)
intersection = set(A).intersection(B)
diff_AB = A-B
diff_BA = B-A
A_comp = U-A
B_comp = U-B

#print answers
print(union)
print(intersection)
print(diff_AB)
print(diff_BA)
print(A_comp)
print(B_comp)

A^{c}

Wednesday 7 September 2016

Learning Git

Learning Git is something that has been on my to-do-list for quite a while now, and I have finally gotten around to it. I have been following this tutorial, which includes some basic commands and also some more advanced stuff towards the end. I also created an account on GitHub where I will start adding my solutions to the Rosalind problems.

To try it out, I made a repository and pushed my solution to "Introduction to Alternative Splicing" into it. And immediately ran into a problem. I got the following error message:

! [rejected] master -> master (fetch first)
error: failed to push some refs to 'https://github.com/ssjunnebo/Rosalind.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details

A quick bit of Google-ing told me that the problem was that when I created the repository on GitHub, I added a README, and since this wasn't available locally on my computer I first needed to merge the GitHub repository with the folder on my computer. I did this using the following command:

> git pull origin master

I was then able to push my first piece of code onto GitHub!