Sara does Bioinformatics: August 2016

Thursday 25 August 2016

Wednesday 24 August 2016

Edit Distance

In this problem we are looking for the Levenshtein distance between two protein sequences. The distance is calculated as the number of single character changes (insertion, deletion or substitution) needed to turn one sequence into the other. One way to do this is described in the Wikipedia page and involves building a matrix of distances between the prefixes of the two strings. The last value to be calculated (in the bottom right position of the matrix) then equals the distance between the two strings.

Sample Dataset
>Rosalind_39
PLEASANTLY
>Rosalind_11
MEANLY

Expected Output
5

The following python3 code initiates a matrix M with len(s) rows and len(t) columns. It then fills the matrix based on if and what type of change is needed between the two strings. The last value is then printed as the answer.

from Bio import SeqIO
seqs = []
with open('sampledata.fasta', 'r') as f:
for record in SeqIO.parse(f, 'fasta'):
seqs.append(str(record.seq))
s = seqs[0]
t = seqs[1]
M = [[0 for j in range(len(t) + 1)] for i in range(len(s) + 1)]
for i in range(1, len(s) + 1):
M[i][0] = i
for i in range(1, len(t) + 1):
M[0][i] = i

for i in range(1, len(s) + 1):
for j in range(1, len(t) + 1):
if s[i - 1] == t[j - 1]:
M[i][j] = M[i - 1][j - 1]
else:
M[i][j] = min(M[i - 1][j] + 1, M[i][j - 1] + 1,
M[i - 1][j - 1] + 1)
print(M[len(s)][len(t)])

Introduction to Alternative Splicing

In this problem we continue looking at subsets. This time we imagine them as exons and want to know how many different combinations we can form from subsets of the set. We are given two integers n and m, which are the length of the set ([1,2,3,4,5,6] in the sample case) and the minimum length of the subsets to be formed. The order of the exons cannot be altered, so for n=3 and m=2 we would get the four subsets [1,2], [1,3], [2,3] and [1,2,3]. The answer should be given modulo 1000000.

Sample Dataset
6 3

Expected Output
42

The problem can quite easily be solved by counting all the different combinations of each length from m to n, and then summarising them to form the final answer. The number of combinations can be calculated as n!/(k!(n-k)!), where k is the length of the subset. In python3 code this can look like:

from math import factorial
n, m = 6, 3
subsets = 0
for k in range(m, n + 1):
subsets += (factorial(n) // (factorial(k) * factorial(n - k)))
print(subsets % 1000000)

At first I used / instead of //, which worked fine for the sample dataset, but when I downloaded another dataset the numbers were bigger and therefore they got too big for float division and I got the error message "OverflowError: integer division result too large for a float". A quick Google search took me to this thread on Stack Overflow that solved my problem.

Tuesday 23 August 2016

Counting Subsets

In this problem we are introduced to set theory (causing flashbacks from the linear algebra course...). For a set {1,2,3,...,n}, we are asked to return the number of subsets of the set. What we need to remember is that the empty set {} and the whole set {1,2,3,...,n} are also subsets (aptly explained here). So, to find the total number of subsets, all we need to do is calculate 2^n. Since this number quickly gets very large, the problem states that we should return the answer modulo 1 000 000.

Sample Dataset
3

Expected Output
8

The following program is probably the shortest one yet:

n = 3

print(2**n % 1000000)

Or, if you like, a fancier version where the user can input n in the terminal:

n = int(input('Enter n: '))

print(2**n % 1000000)

Monday 22 August 2016

Matching Random Motifs

This problem is very similar to “Introduction to Random Strings”. In this case however, we are given a sting s, and are asked to calculate the probability of randomly forming s when generating N nr of stings of equal length of s and a GC content of x.

Sample Dataset
90000 0.6
ATAGCCGA

Sample Output
0.689

To calculate the probability of randomly getting a string that equals s when constructing it with GC content x, we can use the same equation as in introduction to random strings. To go from this probability (lets call it s_prob) to the probability of getting at least one string that is equal to s we use the following equation:

P(at least 1 match of s) = 1 − P(no matches out of N strings) = 1 − [1 - s_prob]^N

The following program makes the above calculations and outputs the answer with three significant figures:

N = 90000
x = 0.6
s = 'ATAGCCGA'
AT = 0
GC = 0
for nt in s:
if nt == 'A' or nt == 'T':
AT += 1
elif nt == 'G' or nt == 'C':
GC += 1

s_prob = (((1 - x) / 2)**AT) * (((x) / 2)**GC)
prob = 1 - (1 - s_prob)**N
print('%0.3f' % prob)

Wednesday 17 August 2016

Creating a Distance Matrix

This problem was fairly similar to "Counting Point Mutations" and "Error Correction in Reads". We are given a set of DNA strings and are asked to return the distance matrix of the strings. The distance between two given strings can be calculated by dividing the Hamming distance (i.e. the number of nucleotides that differ between the stings) with the length of the sequences (given that all sequences are of equal length). These values should then be printed on matrix form with 5 significant figures.

Sample Dataset

>Rosalind_9499

TTTCCATTTA

>Rosalind_0942

GATTCATTTC

>Rosalind_6568

TTTCCATTTT

>Rosalind_1833

GTTCCATTTA

Expected Output

0.00000 0.40000 0.10000 0.10000

0.40000 0.00000 0.40000 0.30000

0.10000 0.40000 0.00000 0.20000

0.10000 0.30000 0.20000 0.00000

To write this program I reused parts of the code from "Counting Point Mutations" and "Error Correction in Reads" and modified it to suit this problem. The following is the final code, which took me only 20 minutes to write. Hurray!

from Bio import SeqIO

reads = []

with open('sampledata.fasta', 'r') as f:

for record in SeqIO.parse(f, 'fasta'):

reads.append(str(record.seq))

read_len = len(reads[0])

for curr_read in reads:

distance = []

for comp_read in reads:

hamming = 0

for nt1, nt2 in zip(curr_read, comp_read):

if nt1 != nt2:

hamming += 1

distance.append(str.format('{0:.5f}', hamming / read_len))

print(*distance, sep=' ')

Tuesday 16 August 2016

Maximum Matchings and RNA Secondary Structures

This problem is very similar to the problem presented in "Perfect Matchings and RNA Secondary Structures". The difference is that in this problem we don't have the same number of A as U and G as C, so we don't get any perfect matchings. Instead we are given the task to find the total possible number of maximum matchings for a given RNA string.

Sample Dataset
>Rosalind_92
AUGCUUC

Expected Output
6

As in "Perfect Matchings and RNA Secondary Structures" the programming is very simple and it all comes down to the maths. We can simplify the problem by looking at AU and GC separately, and can describe the number of maximum matches for each part as max(A,U)!/(max(A,U)-min(A,U))! and max(G,C)!/(max(G,C)-min(G,C))!, respectively. To get the total number of maximum matches, simply multiply the answers.

For the first dataset I tried, I ran the program using Python 3.5 and got the result 434412400153932848464251158517855537428692992. This was deemed incorrect by Rosalind. After seeing a tip online, I ran the program using Python 2.7, and got 434412400153932864602133769790674698240000000 instead. So I tried my program with a new dataset and ran it with Python 2.7 and this time my answer was accepted.

Here follows the code I wrote to solve this problem (works in both 2.7 and 3.5):

from Bio import SeqIO
from math import factorial
sequence = ''
with open('sampledata.fasta', 'r') as f:
for record in SeqIO.parse(f, 'fasta'):
sequence = str(record.seq)

A, U, G, C = 0, 0, 0, 0
for nt in sequence:
if nt == 'A':
A += 1
elif nt == 'U':
U += 1
elif nt == 'G':
G += 1
elif nt == 'C':
C += 1

AU = factorial(max(A, U)) / factorial(max(A, U) - min(A, U))
GC = factorial(max(G, C)) / factorial(max(G, C) - min(G, C))
print(int(AU * GC))

Monday 15 August 2016

Ordering Strings of Varying Length Lexicographically

In this problem we are given a string representing an alphabet. We are also given an integer n ≤ 4. We are to return a list of all strings of length at most n that can be formed from the alphabet string. Repeats should be included and the strings should be sorted lexicographically according to the given alphabet (and this time they really should, as opposed to in “Enumerating k-mers Lexicographically”).

Sample Dataset
D N A
3

Expected Output
D
DD
DDD
DDN
DDA
DN
DND
DNN
DNA
DA
DAD
DAN
DAA
N
ND
NDD
NDN
NDA
NN
NND
NNN
NNA
NA
NAD
NAN
NAA
A
AD
ADD
ADN
ADA
AN
AND
ANN
ANA
AA
AAD
AAN
AAA

To begin with this problem is fairly similar to “Enumerating k-mers Lexicographically”, so I started by having a look at what I wrote there. As in that problem, I decided to use itertools, but this time, since the answer should include repeats, I used product instead. I also put the creation of the permutations into a for-loop in order to get all the different lengths, as you can see below. This generated a nested list, which I then flattened into a single list using the itertools function chain(). Then all that was left to do was to sort and print the permutations to a file. In order to sort them, I made use of the function sorted() and used a lambda function as key. Here is an explanation of the lambda function in Python. The final code can be seen below:

import itertools
n = 3
alphabet = 'DNA'
perm = []
for i in range(1, n + 1):
perm.append(list(map(''.join, (itertools.product(alphabet, repeat=i)))))
permutations = list(itertools.chain(*perm))
srt_perm = sorted(permutations,
key=lambda word: [alphabet.index(c) for c in word])
with open('answer.txt', 'a') as f:
for j in srt_perm:
f.write('%s\n' % j)

Tuesday 9 August 2016

Finding a Shared Spliced Motif

In this problem we are asked to find the longest common subsequence of two DNA-sequences, s and t. As opposed to a substring, a subsequence does not need to occur contiguously in s and t. To solve this problem it might be a good idea to have a look at the Wikipedia page on the longest common subsequence problem.

Sample Dataset
>Rosalind_23
AACCTTGG
>Rosalind_64
ACACTGTGA

Expected Output
AACTGG (or any other if there are multiple subsequences of the same length)

The following is a solution to the problem using dynamic programming:

from Bio import SeqIO
sequences = []
handle = open('sampledata.fasta', 'r')
for record in SeqIO.parse(handle, 'fasta'):
sequences.append(str(record.seq))
s = sequences[0]
t = sequences[1]

lengths = [[0 for j in range(len(t) + 1)] for i in range(len(s) + 1)]
#creates array of len(s) containing arrays of len(t) filled with 0
for i, x in enumerate(s):
for j, y in enumerate(t):
if x == y:
lengths[i + 1][j + 1] = lengths[i][j] + 1
else:
lengths[i + 1][j + 1] = max(lengths[i + 1][j], lengths[i][j + 1])

spliced_motif = ''
x, y = len(s), len(t)
while x * y != 0:
if lengths[x][y] == lengths[x - 1][y]:
x -= 1
elif lengths[x][y] == lengths[x][y - 1]:
y -= 1
else:
spliced_motif = s[x - 1] + spliced_motif
x -= 1
y -= 1
print(spliced_motif)

Monday 8 August 2016

Speeding Up Motif Finding

In this problem we are looking at how to speed up motif finding using the Knuth-Morris-Pratt algorithm. A good explanation of the algorithm can be found here. We are given a DNA-string and are expected to return the failure array for the sting.

Sample Dataset
>Rosalind_87
CAGCATGGTATCACAGCAGAG

Expected Output
0 0 0 1 2 0 0 0 0 0 0 1 2 1 2 3 4 5 3 0 0

The following code is what I ended up with:

from Bio import SeqIO
record = SeqIO.read('sampledata.fasta', 'fasta')
sequence = list(record.seq)

F_array = [0] * len(sequence)
k = 0
for i in range(2, len(sequence) + 1):
while k > 0 and sequence[k] != sequence[i - 1]:
k = F_array[k - 1]
if sequence[k] == sequence[i - 1]:
k += 1
F_array[i - 1] = k

with open('array.txt', 'w') as answer:
answer.write(' '.join(map(str, F_array)))

Friday 5 August 2016

Code School - Try R

I first encountered R during one of my master's courses at KTH. The course was in bioinformatics and biostatistics and contained a practical part during which we got to do some statistical analysis and data visualization using R. It has been a while since I last used R, so I felt that a recap of the language would be a good idea.

The top hit I got on Google was the course Try R from Code School. The course is very basic and would suit someone who is a complete beginner, or, like me, just wants a quick recap. The commands are entered in the browser and there is no need to install R on your computer for this course. It goes through the basic commands in R, different data structures like vectors and matrices, visualisation and some statistics.

I worked through this tutorial pretty quickly and a lot of what I had learned previously came back to me. I installed the latest version of R on my computer with a little help of this site and will continue with some more advanced tutorials now.

Thursday 4 August 2016

k-Mer Composition

In this problem we are given a DNA-string for which we are to return a matrix consisting of the frequencies of all possible 4-mers in the string, ordered alphabetically.

Sample Dataset
>Rosalind_6431
CTTCGAAAGTTTGGGCCGAGTCTTACAGTCGGTCTTGAAGCAAAGTAACGAACTCCACGGCCCTGACTACCGAACCAGTTGTGAGTACTCAACTGGGTGAGAGTGCAGTCCCTATTGAGTTTCCGAGACTCACCGGGATTTTCGATCCAGCCTCAGTCCAGTCTTGTGGCCAACTCACCAAATGACGTTGGAATATCCCTGTCTAGCTCACGCAGTACTTAGTAAGAGGTCGCTGCAGCGGGGCAAGGAGATCGGAAAATGTGCTCTATATGCGACTAAAGCTCCTAACTTACACGTAGACTTGCCCGTGTTAAAAACTCGGCTCACATGCTGTCTGCGGCTGGCTGTATACAGTATCTA
CCTAATACCCTTCAGTTCGCCGCACAAAAGCTGGGAGTTACCGCGGAAATCACAG

Expected Output
4 1 4 3 0 1 1 5 1 3 1 2 2 1 2 0 1 1 3 1 2 1 3 1 1 1 1 2 2 5 1 3 0 2 2 1 1 1 1 3 1 0 0 1 5 5 1 5 0 2 0 2 1 2 1 1 1 2 0 1 0 0 1 1 3 2 1 0 3 2 3 0 0 2 0 8 0 0 1 0 2 1 3 0 0 0 1 4 3 2 1 1 3 1 2 1 3 1 2 1 2 1 1 1 2 3 2 1 1 0 1 1 3 2 1 2 6 2 1 1 1 2 3 3 3 2 3 0 3 2 1 1 0 0 1 4 3 0 1 5 0 2 0 1 2 1 3 0 1 2 2 1 1 0 3 0 0 4 5 0 3 0 2 1 1 3 0 3 2 2 1 1 0 2 1 0 2 2 1 2 0 2 2 5 2 2 1 1 2 1 2 2 2 2 1 1 3 4 0 2 1 1 0 1 2 2 1 1 1 5 2 0 3 2 1 1 2 2 3 0 3 0 1 3 1 2 3 0 2 1 2 2 1 2 3 0 1 2 3 1 1 3 1 0 1 1 3 0 2 1 2 2 0 2 1 1

Regardless of the given sequence, the possible 4-mers are always the same and we can generate them with the following code:

import itertools
nt = 'ACGT' #Use this order of nt to get correct order later without sorting
permutations = itertools.product(nt, repeat=4)

kmers = []
for i, j in enumerate(list(permutations)):
kmer = ''
for item in j:
kmer += str(item)
kmers.append(kmer)

This gives us a list of all the possible 4-mers in alphabetical order (note that the funktion sorts the permutations in the order the letters are listed in. If nt is not entered alphabetically you will need to sort kmers).

Now we can extrakt the sequence from the FASTA file and use regex to find all the occurrences of the k-mers. Remember to use ?= in the pattern to include overlapping k-mers:

import re
from Bio import SeqIO

record = SeqIO.read('sampledata.fasta', 'fasta')
sequence = record.seq

A = []
for k in kmers:
occurence = 0
pattern = re.compile(r'(?=(' + k + '))')
for l in re.findall(pattern, str(sequence)):
occurence += 1
A.append(occurence)
print(*A, sep=' ')

Counting Phylogenetic Ancestors

This was probably the easiest problem so far. In fact, you don't even need to do any programming because the calculation is so simple. The only thing you need to realise is that for any unrooted binary tree with n leaves, the number of internal nodes is equal to n-2.

An easy way to convince yourself of this is to doodle the trees on some paper. Then you can see that a tree with 3 leaves has 1 internal node, 4 leaves have 2 internal nodes, 5 leaves have 3 internal nodes, 6 has 4, 7 has 5, 8 has 6 and so on:

Wednesday 3 August 2016

Error Correction in Reads

In this problem we are looking at the errors that occur during genome sequencing. We are given a set of reads of equal length, some of which contain an error (one nt is exchanged). The reads are either of the following:

The read was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement)
The read is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement)

From this dataset we are expected to find the incorrect reads and correct them.

Sample Dataset

>Rosalind_52

TCATC

>Rosalind_44

TTCAT

>Rosalind_68

TCATC

>Rosalind_28

TGAAA

>Rosalind_95

GAGGA

>Rosalind_66

TTTCA

>Rosalind_33

ATCAA

>Rosalind_21

TTGAT

>Rosalind_18

TTTCC

Expected Output

TTCAT->TTGAT

GAGGA->GATGA

TTTCC->TTTCA

I really enjoyed working on this problem and I managed to solve it pretty quickly. A problem I had though was that I would get the correct result for the sample dataset, but when I tried with a 'real' dataset I got way too many reads in my answer, something like 4 times as many as the input reads. I looked at my output and could conclude that a lot of the reads occurred multiple times. I figured something was wrong in the part of my program where I build the list of correct reads, and sure enough, I managed to find the error and fix it (highlighted below). The following code is what I finally ended up with:

from Bio import SeqIO

from Bio.Seq import Seq

from Bio.Alphabet import generic_dna

reads = []

handle = open('sampledata.fasta', 'r')

for record in SeqIO.parse(handle, 'fasta'):

reads.append(str(record.seq))

handle.close()

right = []

wrong = []

for i, j in enumerate(reads):

read = Seq(j, generic_dna)

rev_read = read.reverse_complement()

for k in range(i + 1, len(reads)):

if read == reads[k] or rev_read == reads[k]:

if read not in right and rev_read not in right:

right.append(str(read))

right.append(str(rev_read))

for l in reads:

if l not in right:

wrong.append(l)

for incorrect in wrong:

for correct in right:

hamming = 0

for nt1, nt2 in zip(incorrect, correct):

if nt1 != nt2:

hamming += 1

if hamming > 2:

break

if hamming == 1:

with open('answer.txt', 'a') as textfile:

print(incorrect, '->', correct, sep='', file=textfile)

Monday 1 August 2016

Completing a Tree

In this problem we are asked to find out the minimum number of edges we need to add to a graph of n nodes described by a given adjacency list in order to produce a tree. I quite quickly managed to overthink the problem and got caught up trying to build the graphs described by the adjacency list, until I finally realised there is a much simpler solution.

Sample dataset:

1 2

2 8

4 10

5 9

6 10

7 9

Expected output:

The key to this problem is to realise that a tree with n nodes contains n-1 edges. If we then look at the adjacency list, every element of this list represents a node. Thus, to obtain the number of nodes needed to produce a tree, we simply need to make the calculation n-1-elements in the list. The following Python code does just that:

data = []

with open('sampledata.txt', 'r') as f:

for line in f:

split_data = [int(x) for x in line.split()]

data.append(split_data)

n = data[0][0]

edges = data[1:]

print(n - len(edges) - 1)