Sara does Bioinformatics: July 2016

Friday 29 July 2016

Transitions and Transversions

This problem was a really quick one. It took me less than 20 minutes to solve! Hurray!

We are asked to compare two sequences of equal length and classify the mutations as either transitions (substituting a purine to another purine or a pyrimidine to another pyrimidine) or transversions (substituting a purine to a pyrimidine or vice versa). We should then return the transition/transversion ratio for the sequences.

Sample dataset:
>Rosalind_0209
GCAACGCACAACGAAAACCCTTAGGGACTGGATTATTTCGTGATCGTTGTAGTTATTGGA
AGTACGGGCATCAACCCAGTT
>Rosalind_2200
TTATCTGACAAAGAAAGCCGTCAACGGCTGGATAATTTCGCGATCGTGCTGGTTACTGGC
GGTACGAGTGTTCCTTTGGGT

Expected output:
1.21428571429

The problem is very similar to Counting Point Mutations in which we calculated the Hamming distance. I used my code from that problem as a starting point and this is the altered code:

from Bio import SeqIO

sequences = []

handle = open('sampledata.fasta', 'r')

for record in SeqIO.parse(handle, 'fasta'):

sequences.append(str(record.seq))

handle.close()

s1 = sequences[0]

s2 = sequences[1]

transition = 0

transversion = 0

AG = ['A', 'G']

CT = ['C', 'T']

for nt1, nt2 in zip(s1, s2):

if nt1 != nt2:

if nt1 in AG and nt2 in AG:

transition += 1

elif nt1 in CT and nt2 in CT:

transition += 1

else:

transversion += 1

print('%0.11f' % (transition / transversion))

Finding a Spliced Motif

This time we are once again asked to find the position of a given subsequence for a given sequence. However, this time we should take into account that the subsequence can be spliced, i.e. it can be split up in the sequence and there can be other nucleotides between the parts. There could be multiple ways that the subsequence can be found in the sequence, but we only have to return one of them in the form of the positions each letter of the subsequence has in the sequence.

Sample dataset:
>Rosalind_14
ACGTACGTGACG
>Rosalind_18

GTA

Expected output:
3 8 10
(or any of the other possible combinations)

My first thought was to look at my solution for Finding a Motif in DNA, but in that problem I used Biopython to find the motifs and I wasn't able to find a way to adapt it to finding spliced motifs. Another thought was to use regular expressions. However, I quite quickly managed to come up with this rather simple solution instead:

from Bio import SeqIO

sequences = []

handle = open('sampledata.fasta', 'r')

for record in SeqIO.parse(handle, 'fasta'):

sequences.append(str(record.seq))

handle.close()

s = sequences[0]

t = sequences[1]

pos = 0

positions = []

for i in range(len(t)):

for j in range(pos, len(s)):

pos += 1

if len(positions) < len(t):

if t[i] == s[j]:

positions.append(pos)

break

print(*positions, sep=' ')

Thursday 28 July 2016

Enumerating Oriented Gene Orderings

This is yet another problem with permutations. As in the previous problems of this type we get a positive integer n and are supposed to find some permutations. This time we are looking at signed permutations, that is, for all the positive integers in the range 1 to n we should also take into account their corresponding negative value. As in Enumerating Gene Orders, we should print the total number of permutations followed by each of the permutations.

Sample dataset:

Expected output:

-1 -2

-1 2

1 -2

1 2

-2 -1

-2 1

2 -1

2 1

Note that we should not count permutations of the same integer (i. e. -1 1, 2 -1 and so on should not be included).

For this problem we can once again utilise the itertools function permutations. However, this time we need to pair it to the itertools function product. The following is my code:

import itertools

n = 3

permutation = []

nr = 0

for i in itertools.permutations(list(range(1, n + 1))):

for j in itertools.product([-1, 1], repeat=len(list(range(1, n + 1)))):

perm = [a * sign for a, sign in zip(i, j)]

permutation.append(perm)

nr += 1

print(nr)

for i in range(len(permutation)):

print(*permutation[i], sep=' ')

Friday 22 July 2016

Introduction to Random Strings

We know that genomes are not just a random collection of A,T,C and G, and that across a genome there are loads of different motifs, some of which are similar in many different species. But since the human genome is huge, we need to account for the possibility that the subsequence we are looking at was created at random. In this problem we look at a simplified way to calculate the probability that a certain subsequence occurs randomly. We are given a substring of at most 80 bp and an array containing the GC-content of up to 20 random strings, and we are asked to find log10 of the probability that the substring will match the random strings exactly, and return the result in an array of equal length of the given array.

Sample dataset:

ACGATACAA

0.129 0.287 0.423 0.476 0.641 0.742 0.783

Expected output:

-5.737 -5.217 -5.263 -5.360 -5.958 -6.628 -7.009

The probability, P, of the subsequence occuring in a sequence of a GC content x, can be simplified and written as follows, where AC is the total nr of A and C in the subsequence and GC is the total number of G and C:

To solve this problem, we need to write a program that extracts the GC contents of the array and counts AC and GC of the subsequence. Then it's just a simple matter of iterating over the list of GC contents to calculate the probabilities using the above equation. Here is my final version:

import math

AT = 0

GC = 0

with open('sampledata.txt', 'r') as f:

for line in f:

if line[0] != 'A' and line[0] != 'T' and line[0] != 'G' and line[

0] != 'C':

numbers = line.split()

GC_contents = [float(x) for x in numbers]

for i in line:

if i == 'A' or i == 'T':

AT += 1

elif i == 'G' or i == 'C':

GC += 1

probabilities = []

for j in range(len(GC_contents)):

prob = math.log10((((1 - GC_contents[j]) / 2)**AT) * (GC_contents[j] / 2)

**GC)

probabilities.append('%0.3f' % prob)

print(*probabilities, sep=' ')

Thursday 21 July 2016

Partial Permutations

We are back with the permutations once again. This time we are set the task to find the number of partial permutations that can be formed from a set of numbers. We are given two integers, n and k, where the range of 1 to n is the set of numbers and k is the length of the partial mutations and should return the number of partial mutations modulo 1 000 000.

Sample data:
21 7

Expected output:
51200

I wrote two programs that solve this problem. The first uses the math module permutations to find all possible permutations and count them. You can have a look at the code below. However, this program is inefficient which is why I chose to write a second, faster program.

Slow approach:

import itertools
n = 21
k = 7
count = 0
for p in itertools.permutations(range(1, n + 1), k):
count += 1
print(count % 1000000)

Faster approach:

n = 21
k = 7

partial_perm = 1

for i in range(k):
partial_perm *= (n - i)
print(partial_perm % 1000000)

Both of the programs work and complete the problem within the time limits of Rosalind, but the second program is much faster than the first one.

Perfect Matchings and RNA Secondary Structures

This is yet another problem where the programming is really simple, but we need too figure out the maths before we can write the program.

The problem is linked to RNA folding and we are assuming that in the given RNA string, every nucleotide forms part of a base pair in the RNA molecule. We then look at the n long molecule as a graph containing n nodes where the A-nodes can form an edge (i.e base pair) with the U-nodes and the G-nodes can form an edge with the C-nodes. From this scenario we are asked to find all possible perfect matchings, where a matching is a set of edges in a graph where none of the edges include the same node. A matching is said to be perfect if for a graph of 2n nodes it contains n edges.

Given dataset:
>Rosalind_23

AGCUAGUCAU
(The sequence always contains equally many A's as U's and equally many G's as C's)

Expected Output:
12

What we need to realise is that what we are looking at can be viewed as two separate graphs, one for AU bonding and one for GC bonding. These two are complete bipartite graphs and we can describe the number of perfect matchings for each graph, if n1 = nr of A's and n2 = nr of G's, as n1! and n2!. To get the total amount of perfect matchings for the two graphs combined, we simply multiply these numbers. We thereby get:

matchings = n1! * n2!

All that is left to do now is to write a simple program to run the calculations:

from Bio import SeqIO

from math import factorial

sequence = ''

handle = open('sampledata.fasta', 'r')

for record in SeqIO.parse(handle, 'fasta'):

sequence = str(record.seq)

handle.close()

AU = 0

GC = 0

for nt in sequence:

if nt == 'A':

AU += 1

elif nt == 'G':

GC += 1

matchings = factorial(AU) * factorial(GC)

print(matchings)

Wednesday 20 July 2016

Genome Assembly as Shortest Superstring

Before reading any further in this post, I feel that I must warn you: my solution to this problem is very inefficient, not to say very ugly and probably hard to follow for anyone who isn't me. I know, great programming practice right here! I am still a bit proud of the solution though, because even though it might be the most hideous thing you have seen this month (year?) I managed to come up with it almost entirely without looking things up! And, hey, at least it managed to finish within the five minute limit! So here it goes... Please try not to laugh too hard at me...

Problem Description:
For a set of sequences in FASTA format, we are asked to find the shortest superstring that they make up. The criteria on the sequences is that there exists a unique way to reconstruct the superstring from them by putting together pairs of reads that overlap by more than half their length.

Sample Dataset:
>Rosalind_56
ATTAGACCTG
>Rosalind_57
CCTGCCGGAA
>Rosalind_58
AGACCTGCCG
>Rosalind_59

GCCGGAATAC

Expected output:
ATTAGACCTGCCGGAATAC

Final code:
from Bio import SeqIO
reads = []
handle = open('sampledata.fasta', 'r')
for record in SeqIO.parse(handle, 'fasta'):
reads.append(str(record.seq))
handle.close()

overlaps = []
overlapping = []
for i in range(len(reads)):
curr_read = reads[i]
for j in range(len(curr_read) // 2, len(curr_read)):
curr_suffix = curr_read[-(j + 1):]
for k in range(len(reads)):
curr_comp = reads[k]
for l in range(len(curr_comp) // 2, len(curr_comp)):
curr_prefix = curr_comp[:l]
if curr_suffix == curr_prefix:
overlaps.append(k)
overlapping.append([len(curr_suffix), i, k])

s = set(overlaps)
first_read = ''
count = len(overlapping)
for m in range(len(overlapping)):
suf_index = overlapping[m][1]
if suf_index not in s: #find first read and initialise new str
first_read = suf_index
new_str = reads[overlapping[m][1]] + reads[overlapping[m][2]][
overlapping[m][0]:]
count -= 1
pref_index = overlapping[m][2]
while count > 0: #when the first read is found, add
for n in range(len(overlapping)): #the remaining in the correct order
if pref_index == overlapping[n][1]:
new_str += reads[overlapping[n][2]][overlapping[n][0]:]
pref_index = overlapping[n][2]
count -= 1

print(new_str)

Yeah, so... Sorry about that.

Monday 18 July 2016

Longest Increasing Subsequence

This problem had me going for quite a while, and I ended up getting a lot of help from various places on the internet. But I have now managed to solve it and I will try to explain my solution here. Let's start with the description.

In this problem we are given a permutation of length n (where n is smaller than or equal to 10 000) and we are asked to find the longest increasing and decreasing subsequence of the permutation. A subsequence of a permutation is a collection of elements from the permutation in the order that they appear (note that they don't have to come directly after one another). The subsequence is said to be increasing if the elements increase (from left to right) and it is said to be decreasing if the elements decrease (again, left to right). For example, in the permutation (5, 1, 4, 2, 3) the subsequence (1, 2, 3) is increasing and the subsequence (5, 4, 2) is decreasing.

The fact that the subsequence can 'skip' numbers messed up my initial solution for this problem, but after spending some time trying to figure out how the algorithm described in this Wikipedia page works and looking through a very good explanation of it on Stack Overflow (as well as working through the algorithm with pen and paper to figure out how it operates. I know, I'm a tad bit old-school), I managed to adapt the code on Stack Overflow to also find the longest decreasing subsequence. The code below is what I ended up with. The function for finding the decreasing subsequence is talmost he same as for finding the increasing subsequence, so I have highlighted the differences. I would highly recommend having a look at the thread on Stack Overflow if you would like a good explanation of the algorithm.

data = [] #This first bit reads the file which
with open('sampledata.txt', 'r') as f: #contains the the length of the
for line in f: #permutation and then the permutation,
for nr in line.split(): #separated by spaces. The numbers are
data.append(int(nr)) #appended to a list as integers.
perm = data[1:]

def increasing(seq):
P = [None] * len(seq)
M = [None] * len(seq)

L = 1
M[0] = 0
for i in range(1, len(seq)):
lo = 0
hi = L
if seq[M[hi - 1]] < seq[i]:
j = hi
else:
while hi - lo > 1:
mid = (hi + lo) // 2
if seq[M[mid - 1]] < seq[i]:
lo = mid
else:
hi = mid

j = lo
P[i] = M[j - 1]
if j == L or seq[i] < seq[M[j]]:
M[j] = i
L = max(L, j + 1)

result = []
pos = M[L - 1]
for k in range(L):
result.append(seq[pos])
pos = P[pos]

return (result[::-1])

def decreasing(seq):
  P = [None] * len(seq)
  M = [None] * len(seq)

  L = 1
  M[0] = 0
  for i in range(1, len(seq)):
  lo = 0
  hi = L
  if seq[M[hi - 1]] > seq[i]:
  j = hi
  else:
  while hi - lo > 1:
  mid = (hi + lo) // 2
  if seq[M[mid - 1]] > seq[i]:
  lo = mid
  else:
  hi = mid

  j = lo
  P[i] = M[j - 1]
  if j == L or seq[i] > seq[M[j]]:
  M[j] = i
  L = max(L, j + 1)

  result = []
  pos = M[L - 1]
  for k in range(L):
result.append(seq[pos])
  pos = P[pos]

  return (result[::-1])

incr = increasing(perm)
decr = decreasing(perm)

print(*incr)
print(*decr)

Thursday 14 July 2016

Enumerating k-mers Lexicographically

In this problem we are given a set of characters and a positive integer n and we are asked to return all the strings of length n that can be formed from the characters (including duplicates). The answer should be sorted lexicographically according to the order that the characters are presented in the sample data (i.e. not alphabetically). Once again the built-in Python library itertools comes in handy. This time I used the function products because the problem asks for duplicates as well (that is, each letter may occur up to n times in a string an not only once). The following is the code I came up with:

import itertools

n = 4

s = ['H', 'U', 'P', 'M']

perm = itertools.product(s, repeat=n)

for i, j in enumerate(list(perm)):

permutation = ''

for item in j:

permutation += str(item)

with open('answer.txt', 'a') as text_file:

print(permutation.strip(), file=text_file)

Now, this code works just fine, and when I look at the results they match what is described in the problem description, but I can't seem to get my answers to pass. I have tried seven different datasets now, either pasting the answer into the browser or uploading the answer file. I have made sure there are no additional blank spaces or new lines and even tried concatenating the answer into one long string, but nothing I do seems to work. I have noticed that several people have had the same problem rather recently, which has lead me to believe that the problem is probably not from my side, but rather from Rosalind's side. Either way, I think I will just let this problem rest for a while and press on with the other problems instead.

EDIT:

The problem has finally been solved! Apparently, the answer SHOULD be sorted alphabetically, even though it says not to in the note below the problem statement. Also, according to someone in the questions section, the permutations should not be separated with new lines but instead with blank spaces. This is rather annoying because this is exactly what my first attempt at the program did, but then I read the note and changed it to the code above... Anyhow, I can now move on with my life. Below is the new code that produces results that are accepted by Rosalind, although that isn't what they asked for...

import itertools

n = 4

s = ['H', 'U', 'P', 'M']

perm = itertools.product(s, repeat=n)

answer = []

for i, j in enumerate(list(perm)):

permutation = ''

for item in j:

permutation += str(item)

answer.append(permutation)

sorted_answer = sorted(answer)

with open('answer.txt', 'a') as text_file:

print(*sorted_answer, sep=' ', file=text_file)

Wednesday 13 July 2016

RNA Splicing

This time we are looking at splicing. When mRNA has been transcribed from a DNA strand it often contains introns that have to be spliced away before the mRNA is translated into a protein. In this task we are given a FASTA file containing the sequences of one DNA strand and several introns. The goal is to remove the introns and print the resulting protein.

So, I wrote the following program that does exactly that:

from Bio import SeqIO

from Bio.Seq import Seq

from Bio.Alphabet import generic_dna

sequences = []

handle = open('sampledata.fasta', 'r')

for record in SeqIO.parse(handle, 'fasta'):

sequence = ''

for nt in record.seq:

sequence += nt

sequences.append(sequence)

handle.close()

long_seq = sequences[0]

introns = sequences[1:]

for i in range(len(introns)):

long_seq = long_seq.replace(introns[i], '')

long_seq = Seq(long_seq)

print(long_seq.translate(to_stop=True))

Tuesday 12 July 2016

Locating Restriction Sites

In this problem we are asked to find the reverse palindromes of a given DNA sequence. A piece of DNA is said to be a reverse palindrome when it is equal to it's reverse complement.

This problem was quite similar to the previous problem "Finding a shared motif" so I adapted that code slightly and ended up with this:

from Bio import SeqIO

record = SeqIO.read('sampledata.fasta', 'fasta')

frw_seq = str(record.seq)

rev_seq = str(record.seq.complement())

for i in range(len(frw_seq)):

for j in range(i, len(frw_seq)):

m = frw_seq[i:j + 1]

rev_m = rev_seq[i:j + 1]

if len(m) >= 4 and len(m) <= 12:

if m == rev_m[::-1]:

print(i + 1, len(m))

The program works through the forward string and its complementary string and picks out all pieces that are longer or equal to 4 but shorter or equal to 12. Then it compares each forward piece with its reverse complement and if they are equal, prints the position and the size of the palindrome. There are probably much more efficient ways to do this, but it works and I managed to write the program a lot quicker than I thought I would!

Calculating Protein Mass

In this problem we are asked to calculate the molecular weight of a peptide. The peptide is assumed to come from the middle of a protein so we are to assume that it consists entirely of amino acid residues (meaning we don't have to account for the extra weight of the "water molecule" present when we include the two edges of the protein).

I wrote two programs for this problem, the first one uses Biopython and the second doesn't. The following code is the one using Biopython:

from Bio.Seq import Seq

from Bio.Alphabet import generic_protein

from Bio.SeqUtils import molecular_weight

with open('sampledata.txt', 'r') as f:

for line in f:

prot_seq = line.strip('\n')

print('%0.3f' % (molecular_weight(

Seq(prot_seq, generic_protein),

monoisotopic=True) - 18.01056))

This program uses the Biopython function molecular_weight from SeqUtils. The function sets monoisotopic=False by default, but because the problem specified that we should use the monoisotopic weitghts we need to set it to monoisotopic=True. Also, the function includes the extra weight of one water molecule, so we need to remove that manually (hence the - 18.01056). Come to think of it there is an option called circular that states that the sequence has no ends. Maybe that would work as well?

Anyhow, I thought I'd try to make this program without the use of Biopython as well, and the following is what I came up with:

weights = {'A': 71.03711,

'C': 103.00919,

'D': 115.02694,

'E': 129.04259,

'F': 147.06841,

'G': 57.02146,

'H': 137.05891,

'I': 113.08406,

'K': 128.09496,

'L': 113.08406,

'M': 131.04049,

'N': 114.04293,

'P': 97.05276,

'Q': 128.05858,

'R': 156.10111,

'S': 87.03203,

'T': 101.04768,

'V': 99.06841,

'W': 186.07931,

'Y': 163.06333}

with open('sampledata.txt', 'r') as f:

for line in f:

prot_seq = line.strip('\n')

weight = 0

for aa in prot_seq:

weight += weights[aa]

print('%0.3f' % weight)