Sara does Bioinformatics: November 2016

Wednesday 30 November 2016

Matching a Spectrum to a Protein

In this problem we are given a set of n protein strings and a multiset of weights corresponding to the complete spectrum of some unknown protein string. We are then asked to find the maximum multiplicity and the protein string that corresponds to it. My solution to the problem can be found on my GitHub. For each protein string, it calculates the weight of each prefix and suffix and checks if they correspond to any of the given weights of the protein spectrum. Each time they do a counter is increased by one for that specific protein string. Finally, the highest value is printed together with the corresponding string.

Sample Dataset
4
GSDMQS
VWICN
IASWMQS
PVSMGAD
445.17838
115.02694
186.07931
314.13789
317.1198
215.09061

Sample Output
3
IASWMQS

I actually didn't manage to replicate the result above, which is the example given in the problem description. However, when I tried my program with a larger dataset and uploaded my answer to Rosalind, I got the correct result, so I must have done something right!

Thursday 24 November 2016

Global Alignment with Scoring Matrix

In this problem we are asked to find the alignment score of two protein sequences using the scoring matrix BLOSUM62 and a linear gap penalty of 5. I decided to use Biopython's pairwise2 for this, as you can see in the code below and on my GitHub.

from Bio import pairwise2

from Bio import SeqIO

from Bio.SubsMat.MatrixInfo import blosum62

seqs = []

with open('rosalind_glob.fasta', 'r') as f:

for record in SeqIO.parse(f, 'fasta'):

seqs.append(record.seq)

s = seqs[0]

t = seqs[1]

alignments = pairwise2.align.globalds(s, t, blosum62, -5, -5)

print(pairwise2.format_alignment(*alignments[0]))

Note that the gap open penalty and the gap extension penalty are both set to -5, as we are told in the problem description to use a linear gap penalty of 5.

Thursday 17 November 2016

Creating a Character Table from Genetic Strings

In this problem we are given a set of DNA strings and are asked to create a character table. My solution to the problem can be found here. I got a bit hung up on the problem to start with because in the description they give the following sample data and output:

Sample Dataset
ATGCTACC
CGTTTACC
ATTCGACC
AGTCTCCC
CGTCTATC

Sample Output
10110
10100

It took me a while to realise that the output is only partial, and what they are really after is:

10110
10100
10000
10111
11011
11101

Which is a character array for each non trivial position in the sequences. When I finally had this figured, the programming itself didn't take too long.

Wednesday 16 November 2016

Counting Disease Carriers

Once again we are looking at probability, this time regarding the occurrence of alleles in a population following the Hardy-Weinberg principle. For a population in genetic equilibrium for some given alleles, we are given the frequency of homozygous recessive individuals for each one and are asked to return the probabilities of any randomly selected individual having at least one recessive allele.

We can divide the population into three groups:

A - Homozygous recessive
B - Heterozygous
C - Homozygous dominant

We know that A + B + C = 1. The probability of picking individuals who are not homozygous dominants is P = 1 - C = A + B, which is what we are after.

If we denote the probability of a chromosome having a recessive allele q, and the probability of a chromosome having a dominant allele p, then since each individual carries two chromosomes a homozygous recessive can be described as A = q^2, a heterozygous individual is B = 2pq, and a homozygous dominant is C = p^2.

If we put this together we get that P = q^2 + 2pq. We also know that p + q = 1. So p = 1-q, which gives us P=q^2 + 2(1-q)q which in turn can be simplified to P = 2A^0.5-A.

Now we have our formula, so all we need to do is write a program that makes the calculation for all the given alleles. You can find my version here or below.

A = []
with open('rosalind_afrq.txt','r') as f:
for nr in f.readline().split(' '):
A.append(float(nr))

for i in A:
print(round(2*i**0.5-i,3),end=' ')

Tuesday 15 November 2016

Newick Format with Edge Weights

If you have already solved "Distances in Trees", then this problem will probably not take that long to complete. When I solved distances in trees, I used the Biopython module Phylo. The problem I had then was that the given trees did not contain any branch lengths, so I had to add this in the program. However in this problem the branch lengths are already included in the trees, so the new program is in fact even shorter than the original. You can find my new version of the code below and on my GitHub.

import sys
from Bio import Phylo
import io

f = open('rosalind_nkew.txt', 'r')
pairs = [i.split('\n') for i in f.read().strip().split('\n\n')]

for i, line in pairs:
x, y = line.split()
tree = Phylo.read(io.StringIO(i), 'newick')
sys.stdout.write('%s' % round(tree.distance(x,y)) + ' ')
sys.stdout.write('\n')

Tuesday 8 November 2016

Edit Distance Alignment

This problem builds on a previous one called Edit Distance, for which you can find a solution here. In that problem we were asked to find the minimum number of changes needed to transform a string s into a string t. In this problem, we are asked once again to find this number (the edit distance), but this time we should also print the two augmented strings representing an optimal alignment of s and t.

To do this I went back to my solution of the previous problem and added some code to pick out one of the possible optimal alignments from the matrix formed by the existing program. The code is a bit messy, but for those who are interested it it is available on my GitHub.

Monday 7 November 2016

Constructing a De Bruijn Graph

In this problem we are asked to find the adjacency list corresponding to the De Bruijn graph constructed from a set of short reads and their reverse complements. To do this was actually a lot easier than I initially thought. After a bit of research on De Bruijn graphs and a bit of doodling, I figured out the following three steps to complete the task:

Make the reverse complements and combine with the initial reads.
Add all unique reads (whether they are reverse complements or the original reads) to a set.
For each of the reads in the set, print the prefix followed by the suffix.

I set about to write a program that performs these three steps and the following is what I managed to come up with:

from Bio.Seq import Seq

from Bio.Alphabet import IUPAC

def rev_comp(read):

seq = Seq(read, IUPAC.unambiguous_dna)

rev_seq = seq.reverse_complement()

return(str(rev_seq))

def mk_set(seqs):

S = set()

for i in seqs:

if i not in S:

S.add(i)

C = rev_comp(i)

if C not in S:

S.add(C)

return(S)

data = []

with open('rosalind_dbru.txt','r') as f:

for line in f:

data.append(line.strip())

s = mk_set(data)

for j in s:

print("(" + j[:-1] + ", " + j[1:] + ")")

The code can also be found on my GitHub along with the sample data and problem description.