Sara does Bioinformatics

Tuesday, 14 March 2017

Shell Scripting

It's been a while since my last post. The reason for this is that I got a job! Wohoo! So for the past two months there has been a great deal of new things to wrap my head around and thus I have not had the same amount of spare time and energy to continue with the Rosalind problems at the same pace as earlier. However, I do plan to continue my bioinformatics education and will continue to write about it here.

One thing that I want to become better at is bash scripting and I feel that this would be very useful in the role I'm in as well. So I did a bit of research and found the "Advanced bash scripting guide", a very in-depth book/tutorial on shell scripting. I have worked through tree chapters and so far there has been a lot of useful information.

That's it for now!

Wednesday, 30 November 2016

Matching a Spectrum to a Protein

In this problem we are given a set of n protein strings and a multiset of weights corresponding to the complete spectrum of some unknown protein string. We are then asked to find the maximum multiplicity and the protein string that corresponds to it. My solution to the problem can be found on my GitHub. For each protein string, it calculates the weight of each prefix and suffix and checks if they correspond to any of the given weights of the protein spectrum. Each time they do a counter is increased by one for that specific protein string. Finally, the highest value is printed together with the corresponding string.

Sample Dataset
4
GSDMQS
VWICN
IASWMQS
PVSMGAD
445.17838
115.02694
186.07931
314.13789
317.1198
215.09061

Sample Output
3
IASWMQS

I actually didn't manage to replicate the result above, which is the example given in the problem description. However, when I tried my program with a larger dataset and uploaded my answer to Rosalind, I got the correct result, so I must have done something right!

Thursday, 24 November 2016

Global Alignment with Scoring Matrix

In this problem we are asked to find the alignment score of two protein sequences using the scoring matrix BLOSUM62 and a linear gap penalty of 5. I decided to use Biopython's pairwise2 for this, as you can see in the code below and on my GitHub.

from Bio import pairwise2

from Bio import SeqIO

from Bio.SubsMat.MatrixInfo import blosum62

seqs = []

with open('rosalind_glob.fasta', 'r') as f:

for record in SeqIO.parse(f, 'fasta'):

seqs.append(record.seq)

s = seqs[0]

t = seqs[1]

alignments = pairwise2.align.globalds(s, t, blosum62, -5, -5)

print(pairwise2.format_alignment(*alignments[0]))

Note that the gap open penalty and the gap extension penalty are both set to -5, as we are told in the problem description to use a linear gap penalty of 5.

Thursday, 17 November 2016

Creating a Character Table from Genetic Strings

In this problem we are given a set of DNA strings and are asked to create a character table. My solution to the problem can be found here. I got a bit hung up on the problem to start with because in the description they give the following sample data and output:

Sample Dataset
ATGCTACC
CGTTTACC
ATTCGACC
AGTCTCCC
CGTCTATC

Sample Output
10110
10100

It took me a while to realise that the output is only partial, and what they are really after is:

10110
10100
10000
10111
11011
11101

Which is a character array for each non trivial position in the sequences. When I finally had this figured, the programming itself didn't take too long.

Wednesday, 16 November 2016

Counting Disease Carriers

Once again we are looking at probability, this time regarding the occurrence of alleles in a population following the Hardy-Weinberg principle. For a population in genetic equilibrium for some given alleles, we are given the frequency of homozygous recessive individuals for each one and are asked to return the probabilities of any randomly selected individual having at least one recessive allele.

We can divide the population into three groups:

A - Homozygous recessive
B - Heterozygous
C - Homozygous dominant

We know that A + B + C = 1. The probability of picking individuals who are not homozygous dominants is P = 1 - C = A + B, which is what we are after.

If we denote the probability of a chromosome having a recessive allele q, and the probability of a chromosome having a dominant allele p, then since each individual carries two chromosomes a homozygous recessive can be described as A = q^2, a heterozygous individual is B = 2pq, and a homozygous dominant is C = p^2.

If we put this together we get that P = q^2 + 2pq. We also know that p + q = 1. So p = 1-q, which gives us P=q^2 + 2(1-q)q which in turn can be simplified to P = 2A^0.5-A.

Now we have our formula, so all we need to do is write a program that makes the calculation for all the given alleles. You can find my version here or below.

A = []
with open('rosalind_afrq.txt','r') as f:
for nr in f.readline().split(' '):
A.append(float(nr))

for i in A:
print(round(2*i**0.5-i,3),end=' ')

Tuesday, 15 November 2016

Newick Format with Edge Weights

If you have already solved "Distances in Trees", then this problem will probably not take that long to complete. When I solved distances in trees, I used the Biopython module Phylo. The problem I had then was that the given trees did not contain any branch lengths, so I had to add this in the program. However in this problem the branch lengths are already included in the trees, so the new program is in fact even shorter than the original. You can find my new version of the code below and on my GitHub.

import sys
from Bio import Phylo
import io

f = open('rosalind_nkew.txt', 'r')
pairs = [i.split('\n') for i in f.read().strip().split('\n\n')]

for i, line in pairs:
x, y = line.split()
tree = Phylo.read(io.StringIO(i), 'newick')
sys.stdout.write('%s' % round(tree.distance(x,y)) + ' ')
sys.stdout.write('\n')

Tuesday, 8 November 2016

Edit Distance Alignment

This problem builds on a previous one called Edit Distance, for which you can find a solution here. In that problem we were asked to find the minimum number of changes needed to transform a string s into a string t. In this problem, we are asked once again to find this number (the edit distance), but this time we should also print the two augmented strings representing an optimal alignment of s and t.

To do this I went back to my solution of the previous problem and added some code to pick out one of the possible optimal alignments from the matrix formed by the existing program. The code is a bit messy, but for those who are interested it it is available on my GitHub.