Sara does Bioinformatics: June 2016

Wednesday 29 June 2016

Mortal Fibonacci Rabbits

In this Rosalind problem we are revisiting the Fibonacci Rabbits. This time the bunnies sadly aren't immortal and will die after m months. Apart from that they behave as in the previous problem but they only produce one pair of offspring per month in adulthood. This leaves us with two different recurrence relations to calculate the amounts of rabbits present in any given month n. The months before any rabbits start dying off, i.e. when n ≤ m can be described by the recurrence relation of the Fibonacci numbers. The months after this, i.e. when n > m, can be described by the following formula:

Fn = Fn-1 + Fn-2 - Fn-(m+1)

To calculate the number of rabbits after 96 months, when the rabbits are dying after 17 months, I wrote the following program:

n = 96

m = 17

bunnies = [1, 1]

months = 2

while months < n:

if months < m:

bunnies.append(bunnies[-2] + bunnies[-1])

elif months == m or count == m + 1:

bunnies.append(bunnies[-2] + bunnies[-1] - 1)

else:

bunnies.append(bunnies[-2] + bunnies[-1] - bunnies[-(

m + 1)])

months += 1

print(bunnies[-1])

So, care to take a guess what the result after 96 months was? 51159459138167757395. Pairs...

Tuesday 28 June 2016

Consensus and Profile

For this problem we are asked to find a sequence that is as similar to the given set of sequences as possible (the consensus sequence). To complete the task the problem states that we should make a profile matrix containing the number of times each nucleotide (A, T, G or C) occurs at each position of the sequences. We should then be able to build the consensus sequence from this.

I started by breaking down the problem into four different parts. First off, I wanted the program to parse the FASTA-file and add the sequences to a nested list. This way I could then easily iterate over the sequences in the list in order to count the occurrence of each nucleotide in each position. The following code opens the FASTA-file and adds each sequence to a separate list inside the list sequences.

from Bio import SeqIO

sequences = []

handle = open('sampledata.fasta', 'r')

for record in SeqIO.parse(handle, 'fasta'):

sequence = []

for nt in record.seq:

sequence.extend(nt)

sequences.append(sequence)

handle.close()

For the second part of the program I needed to count the occurrence of each nucleotide at each position and summarize the results in a profile matrix. I must admit that I had some problems with this part of the program. All those problems were finally explained by the way I had initialized the profile matrix:

profile = [[0]*len(sequences)]*4

What I didn't realize when I wrote this is that the *4 operation makes dependent copies, so while this gave me the look I wanted for my profile matrix, each time I added something to a specific position of one of the nested lists, the same was added to the corresponding position of the other three. When I finally realized this I changed the profile matrix into a numpy matrix and all problems were resolved. The following code iterates over each nucleotide in each of the sequences and counts the number of times they occur at a specific position. The results are added continually to the profile matrix.

import numpy

profile = numpy.zeros((4, len(sequences[0])), dtype=numpy.int)

for i,line in enumerate(sequences):

for j, nt in enumerate(line):

if nt == 'A':

profile[0][j] += 1

elif nt == 'C':

profile[1][j] += 1

elif nt == 'G':

profile[2][j] += 1

elif nt == 'T':

profile[3][j] += 1

For the third part the program needed to evaluate the profile matrix and create the consensus sequence (any one of them would suffice in the case that there were more than one). The following code does just that. If two nucleotides occur equally many times in a position than only the first one in the order given in the program will be added to the consensus sequence.

consensus = ''

for A,C,G,T in zip(profile[0],profile[1],profile[2],profile[3]):

if A >= C and A >= G and A >= T:

consensus += 'A'

elif C >= A and C >= G and C >= T:

consensus += 'C'

elif G >= A and G >= C and G >= T:

consensus += 'G'

elif T >= A and T >= C and T >= G:

consensus += 'T'

In the fourth and final part of the program I just needed it to print the consensus sequence and the profile matrix in the format specified in the problem description. The final part looks like this:

print(consensus)

print('A: ' + ' '.join(str(e) for e in profile[0]))

print('C: ' + ' '.join(str(e) for e in profile[1]))

print('G: ' + ' '.join(str(e) for e in profile[2]))

print('T: ' + ' '.join(str(e) for e in profile[3]))

I used the .join command to get rid of the brackets of the matrix that were otherwise printed.

Friday 24 June 2016

Finding a Motif in DNA

This time we are asked to find the positions of a given motif in a given DNA-sequence. Having just read about Seq Objects in Biopython, I had noticed a function called find, which can be used to find the position of motifs in sequences. However, this function only seems to output the position of the first occurrence of the motif. Instead I decided to go with the Biopython function motifs (have a look at chapter 14.6). Here is my code:

from Bio import motifs

from Bio.Seq import Seq

data = [line.strip('\n') for line in open('sampledata.txt')]

instances =[Seq(data[1])]

m = motifs.create(instances)

sequence = Seq(data[0])

positions = ''

for pos, seq in m.instances.search(sequence):

positions += str(pos+1) + ' '

print(positions)

The function returns all the positions of the motifs, but it seems to have a different definition of the position than Rosalind. In order to receive the correct result I had to add 1 to all the positions, as you can see in line 9. I also added a blank space in this step to achieve the correct formatting.

Translating RNA into Protein

In this task we are asked to translate a RNA-sequence into its corresponding protein sequence. To write this program 'from scratch' would probably take me a lot of time, mainly because there are 64 different codons available when translating RNA into proteins. Instead, there is an excellent function in Biopython for doing this, so I decided to save myself some time and effort and just use this instead. The following is the final code that i came up with:

from Bio.Seq import Seq

from Bio.Alphabet import generic_rna

with open('sampledata.txt','r') as f:

data = f.read()

rnaseq = Seq(data, generic_rna)

protein = rnaseq.translate()

print(protein)

I think this approach was a lot faster than if I had set to write this program from scratch. At this point I feel like I have a good grasp of the basics of Python, and I will continue working more with Biopython when applicable.

Thursday 23 June 2016

Mendel's First Law

This time we are asked to calculate the probability that the offspring of two people in a given population receives a dominant allele for a trait. The people in the population are either homozygous dominant (k), heterozygous (m), or homozygous recessive (n) for the trait.

To solve this problem I felt that the easiest way would be to derive an equation for the probability that the offspring gets a dominant allele and then make a program that makes the calculation based on this equation.

The equation must take two things in consideration. First, it needs to include the probability that each parents can have a set of alleles that is either k, m or n (that is AA, Aa or aa, where A is dominant and a is recessive). Then it also needs to consider the probability of each possible couple receiving a child with one or two dominant alleles. For example, if either of the parents is homozygous dominant, then the probability that the child will have at least one dominant allele is 1, but if both parents are homozygous recessive the probability is 0.

To set up the equation I started drawing the possible combinations of parents and the probability for each of them. Below is a sketch of this (pop = total population = k + m + n).

To receive the probability of a specific couple being randomly selected you simply multiply the probability for the first selection event with that of the second selection event. For exmple, the probability that both parents are homozygous dominant is (k/pop)((k-1)/(pop-1)).

After this step I also needed to consider the probability that the offspring actually gets the dominant allele. This is quite easily figured out using a Punnett square. Then you just multiply the probability for the couple with the probability of them being randomly selected. To get the overall probability of the offspring of a randomly selected couple having a dominant allele you just add them all together.

After simplifying the equation, this is what I ended up with:

Then it was just a matter of writing a simple program to make the calculation given k, m and n. The following is what I ended up with, including the values of k, m and n that I received from Rosalind.

k = 23

m = 26

n = 22

pop = k + m + n

prob = (4*(k*(k-1)+2*k*m+2*k*n+m*n)+3*m*(m-1))/(4*pop*(pop-1))

print(prob)

Wednesday 22 June 2016

Counting Point Mutations

In this problem we are asked to calculate the Hamming distance of two sequences. For this task, my program would need to read the file containing the two sequences and then compare each position of the sequences to see if the nucleotides were the same. Each time there is a difference the program should add 1 to the Hamming distance.

The first thing i wrote was a piece of code to load the two sequences into a list. Because the sequences are separated by a new line in the sample file, I used the command line.strip to remove this from the resulting list. To be able to compare each position of the two sequences I used the command zip(), which iterates over each element of the strings. Then it was just a simple matter of comparing the nucleotides and adding 1 to the variable called "hamming" each time they differed. The following is the final code that I came up with:

seq = [line.strip('\n') for line in open('sampledata.txt')]

hamming = 0

for nt1, nt2 in zip(seq[0],seq[1]):

if nt1 != nt2:

hamming += 1

print(hamming)

Note that I haven't added a check to make sure that the length of the two sequences are the same. For real applications this would definitely be a good idea, but with the data set given from Rosalind I didn't feel it was necessary this time.

A simpler way to write this program would be to use the already existing package Distance which includes a function for calculating the Hamming distance.

Tuesday 21 June 2016

Computing GC Content

This problem felt like a good opportunity to recap my skills in using Biopython. It asks you to calculate the GC-content of some sequences in a FASTA file and print the highest CG-content along with the corresponding sequence ID. To do this, the program first needs to parse the FASTA file, then it needs to calculate the GC-content of the strings, and finally it needs to print the largest GC-content, coupled with the correct ID.

I initially had some trouble getting Biopython to work with Python 3.4, but it turned out that I had installed it to Python 2.7, and not 3.4. When this problem was solved, I set to writing my program.

As I mentioned, the first thing I needed my program to do was to parse the FASTA file. To do this, Biopython has a function called SeqIO. I read up on it here. I then wrote this piece of code which reads the file, calculates the GC-content of the sequences in the file and prints it together with the corresponding ID.

from Bio import SeqIO

handle = open('sampledata.fasta', 'r')

for record in SeqIO.parse(handle, 'fasta'):

count = 0

totalcount = 0

print(record.id)

for nt in record.seq:

totalcount = totalcount + 1

if nt == 'G' or nt == 'C':

count = count + 1

percent = count/totalcount*100

print(percent)

handle.close()

This worked fine, but what I needed to do now was to somehow save the highest GC-content, coupled with its ID, and print only that at the end of the program. I thought this was going to be tricky, but it turned out to be quite simple. In the following code, which is the final version of the program, I simply added an if-statement to overwrite the variables 'GC' and 'ID' every time a higher GC-content is found.

from Bio import SeqIO

GC = 0

handle = open('sampledata.fasta', 'r')

for record in SeqIO.parse(handle, 'fasta'):

count = 0

totalcount = 0

for nt in record.seq:

totalcount = totalcount + 1

if nt == 'G' or nt == 'C':

count = count + 1

percent = count/totalcount*100

if percent > GC:

GC = percent

ID = record.id

print(ID)

print(GC)

handle.close()

Of course, I could have just used the built-in function GC in Biopython instead, but where is the fun in that?

Thursday 2 June 2016

Rabbits and Recurrence Relations

The fourth Rosalind problem is a bit different than the previous ones. This time we are asked to calculate how many rabbit pairs there are after n months if every grown pair produces k new pairs each month and it takes one month for a new pair to become reproductive. The problem describes the Fibonacci sequence and how it can be applied to the simplest version of the problem, the case when k=1. It's been a while since I worked with mathematical sequences, so this was a fun way to dust of things I learned during my bachelor's maths courses.

The Fibonacci sequence works on this problem when k=1 and its recurrence relation is written as follows:

Fn = Fn-1 + Fn-2

The first thing I needed to do was to figure out the recurrence relation for any value on k. After doodling a bit and playing around with the numbers I figured out that this must be it:

Fn = Fn-1 + Fn-2 * k

Once I knew this I wrote the following program:

n = 5

k = 3

big = 1

small = 1

for months in range(1,n-1):

bigger = big + small*k

small = big

big = bigger

print(big)

I named Fn-1 and Fn-2 big and small so that I wouldn't confuse myself which was the larger number and which was the smaller of the two. Since the problem states that F1 = 1 the program is set to only iterate n-1 times to reach the answer for Fn. The first expression in the for-loop calculates Fn. The other two terms update Fn-1 and Fn-2 in preparation of the next iteration.

I really liked this problem, mainly because it felt a bit more challenging than the previous three and because I got to practice some maths skills that I haven't used in a while. It took me a bit longer than the previous ones, about an hour, but most of that time I spent working on the recurrence relation and when I got to the coding part I managed to write the program pretty quickly and with little fuss.

Wednesday 1 June 2016

Complementing a Strand of DNA

In the third Rosalind problem we are asked to return the reverse complement of a DNA strand. As with the previous problems I started the program by defining an empty string and opening the file containing the data. The program then appends the complement of each nucleotide to the empty string that was defined in the beginning.

So far the problem was quite simple but then it came to reversing the string. I haven't reversed a sting before so i googled it and found a thread about it at Stack Overflow. There seems to be many different ways in which to do this, but the most popular one was to use extended slices as this is a very fast way to do this. This solution resulted in the following code:

trans = ''

with open('sampledata.txt') as f:

for line in f:

for nt in line:

if nt == 'A':

trans+='T'

elif nt == 'C':

trans+='G'

elif nt == 'T':

trans+='A'

elif nt == 'G':

trans+='C'

reverse = trans[::-1]

with open('compdna.txt','w') as answer:

answer.write(reverse)

Transcribing DNA into RNA

In Rosalind problem 2 you are asked to transcribe a DNA sequence into an RNA sequence, i.e. changing the T's into U's. As with the first problem, this one was also pretty straightforward and I managed to solve it quickly.

Like the first program, I wanted this one to open a file and read each letter of each row individually, so this part of the program became pretty much the same as the previous program. I then wanted the program to change all the T's into U's and print the new string so I wrote the following code:

with open('sampledata2.txt') as f:

for line in f:

for nt in line:

if nt == 'T':

print('U')

else:

print(nt)

Before I even tried this code I knew it would not give me the formatting I wanted of the results, and upon trying it what I got was the correct letters out but only one letter per new row. So now I knew that the exchange of T to U was working, I just needed to append the letters to the same string and preferably print them to a file (mainly to facilitate uploading the answer to Rosalind).

To append each letter to a new string I started with defining an empty string called rna in the beginning of the program. Then, as the program iterates over each letter, it appends the letter to the string rna using the command += . When the iteration was done I wanted to save the string to a file that I could upload to Rosalind and to do this I used with open. Below is the final code:

rna = ''

with open('sampledata2.txt') as f:

for line in f:

for nt in line:

if nt == 'T':

rna+='U'

else:

rna+=nt

with open('newrna.txt','w') as answer:

answer.write(rna)

Each time I run this program the file newrna.txt gets overwritten with the new result. This is the result of using 'w' and what I wanted for this program. If I instead wanted to append the result to the existing file I could use 'a'.