Tuesday 28 June 2016

Consensus and Profile

For this problem we are asked to find a sequence that is as similar to the given set of sequences as possible (the consensus sequence). To complete the task the problem states that we should make a profile matrix containing the number of times each nucleotide (A, T, G or C) occurs at each position of the sequences. We should then be able to build the consensus sequence from this.

I started by breaking down the problem into four different parts. First off, I wanted the program to parse the FASTA-file and add the sequences to a nested list. This way I could then easily iterate over the sequences in the list in order to count the occurrence of each nucleotide in each position. The following code opens the FASTA-file and adds each sequence to a separate list inside the list sequences

from Bio import SeqIO                      
sequences = []                             
handle = open('sampledata.fasta', 'r')     
for record in SeqIO.parse(handle, 'fasta'):
     sequence = []                         
     for nt in record.seq:                 
          sequence.extend(nt)              
     sequences.append(sequence)            
handle.close()                             

For the second part of the program I needed to count the occurrence of each nucleotide at each position and summarize the results in a profile matrix. I must admit that I had some problems with this part of the program. All those problems were finally explained by the way I had initialized the profile matrix:

profile = [[0]*len(sequences)]*4

What I didn't realize when I wrote this is that the *4 operation makes dependent copies, so while this gave me the look I wanted for my profile matrix, each time I added something to a specific position of one of the nested lists, the same was added to the corresponding position of the other three. When I finally realized this I changed the profile matrix into a numpy matrix and all problems were resolved. The following code iterates over each nucleotide in each of the sequences and counts the number of times they occur at a specific position. The results are added continually to the profile matrix.

import numpy                                                  
profile = numpy.zeros((4, len(sequences[0])), dtype=numpy.int)
for i,line in enumerate(sequences):                           
     for j, nt in enumerate(line):                            
          if nt == 'A':                                       
               profile[0][j] += 1                             
          elif nt == 'C':                                     
               profile[1][j] += 1                             
          elif nt == 'G':                                     
               profile[2][j] += 1                             
          elif nt == 'T':                                     
               profile[3][j] += 1                             

For the third part the program needed to evaluate the profile matrix and create the consensus sequence (any one of them would suffice in the case that there were more than one). The following code does just that. If two nucleotides occur equally many times in a position than only the first one in the order given in the program will be added to the consensus sequence.

consensus = ''                                                  
for A,C,G,T in zip(profile[0],profile[1],profile[2],profile[3]):
     if A >= C and A >= G and A >= T:                           
          consensus += 'A'                                      
     elif C >= A and C >= G and C >= T:                         
          consensus += 'C'                                      
     elif G >= A and G >= C and G >= T:                         
          consensus += 'G'                                      
     elif T >= A and T >= C and T >= G:                         
          consensus += 'T'                                      

In the fourth and final part of the program I just needed it to print the consensus sequence and the profile matrix in the format specified in the problem description. The final part looks like this:

print(consensus)                                   
print('A: ' + ' '.join(str(e) for e in profile[0]))
print('C: ' + ' '.join(str(e) for e in profile[1]))
print('G: ' + ' '.join(str(e) for e in profile[2]))
print('T: ' + ' '.join(str(e) for e in profile[3]))

I used the .join command to get rid of the brackets of the matrix that were otherwise printed.

No comments:

Post a Comment