Tuesday 21 June 2016

Computing GC Content

This problem felt like a good opportunity to recap my skills in using Biopython. It asks you to calculate the GC-content of some sequences in a FASTA file and print the highest CG-content along with the corresponding sequence ID. To do this, the program first needs to parse the FASTA file, then it needs to calculate the GC-content of the strings, and finally it needs to print the largest GC-content, coupled with the correct ID.

I initially had some trouble getting Biopython to work with Python 3.4, but it turned out that I had installed it to Python 2.7, and not 3.4. When this problem was solved, I set to writing my program.

As I mentioned, the first thing I needed my program to do was to parse the FASTA file. To do this, Biopython has a function called SeqIO. I read up on it here. I then wrote this piece of code which reads the file, calculates the GC-content of the sequences in the file and prints it together with the corresponding ID.


from Bio import SeqIO                      
handle = open('sampledata.fasta', 'r')     
for record in SeqIO.parse(handle, 'fasta'):
    count = 0                              
    totalcount = 0                         
    print(record.id)                       
    for nt in record.seq:                  
        totalcount = totalcount + 1        
        if nt == 'G' or nt == 'C':         
            count = count + 1              
    percent = count/totalcount*100         
    print(percent)                         
handle.close()                             

This worked fine, but what I needed to do now was to somehow save the highest GC-content, coupled with its ID, and print only that at the end of the program. I thought this was going to be tricky, but it turned out to be quite simple. In the following code, which is the final version of the program, I simply added an if-statement to overwrite the variables 'GC' and 'ID' every time a higher GC-content is found.

from Bio import SeqIO                      
GC = 0                                     
handle = open('sampledata.fasta', 'r')     
for record in SeqIO.parse(handle, 'fasta'):
    count = 0                              
    totalcount = 0                         
    for nt in record.seq:                  
        totalcount = totalcount + 1        
        if nt == 'G' or nt == 'C':         
            count = count + 1              
    percent = count/totalcount*100         
    if percent > GC:                       
        GC = percent                       
        ID = record.id                     
print(ID)                                  
print(GC)                                  
handle.close()                             

Of course, I could have just used the built-in function GC in Biopython instead, but where is the fun in that?


No comments:

Post a Comment