Sara does Bioinformatics: Computing GC Content

This problem felt like a good opportunity to recap my skills in using Biopython. It asks you to calculate the GC-content of some sequences in a FASTA file and print the highest GC-content along with the corresponding sequence ID. To do this, the program needs to do three things: parse the FASTA file, calculate the GC-content of each sequence, and print the largest GC-content together with the correct ID.
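To make the middle step concrete: GC-content is the number of G and C bases divided by the length of the sequence, times 100. Here is a minimal sketch in plain Python. The name gc_percent, the handling of lowercase bases and the choice to return 0.0 for an empty sequence are assumptions for illustration, not part of the problem statement.

```python
def gc_percent(seq):
    """Return the percentage of G and C bases in seq (hypothetical helper)."""
    seq = seq.upper()  # assumption: treat lowercase bases like uppercase ones
    if not seq:
        return 0.0  # assumption: an empty sequence counts as 0% instead of raising an error
    return (seq.count("G") + seq.count("C")) / len(seq) * 100

print(gc_percent("AGCTATAG"))  # 3 of the 8 bases are G or C, so this prints 37.5
```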

I initially had some trouble getting Biopython to work with Python 3.4, but it turned out that I had installed it for Python 2.7 and not for 3.4. Once that was sorted out, I set about writing my program.
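A quick way to spot this kind of mix-up is to ask the interpreter which executable is running and whether it can find the package. This is a small sketch using only the standard library; has_module is a hypothetical helper name, and it only checks top-level module names.

```python
import importlib.util
import sys

def has_module(name):
    """Return True if this interpreter can locate a top-level module called name."""
    return importlib.util.find_spec(name) is not None

print(sys.executable)        # path of the Python interpreter that is actually running
print(sys.version_info[:2])  # major and minor version, for example (3, 4)
print(has_module("Bio"))     # True only if Biopython is installed for this interpreter
```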

As I mentioned, the first thing I needed my program to do was to parse the FASTA file. For this, Biopython has a module called SeqIO. I read up on it here. I then wrote this piece of code, which reads the file, calculates the GC-content of each sequence in it and prints that together with the corresponding ID.

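from Bio import SeqIO

handle = open('sampledata.fasta', 'r')

# Go through every record in the FASTA file
for record in SeqIO.parse(handle, 'fasta'):
    count = 0       # number of G and C nucleotides
    totalcount = 0  # total number of nucleotides
    print(record.id)
    for nt in record.seq:
        totalcount = totalcount + 1
        if nt == 'G' or nt == 'C':
            count = count + 1
    # GC-content as a percentage of the sequence length
    percent = count/totalcount*100
    print(percent)

handle.close()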
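For anyone curious what the parsing step involves, here is a rough sketch of reading FASTA records without Biopython. It is not how SeqIO is implemented, only an illustration under simple assumptions: header lines start with '>', the ID is the first whitespace-separated word of the header, and all sequence lines of a record are joined together. The name read_fasta and the sample records are made up.

```python
import io

def read_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header ends the previous record, if there was one
            if record_id is not None:
                yield record_id, "".join(chunks)
            words = line[1:].split()
            # assumption: the ID is the first word after '>'
            record_id, chunks = (words[0] if words else ""), []
        elif record_id is not None:
            chunks.append(line)
    # do not forget the last record in the file
    if record_id is not None:
        yield record_id, "".join(chunks)

# Invented sample data standing in for a file handle
sample = io.StringIO(">seq1 example\nAGCT\nATAG\n>seq2\nGGCC\n")
for record_id, sequence in read_fasta(sample):
    print(record_id, sequence)
```

Because the function accepts any iterable of lines, an open file handle could be passed in the same way as the StringIO object here.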

This worked fine, but what I needed to do now was to somehow save the highest GC-content, coupled with its ID, and print only that at the end of the program. I thought this was going to be tricky, but it turned out to be quite simple. In the following code, which is the final version of the program, I simply added an if-statement to overwrite the variables 'GC' and 'ID' every time a higher GC-content is found.

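from Bio import SeqIO

GC = 0     # highest GC-content found so far
ID = None  # ID of that sequence; stays None if no GC-content is above 0

handle = open('sampledata.fasta', 'r')

for record in SeqIO.parse(handle, 'fasta'):
    count = 0       # number of G and C nucleotides
    totalcount = 0  # total number of nucleotides
    for nt in record.seq:
        totalcount = totalcount + 1
        if nt == 'G' or nt == 'C':
            count = count + 1
    percent = count/totalcount*100
    # Overwrite the stored values whenever a higher GC-content is found
    if percent > GC:
        GC = percent
        ID = record.id

print(ID)
print(GC)

handle.close()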
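The same "keep the best so far" idea can also be written with Python's built-in max, given a list of (ID, GC-content) pairs. This is only a sketch of an alternative, and the IDs and percentages below are invented for illustration, not results from the sample data.

```python
# Hypothetical (ID, GC-content in percent) pairs, invented for illustration
results = [("seq1", 37.5), ("seq2", 62.5), ("seq3", 50.0)]

# max() compares the pairs by their second element via the key function
best_id, best_gc = max(results, key=lambda pair: pair[1])

print(best_id)
print(best_gc)
```

If two sequences tie, max returns the first one it encounters, which matches the strict '>' comparison in the loop version. Unlike the loop, it raises a ValueError when the list is empty.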

Of course, I could have just used Biopython's built-in GC function (in Bio.SeqUtils) instead, but where is the fun in that?

Tuesday, 21 June 2016
