I initially had some trouble getting Biopython to work with Python 3.4, but it turned out that I had installed it to Python 2.7, and not 3.4. When this problem was solved, I set to writing my program.
As I mentioned, the first thing I needed my program to do was to parse the FASTA file. To do this, Biopython has a function called SeqIO. I read up on it here. I then wrote this piece of code which reads the file, calculates the GC-content of the sequences in the file and prints it together with the corresponding ID.
from Bio import SeqIO
handle = open('sampledata.fasta', 'r')
for record in SeqIO.parse(handle, 'fasta'):
count = 0
totalcount = 0
print(record.id)
for nt in record.seq:
totalcount = totalcount + 1
if nt == 'G' or nt == 'C':
count = count + 1
percent = count/totalcount*100
print(percent)
handle.close()
This worked fine, but what I needed to do now was to somehow save the highest GC-content, coupled with its ID, and print only that at the end of the program. I thought this was going to be tricky, but it turned out to be quite simple. In the following code, which is the final version of the program, I simply added an if-statement to overwrite the variables 'GC' and 'ID' every time a higher GC-content is found.
from Bio import SeqIO
GC = 0
handle = open('sampledata.fasta', 'r')
for record in SeqIO.parse(handle, 'fasta'):
count = 0
totalcount = 0
for nt in record.seq:
totalcount = totalcount + 1
if nt == 'G' or nt == 'C':
count = count + 1
percent = count/totalcount*100
if percent > GC:
GC = percent
ID = record.id
print(ID)
print(GC)
handle.close()
Of course, I could have just used the built-in function GC in Biopython instead, but where is the fun in that?
No comments:
Post a Comment