Tuesday 31 May 2016

Counting DNA Nucleotides

I have been brushing up a bit on my python skills, just playing around a little and writing programs like "guess the number" where the program generates a random number between 1 and 100 and the user gets to guess it with the help of hints whether the guess is too high or too low. It all came back to me pretty quickly so I decided I wanted to try doing something a bit more productive and bioinformatics related. That's when I found Rosalind. This is a site that contains lots of different bioinformatics problems that you are supposed to solve using your programming skills. I have only just scratched the surface of this site, so I'm not familiar with all the features yet, but what I have seen so far is great! I have completed the first two problems and will describe the first one in more detail below. If anyone feels the urge too look me up at the site my username is SaraS.

Problem 1 - Counting DNA Nucleotides

This problem was fairly straightforward. You are to count how many of each nucleotide a given DNA sequence contains. I managed to solve it quickly and the solution was accepted at my first attempt with a real dataset.

The first thing I needed my program to do was to read the input file containing the sequence. I remembered that in the section on code style at the Hitchhiker's Guide to Python they recommend using with open as this automatically closes the file for you. When the program had opened the file, I needed it to read each letter of the string in the file separately and then count how many of each nucleotide the string contained. The program was then to print these four numbers with a blank space in between. The following is the code I came up with:

   Acount = 0                            
   Ccount = 0                            
   Gcount = 0                            
   Tcount = 0                            
   with open('sampledata2.txt') as f:    
       for line in f:                    
           for nt in line:               
               if nt == 'A':             
                   Acount = Acount + 1   
               if nt == 'C':             
                   Ccount = Ccount + 1   
               if nt == 'G':             
                   Gcount = Gcount + 1   
               if nt == 'T':             
                   Tcount = Tcount + 1   
   print(Acount, Ccount, Gcount, Tcount) 

In the first section of the code I define my variables that I then later use to count each of the different nucleotides. The program opens the file named sampledata2.txt, which is the file containing the actual dataset from Rosalind. It then iterates first over each line in the file and then over each letter in the lines. For each letter (named nt in my program) it then checks it the letter is an A and if that's the case it adds 1 to the variable Acount. If the letter is not A it checks if it's a C, and so on. When the iteration is done the program prints the value of the counter variables with a space between them. 

Looking at the program once more, I think it might become more efficient if I changed the if statements for C and G to elif and the if statement for T to else. This doesn't feel necessary for the small datasets given for the problem, but it might be worth remembering in larger projects in the future.

Another improvement that could be made to the program is to wright the answer to a file. This didn't feel that necessary for such a simple answer, but I chose to do this in problem 2, which I will describe in my next post.

If you have any questions about what you have just read, or have any suggestions or tips on how I can improve my code, please don't hesitate to leave a comment below!


No comments:

Post a Comment