Tuesday 31 May 2016

Counting DNA Nucleotides

I have been brushing up a bit on my Python skills, just playing around a little and writing programs like "guess the number", where the program generates a random number between 1 and 100 and the user gets to guess it with the help of hints on whether the guess is too high or too low. It all came back to me pretty quickly, so I decided I wanted to try doing something a bit more productive and bioinformatics-related. That's when I found Rosalind. This is a site that contains lots of different bioinformatics problems that you are supposed to solve using your programming skills. I have only just scratched the surface of this site, so I'm not familiar with all the features yet, but what I have seen so far is great! I have completed the first two problems and will describe the first one in more detail below. If anyone feels the urge to look me up on the site, my username is SaraS.

Problem 1 - Counting DNA Nucleotides

This problem was fairly straightforward. You are to count how many of each nucleotide a given DNA sequence contains. I managed to solve it quickly and the solution was accepted on my first attempt with a real dataset.

The first thing I needed my program to do was to read the input file containing the sequence. I remembered that in the section on code style at the Hitchhiker's Guide to Python they recommend using with open as this automatically closes the file for you. When the program had opened the file, I needed it to read each letter of the string in the file separately and then count how many of each nucleotide the string contained. The program was then to print these four numbers with a blank space in between. The following is the code I came up with:

   Acount = 0                            
   Ccount = 0                            
   Gcount = 0                            
   Tcount = 0                            
   with open('sampledata2.txt') as f:    
       for line in f:                    
           for nt in line:               
               if nt == 'A':             
                   Acount = Acount + 1   
               if nt == 'C':             
                   Ccount = Ccount + 1   
               if nt == 'G':             
                   Gcount = Gcount + 1   
               if nt == 'T':             
                   Tcount = Tcount + 1   
   print(Acount, Ccount, Gcount, Tcount) 

In the first section of the code I define the variables that I later use to count each of the different nucleotides. The program opens the file named sampledata2.txt, which is the file containing the actual dataset from Rosalind. It then iterates first over each line in the file and then over each letter in each line. For each letter (named nt in my program) it checks if the letter is an A and, if that's the case, adds 1 to the variable Acount. It then checks if the letter is a C, and so on; because these are separate if statements, every letter is compared against all four nucleotides. When the iteration is done the program prints the values of the counter variables with a space between them.

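Looking at the program once more, I think it might become slightly more efficient if I changed the if statements for C, G and T to elif, so that the program stops checking a letter as soon as it has found a match. The check for T should not become a plain else, since that would also count the newline character at the end of each line as a T. This doesn't feel necessary for the small datasets given for the problem, but it might be worth remembering in larger projects in the future.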
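As a rough sketch of that idea, the chained version could look something like the code below. It is only an illustration: the function name count_nucleotides and the lower-case variable names are made up for this example, and it works on a string rather than reading the file. The T branch is an elif rather than an else, so that newline characters or unexpected letters are simply ignored.

```python
def count_nucleotides(sequence):
    """Return the number of A, C, G and T characters in sequence as a tuple."""
    a_count = c_count = g_count = t_count = 0
    for nt in sequence:
        # Only one branch can match, so the remaining checks are skipped
        # as soon as a match is found.
        if nt == 'A':
            a_count += 1
        elif nt == 'C':
            c_count += 1
        elif nt == 'G':
            g_count += 1
        elif nt == 'T':
            t_count += 1
        # Anything else (for example the newline at the end of a line)
        # is not counted.
    return a_count, c_count, g_count, t_count


# Print the four counts separated by spaces, as in the original program.
print(*count_nucleotides('AGCTTTTCATTCTGACTGCA\n'))
```

For a whole file, the same function could be called on each line and the results added together.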

Another improvement that could be made to the program is to write the answer to a file. This didn't feel that necessary for such a simple answer, but I chose to do this in problem 2, which I will describe in my next post.
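A minimal sketch of what writing the answer to a file might look like is shown below. This is not the code from problem 2; the function name write_counts and the output file name used afterwards are assumptions made up for this example.

```python
def write_counts(counts, path):
    """Write the counts to the file at path as one space-separated line."""
    # 'w' creates the file, or overwrites it if it already exists.
    # As with reading, "with open" closes the file automatically.
    with open(path, 'w') as out:
        out.write(' '.join(str(n) for n in counts) + '\n')
```

Calling write_counts((Acount, Ccount, Gcount, Tcount), 'answer.txt') at the end of the program above would then store the four numbers on a single line in a file named answer.txt.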

If you have any questions about what you have just read, or have any suggestions or tips on how I can improve my code, please don't hesitate to leave a comment below!


Friday 27 May 2016

Good Programming Practices

One thing that I felt was lacking in the courses in bioinformatics that I attended was how to write clean and easily readable code. Although the course "Applied Bioinformatics" touched upon the subject, I feel like it would be a good idea for me to read up a bit on it. Whilst searching for somewhere to do this I found "The Hitchhiker's Guide to Python" which is a site describing best practices for many different aspects of Python, from installation and configuration to daily usage.

I started with reading through the part regarding which version of Python to use, having a look at some of the links as well, and I concluded that the best thing for me would be to use Python3 rather than 2.7. I reached this conclusion mainly because although I first learnt Python in 2.7, I don't feel like I have a great attachment to any particular version, and I'd rather work with the version that is part of the future rather than the past.

I then moved on to the part called "Writing Great Python Code". Although a lot of this section was quite technical and perhaps more directed to people with a background in computer science, I found much of the information really helpful, especially the part on code style. I think the main thing I should focus on as a novice should be to keep my code as simple and readable as I can, and this site gave a lot of good examples on how to do that.

The next step for me will be to start writing some code. I'll start with trying to solve some simple tasks while keeping what I have learned about writing good code in mind and then I'll move on to working on more advanced problems.

Wednesday 25 May 2016

NCBI

One of the most fundamental parts of bioinformatics is knowing how to search for and access biological data in the NCBI databases. My first encounter with these databases was in 2006 during a biology course in secondary school. Since then some basic usage of the site has been included in a couple of different courses during my bachelor's and master's education. Because it has been a while since I last used the NCBI databases I wanted to refresh my knowledge and also learn some new things. To do this I started to look for some tutorials.

The first tutorial I found was linked from bioinformatics.org. This tutorial was last revised in 2002, so many of the directions were a bit off and some functions have been removed or replaced, but overall the tasks that are given make for a good introduction to the site. Having worked a bit with the databases before, I worked through this tutorial pretty quickly, but it was a nice way to refresh my knowledge.

When I was finished with the first tutorial I felt like doing something a bit more advanced. I found these exercises from an advanced workshop for bioinformatics information specialists. The exercises are based on questions frequently asked of the NCBI and include strategies and step-by-step instructions on how to solve the given problems. I found many of the tasks fairly basic and easy to carry out, but I liked that often more than one method to solve the exercise was described and that there were a lot of tips and tricks. Although I felt like I already knew many of the things included in the exercises, there were some new functions and some databases that I hadn't tried before, such as dbSNP. Overall this site is a good resource to come back to for answers in the future.

Apart from the two tutorials mentioned above, I have also looked at some of the video tutorials available on the NCBI website. There are also a ton of online tutorials and exercises from different universities around the world, but for now I think I'll have a look at brushing up on my Python programming skills.


Monday 23 May 2016

Welcome!

This is a blog about me learning new things in bioinformatics, as well as keeping my knowledge and skills in the field fresh and up to date. I have been studying biotechnology at the Royal Institute of Technology in Stockholm, Sweden, for five years and I have obtained a Bachelor of Science in biotechnology and a Master of Science in medical biotechnology. The education has to a large extent been focused on the engineering aspects of biotechnology and on innovation. It has also included many different scientific areas other than biotechnology and I have had the opportunity to take several courses in maths, physics, chemistry, computer science, philosophy and industrial economy.

During my master's studies I took the following three courses in bioinformatics:
  • Bioinformatics and Biostatistics - an introduction to bioinformatics and biostatistics. The course consisted of lectures and computer-based exercises and included some basic concepts of statistics and probability, database searching, using R for statistical analysis and visualisation, and the theory behind some fundamental bioinformatics analysis methods.
  • Applied Bioinformatics - an introduction to bioinformatics and programming. The course consisted of lectures and computer-based exercises and included working in a UNIX environment, programming using Python and some basic SQL.
  • Analysis of Data from High-Throughput Molecular Biology Experiments - an advanced course in bioinformatics. The course consisted of lectures, computer-based exercises and a larger project performed in teams of three students. The focus was on analysis of biological data, such as RNA-seq data, from large-scale experiments.
These courses were so much fun and they really opened my eyes to bioinformatics. I graduated in 2015 and recently moved to Cambridge, UK. I'm now spending my days learning more about bioinformatics and looking for jobs in the area. 

In the future I will be posting my progress here, but in the meantime you might want to check out my previous bioinformatics blog that I wrote for one of the courses (bearing in mind that it's from 2014...) or my LinkedIn profile.

Have a nice day!