A learner's notebook.

April 25, 2014

Bioinformatics Algorithms

Finally, I have completed the Bioinformatics Algorithms (Part-1) course on Coursera. I enrolled in this course in November 2013, but I could not finish it then because of the workload both in daily life and in the course. The course design is favorable for a newcomer to bioinformatics. The instructors described important problems in genomics and gave the intuition behind the solutions. They explained the solutions step by step so that a student can learn how a bioinformatician should approach a problem. They stated the most naive solution, then identified its drawbacks, and improved it. They also showed how biology-informed models can improve the solutions.

Another good thing is that the course stated some of the open problems in bioinformatics at the end of each chapter. Those problems are hard, yet they give a glimpse of the difficulty level of current research topics in bioinformatics. After completing this course, I am now confident that I know at least a little about bioinformatics. I realized that I have to study hard, especially biology, to become competent in dealing with biological problems. I realized once again that the key to success in this area is the ability to connect biology with computing.

The assignments are a vital part of the course. They forced the students to DO the job, instead of just knowing it. I solved each assignment using either R or Python. I started with R, as I was familiar with it. Later, I found that R is slow, particularly if you cannot avoid looping (using for). I often struggled to find a solution within the 5-minute limit given for a problem. Then I started learning and using Python. To my surprise, I found that Python is pretty fast and its syntax is quite powerful. I especially liked list comprehensions and lambda expressions in Python.
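The two Python features mentioned above can be illustrated with a minimal sketch, assuming a toy task of finding the most frequent k-mers in a DNA string. The function name `most_frequent_kmers` and the example sequence are illustrative only and are not taken from the course material or the repository.

```python
from collections import Counter


def most_frequent_kmers(text, k):
    """Return the most frequent k-mers in text, sorted alphabetically."""
    # List comprehension: every substring of length k, in one expression.
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    counts = Counter(kmers)
    if not counts:
        # The text is shorter than k, so there are no k-mers at all.
        return []
    # Lambda expression: rank the (k-mer, count) pairs by their count.
    _, top = max(counts.items(), key=lambda pair: pair[1])
    return sorted(kmer for kmer, n in counts.items() if n == top)


# Hypothetical example input.
print(most_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))
# prints ['CATG', 'GCAT']
```

The comprehension replaces an explicit for loop over positions, which is the kind of loop that tends to be slow when written directly in R.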

I also started using GitHub to store my code on a regular basis. Although I had a GitHub account before, I seldom used it. This time, I stored all the solutions on GitHub. Here is the repository.

April 2, 2014

Short read sequencing or Long read sequencing?

Genome sequencing is a hot topic these days. Currently, the popular method of sequencing is to generate millions of short reads, typically 50 to 150 nucleotides long, and then assemble the reads computationally. Illumina, which has almost a monopoly in the sequencing business, follows this strategy. However, this strategy has some drawbacks. For example, it reads the genome from multiple cells, and the biological signals in those cells are averaged to generate a consensus sequence. Consequently, it cannot identify the molecular-level biological differences. Moreover, this strategy does not work well with repetitive sequences or heterozygous sequences.

In contrast, long reads can be used for sequencing. These reads can be 100 times longer than short reads. Thus long reads carry fundamentally more information than short ones. Long reads can help map reads uniquely in complex regions, including repetitive elements. However, long reads currently suffer from an elevated error rate, about 15%. That means roughly one in every 6 or 7 bases is incorrect. Due to this limitation, long reads alone are not yet suitable for sequencing. However, a combination of short and long reads can perform substantially better than either method alone.
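The error-rate arithmetic above can be checked with a short sketch. It assumes that errors are independent and spread uniformly along a read, which is a simplification (real error profiles are not uniform). The function names are hypothetical, and the 50-base window is just an example size.

```python
def bases_per_error(error_rate):
    """Average number of bases per incorrect base for a per-base error rate."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return 1 / error_rate


def prob_error_free(error_rate, window):
    """Chance that one fixed window of `window` bases has no error at all.

    Assumes independent errors at every position (a simplification).
    """
    return (1 - error_rate) ** window


rate = 0.15  # the approximate long-read error rate quoted in the text
print(f"one error every {bases_per_error(rate):.1f} bases on average")
print(f"P(a given 50-base window is error-free) = {prob_error_free(rate, 50):.2e}")
```

At a 15% rate the average spacing comes out between 6 and 7 bases, and a fixed 50-base window is almost never error-free under this simple model, which is why error correction matters so much for long reads.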

Pacific Biosciences, a biotechnology company, focuses on long reads. They are trying to improve the error correction algorithm so that sequencing can be performed from long reads alone, without using short reads. That would be a great achievement, as it would reduce the cost and also enable identification of heterozygous and repetitive elements. Thus, we may expect that the monopoly of Illumina will be reduced.

Another biotech company, Oxford Nanopore, is also in the race. They use a different technology: the characteristic change in conductance when single-stranded DNA passes through or near a nanopore, a small hole on the order of 1 nanometer in internal diameter. This strategy also produces long reads from a single cell. Although this approach suffers from a high error rate, it has been shown in an experiment that more than 80% of the reads had perfect 50-nucleotide sections. This is impressive. If a proper error correction algorithm can be devised, Oxford Nanopore can beat the dominance of Pacific Biosciences in the long-read field.

To make the scenario more interesting, GenapSys, another biotech company, aims to develop a small all-electronic instrument, like an iPad, that will perform all the sequencing steps and thus reduce the sequencing time and cost.

Let's see which technology (or company) dominates the rest.

References:
1) Greenleaf,W.J. and Sidow,A. (2014) The future of sequencing: convergence of intelligent design and market Darwinism. Genome Biol., 15, 303.
2) http://www.genengnews.com/insight-and-intelligenceand153/the-long-and-the-short-of-dna-sequencing/77899725/
3) http://www.fluidigm.com/december-31-2013.html
4) http://allseq.com/knowledgebank/emerging-technologies/genapsys

April 1, 2014

Sequencing GWAS

A nice article on the difference between GWAS with SNPs and GWAS with sequencing.
http://massgenomics.org/2014/03/gwas-sequencing-realities.html