April 28, 2014

Deciphering Nature's Alphabet

A few days ago, GenomeTV uploaded an inspiring documentary on major discoveries in molecular biology. Each of the five parts tells the stories behind a set of major discoveries. The scientists themselves tell their own history, which is amazing. Here is the playlist on YouTube: http://www.youtube.com/playlist?list=PL1ay9ko4A8smA2OjHMSeJPqrhJwYaH-Aa.

April 25, 2014

Increase the readability of your writing

I came across an online application, Hemingway, for checking scientific writing. It highlights sentences that are hard to understand and suggests where they can be improved. The suggestions are, in fact, quite simple. Yet we often do not get these things right. Here is the link: http://www.hemingwayapp.com/.

I found that this application does not flag grammatical errors. Yet it does provide good suggestions for improving readability. I tested the readability of my last blog post, and then made the necessary modifications to address the issues it highlighted. I am satisfied with the quality of the service.

Thanks to Stephen Turner for suggesting this tool.

Bioinformatics Algorithms

Finally, I have completed the Bioinformatics Algorithms (Part 1) course on Coursera. I enrolled in this course in November 2013, but I could not finish it at the time because of the workload, both in daily life and in the course. The course design is favorable for a newcomer to bioinformatics. The instructors described important problems in genomics and gave the intuition behind the solutions. They explained the solutions step by step, so that a student can learn how a bioinformatician should approach a problem. They stated the most naive solution, then identified its drawbacks, and improved it. They also showed how biology-informed models can improve the solutions.

A good thing to note is that the course stated some of the open problems in bioinformatics at the end of each chapter. Those problems are hard. Yet they give a glimpse of the difficulty level of current research topics in bioinformatics. After completing this course, I am now confident that I know at least a little about bioinformatics. I realized that I have to study hard, especially biology, to become competent in dealing with biological problems. I realized once again that the key to success in this area is the ability to connect biology with computing.

The assignments are a vital part of the course. These assignments forced the students to DO the job, instead of just knowing it. I solved each assignment using either R or Python. I started with R, as I was familiar with it. Later, I found that R is slow, particularly if you cannot avoid looping (using for). I often struggled to find a solution within the 5-minute limit given for a problem. Then I started learning and using Python. To my surprise, I found that Python is pretty fast and its syntax is quite powerful. I especially liked list comprehensions and lambda expressions in Python.
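
As a small illustration of the two Python features mentioned above, here is a minimal sketch of a k-mer counting task of the kind the course covers. The function names are hypothetical, chosen for this example; this is not code from the course or from my repository.

```python
from collections import Counter


def kmers(text, k):
    """List every substring of length k (list comprehension instead of a for loop)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def most_frequent_kmers(text, k):
    """Return the k-mers with the highest count, in alphabetical order."""
    counts = Counter(kmers(text, k))
    # Lambda as a sort key: highest count first, ties broken alphabetically.
    ranked = sorted(counts, key=lambda kmer: (-counts[kmer], kmer))
    top = counts[ranked[0]]
    return [kmer for kmer in ranked if counts[kmer] == top]


def reverse_complement(text):
    """Reverse complement of a DNA string, using a generator expression."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(text))


if __name__ == "__main__":
    print(kmers("ACGT", 2))
    print(most_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))
    print(reverse_complement("AAAACCCGGT"))
```

The comprehension replaces an explicit index loop, which is the kind of loop that felt slow to me in R.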

I also started using GitHub to store my code on a regular basis. Although I had a GitHub account before, I seldom used it. This time, I stored all the solutions on GitHub. Here is the repository.



April 2, 2014

Short read sequencing or Long read sequencing?

Genome sequencing is a hot topic these days. Currently, the popular method of sequencing is to generate millions of short reads, typically 50 to 150 nucleotides long, and then assemble the reads computationally. Illumina, which has almost a monopoly in the sequencing business, follows this strategy. However, this strategy has some drawbacks. For example, it reads the genome from multiple cells, and the biological signals in those cells are averaged to generate a consensus sequence. Consequently, it cannot identify molecular-level biological differences. Moreover, this strategy does not work well with repetitive sequences or heterozygous sequences.
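
To see why repetitive sequences are a problem for short reads, here is a toy sketch. The genome, the repeat, and the reads are made-up strings, and exact string matching stands in for a real read mapper. A read that falls entirely inside a repeat matches more than one location, while a read that also covers the flanking bases matches only one.

```python
def find_all(genome, read):
    """Return every 0-based start position where read occurs exactly in genome."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Hypothetical toy genome containing the same 10-base repeat twice.
repeat = "ACGGTCATTG"
genome = "TTGCA" + repeat + "GGATC" + repeat + "CCTAG"

read_inside_repeat = repeat[2:8]             # 6 bases, entirely inside the repeat
read_with_flanks = "GCA" + repeat + "GGA"    # covers the repeat plus its flanks

if __name__ == "__main__":
    print(find_all(genome, read_inside_repeat))  # two candidate positions
    print(find_all(genome, read_with_flanks))    # one position
```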

Alternatively, long reads can be used for sequencing. These reads can be 100 times longer than short reads, so they carry fundamentally more information. Long reads can help map reads uniquely in complex regions, including repetitive elements. However, long reads currently suffer from an elevated error rate of about 15%. That means roughly one in every 7 bases is incorrect. Due to this limitation, long reads alone are not yet suitable for sequencing. However, a combination of short and long reads can perform substantially better than either method alone.
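
As a quick check of the error-rate arithmetic, here is a minimal sketch. It assumes, as a simplification of my own, that errors are independent and equally likely at every base, which real sequencing errors need not be. The 50-base window is just an example value.

```python
def bases_per_error(error_rate):
    """Average number of bases per incorrect base at a given per-base error rate."""
    return 1.0 / error_rate


def prob_error_free(error_rate, window):
    """Chance that a stretch of `window` bases has no error, assuming independent errors."""
    return (1.0 - error_rate) ** window


if __name__ == "__main__":
    rate = 0.15
    print(round(bases_per_error(rate), 1))  # 6.7
    print(prob_error_free(rate, 50))
```

Under this simple model, an error-free stretch of 50 bases is very unlikely (a probability of about 0.0003), which shows why error correction matters so much for long reads.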

Pacific Biosciences, a biotechnology company, focuses on long reads. They are trying to improve the error correction algorithm so that sequencing can be performed using long reads alone, without the short reads. That would be a great achievement, as it would reduce the cost and also enable identification of heterozygous and repetitive elements. Thus, we may expect Illumina's monopoly to weaken.

Another biotech company, Oxford Nanopore, is also in the race. They use a different technology. They measure the characteristic change in conductance when single-stranded DNA passes through or near a nanopore, a small hole with an internal diameter on the order of 1 nanometer. This strategy also produces long reads from a single cell. Although this approach suffers from a high error rate, an experiment has shown that more than 80% of the reads had perfect 50-nucleotide sections. This is impressive. If a proper error correction algorithm can be devised, Oxford Nanopore could beat the dominance of Pacific Biosciences in the long-read field.

To make the scenario more interesting, GenapSys, another biotech company, aims to develop a small all-electronic instrument, about the size of an iPad, that will perform all the sequencing steps and thus reduce sequencing time and cost.

Let's see which technology (or company) dominates the rest.

References:
1) Greenleaf,W.J. and Sidow,A. (2014) The future of sequencing: convergence of intelligent design and market Darwinism. Genome Biol., 15, 303.
2) http://www.genengnews.com/insight-and-intelligenceand153/the-long-and-the-short-of-dna-sequencing/77899725/
3) http://www.fluidigm.com/december-31-2013.html
4) http://allseq.com/knowledgebank/emerging-technologies/genapsys

April 1, 2014

Sequencing GWAS

A nice article on the difference between GWAS with SNPs and GWAS with sequencing:
http://massgenomics.org/2014/03/gwas-sequencing-realities.html