December 14, 2014
Types of RNA
- mRNA: Messenger RNA. Encodes amino acid sequences.
- tRNA: Transfer RNA. Carries amino acids to the ribosome during translation.
- rRNA: Ribosomal RNA. Makes up ribosomes.
- snRNA: Small Nuclear RNA (U1, U2, U4, U5, and U6).
- sRNA: Small RNA. Binds to protein or mRNA targets, and regulates gene expression.
- miRNA: Micro RNA. A family of sRNA that regulates gene expression in a sequence-specific manner.
- siRNA: Small Interfering RNA.
- dsRNA: Double-Stranded RNA
- pre-mRNA: Precursor mRNA. Contains both exons and introns.
- mature mRNA: All introns removed.
November 15, 2014
Useful links describing statistical concepts
October 16, 2014
Useful links for programming
- git push generates error: stackoverflow.
- git branching and merging: git-scm
- Install omp.h on Mac: stackoverflow
September 26, 2014
September 18, 2014
September 2, 2014
Human Transcription Factor Databases
I was searching for all human transcription factors (TFs). I found the following databases.
- TRANSFAC (link): It seems to be the most comprehensive database for eukaryotic TFs. It has both public and professional versions. As expected, the professional version contains more data. The first paper on TRANSFAC, published in 2000, has received more than 1000 citations so far.
- DBD (link): It contains predicted transcription factors from whole genomes. The first paper on DBD, published in 2008, has received ~140 citations so far.
- AnimalTFDB (link): As the name suggests, it contains animal transcription factors. The first paper on AnimalTFDB, published in 2012, has received ~25 citations so far.
The following discussions were helpful to know about these databases:
August 31, 2014
Useful links for python programmers
- Command line arguments parsing: link argparse package argparse tutorial
- Input/output: link
- Count occurrence of characters in a string: link
- Install PyDev: link
- FDR correction: link
- Send email: link1 link2
- yield explanation: stackoverflow
- Parallel processing: sebastianraschka.com
- Shared memory used by multiple processes: stackoverflow1 stackoverflow2 stackoverflow3
- Install virtualenv without 'sudo' permission: stackoverflow
pip install --user virtualenv
Plots in Python:
- Plots: matplotlib
- Error in plot when DISPLAY is undefined: stack overflow.
- Drawing distribution line using histogram. stack overflow.
August 30, 2014
Useful ubuntu help links for beginners
- Show only the current directory in terminal: link
- Vi Cheat Sheet: link link2
- Select-Cut-Copy-Paste text in vi: link
- Set vi indentation for python or other plugins: uncomment a few lines in /etc/vim/vimrc.
- Switch windows of the current application: link
- Install/uninstall programs locally without any administrative (sudo) privilege: link link
- Install Eclipse from command line: link
- keep linux process running after ssh session expired: serverfault IBM tutorial
- Unix/Linux cheat sheet: fosswire.com
- Keyboard does not respond after resuming: restart lightdm. You may put the following script in an executable file in /etc/pm/sleep.d/, which will be called on every resume. askubuntu1 askubuntu2 askubuntu3
#!/bin/sh
case "${1}" in
    resume|thaw)
        sudo service lightdm restart
        ;;
esac
July 20, 2014
Databases in Bioinformatics - 1
Databases are nowadays an indispensable part of Biology, and of course, of Bioinformatics. Online bioinformatics databases boomed in the last decade. It is impossible for one person to know about all of them. Yet, there are some important databases every bioinformatician should know. Dr. Bob Lessick, Associate Director, Center for Biotechnology Education, Johns Hopkins University, mentioned some of those databases in an online course, Bioinformatics: Life Sciences on Your Computer, at Coursera. This blog post summarizes the databases taught in the course.
1. Pubmed (http://pubmed.gov):
PubMed is a free database of scientific publications (references and abstracts) on life sciences and biomedical topics. It is hosted by the US National Library of Medicine (NLM) at the National Institutes of Health (NIH). It contains more than 23 million citations from biomedical literature. You can do a free-text search as well as an advanced search. Following are some example queries.
- Tan AC [author] (Tan AC [au]) : finds papers authored by persons whose last name is Tan and the initials are AC.
- Tan AC [au] AND plos one [journal] : finds AC Tan's papers published in the PloS One journal.
- RNAi [title] AND mello [au] : finds papers authored by Mello, with RNAi in the title.
- immunoglo* : finds papers with words starting with immunoglo. Note: * can be put only at the end of the query.
- "last 10 days" [edat] AND nature [journal]: finds Nature papers entered into PubMed in the last 10 days.
- 2014/02 [pdat] AND nature[journal]: finds Nature papers published in Feb 2014.
- 2014/02:2014/03 [pdat] AND nature[journal]: finds Nature papers published in Feb or March 2014.
- (DNA[title] OR RNA[title] ) AND 2014/02:2014/04[pdat] AND science[journal] : finds science papers published between Feb 2014 and April 2014 which have either DNA or RNA in their titles.
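The same query strings can also be sent to PubMed programmatically. Here is a minimal sketch in Python that only builds the request URL for NCBI's E-utilities esearch endpoint, without sending anything. The function name pubmed_search_url and the default retmax value are my own choices for illustration, not something taught in the course.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (esearch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Build an esearch URL for a PubMed query string (no request is sent)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    # urlencode takes care of spaces, brackets, slashes and colons in the query.
    return ESEARCH_URL + "?" + urlencode(params)


# The last example query from the list above.
query = "(DNA[title] OR RNA[title]) AND 2014/02:2014/04[pdat] AND science[journal]"
url = pubmed_search_url(query)
print(url)
```

Opening the printed URL in a browser should return an XML document with the matching PubMed ids. Please check NCBI's usage guidelines before sending many requests.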
2. MeSH (http://www.ncbi.nlm.nih.gov/mesh/)
MeSH stands for Medical Subject Headings. It contains a controlled vocabulary for medical fields. Multiple words may refer to the same thing. For example, both P53 and TP53 mean the same gene. MeSH gives a standard term for such a concept. If you search with that term in PubMed, you'll get all the papers related to it, no matter how it is spelled in the manuscript. If you search for P53 in the MeSH database, you'd get its controlled term - "Genes, p53". Now you can build queries for PubMed like below.
- "Genes, p53" [MeSH]
- "Genes, p53" [MeSH] AND nature [journal] AND 2013 [pdat]
Nucleotide is a database of sequences collected from different sources. You can find various kinds of information about genomes, genes, transcripts, etc.
The following figure shows the top portion of a BRCA1 transcript.
- Locus contains several pieces of information - accession number, sequence length, type (mRNA means it is a spliced RNA), genome type (linear or circular), type of organism (PRI = primates), and last modification date.
- Every time the sequence is changed, its version number is incremented (and it gets a new GI id).
- Here, the sequence is from the organism Homo sapiens.
- Publication about this sequence is listed in the reference section.
- You can get the sequence in FASTA format by clicking on the "FASTA" link placed below the title. Note: the sequence is also shown in GenBank format at the end of this record page.
- You can also see the version history and compare those from the "Display Settings" menu.
- Each feature is followed by a location in the sequence. If you click on a feature title in the left column, the corresponding sequence will be highlighted.
- Here, one exon is located at position 1 to 213.
- CDS, the coding region including the start codon and stop codon, is an important feature. The translated amino acid sequence is also available here.
July 19, 2014
Prepare a workable Ubuntu desktop computer
I started using computers with the Windows operating system, beginning with the ancient Windows 95 back in 1999. I must admit that Windows is pretty easy to learn. But the bad thing was that we used cracked versions, not licensed ones. Bangladesh being a poor country, copyright violations were (and still are) very common there. I was not an exception, but I felt guilty inside. I was not aware of Linux at the beginning. I was thrilled when I learned about Linux, a free open-source operating system: I would not have to violate the law anymore! But then I found, to my utter dismay, that the system was very, very complex. The installation process, software installation, and basically everything else was far more complex than on Windows. The graphical interface was very poor; you had to write commands to run your programs; you had to memorize syntax. A scarcity of software made the situation worse. I failed to play mp3 files even after trying for a day! Ultimately, I went back to Windows, along with the guilty feeling.
Now, Linux has improved very much. It is almost as easy to learn as Windows; sometimes it is even easier. Ubuntu, especially, has improved a great deal. In the meantime, I also got rid of my command-line fear. The Internet has also made software easily available. A few days ago, I thought, why don't I start using Linux now? Although I now use a licensed version of Windows, I still have a soft corner for the Linux philosophy - free and open source. So, I decided to move to Ubuntu.
Ubuntu is now mature enough. It is really easy to use these days. If you want to use it, maybe to avoid the guilty feeling of violating the law, you just need an honest will and a good internet connection. That's it!
Here I note down a few instructions I used to prepare a workable Ubuntu desktop computer for myself.
Basic Installation
- Download Ubuntu. I downloaded the 32-bit version. [url]
- Create a bootable usb drive. (You may also create a DVD) [url]
- Plug-in the usb (or DVD) and (re)start your computer.
- Set your computer to boot from USB (or DVD). You have to choose appropriate boot order. The priority of USB (or DVD) has to be higher than that of hard disk. [url]
- Install Ubuntu. [url]
Software Installations
You can easily install common software from Ubuntu Software Center using its graphical user interface (GUI). However, sometimes you may need to write commands in a terminal (press Ctrl+Alt+T). Personally, I prefer the terminal to the GUI. So, here I note down the commands to write in the terminal.
- Avro. [url]
- Skype.
- Download: [url]
- Install command: sudo dpkg -i skype-ubuntu-precise_4.2.0.11-1_i386.deb
- Sound troubleshooting: [url]
- Background noise: System Settings > Sound > Input. Make sure your recording device is not amplified.
- To run skype with pulseaudio: PULSE_LATENCY_MSEC=60 skype
- Google Chrome
- Download: [url]
- Install command: sudo dpkg -i google-chrome-stable_current_i386.deb
- Dictionary plug-in in Chrome [url]
- VLC media player [url]
- Install command: sudo apt-get install vlc
- Git
- sudo apt-get install git
- git config --global user.name "User Name"
- git config --global user.email "UserEmail@example.com"
- Apache Web Server [url]
- sudo apt-get install apache2
- check by hitting in browser: http://localhost/
- PHP
- sudo apt-get install php5
- sudo apt-get install libapache2-mod-php5
- To check the installation, create a small php page (test.php) in /var/www/html
<?php
phpinfo();
?>
- Finally, open http://localhost/test.php in a web browser (Firefox/Chrome). See this for detailed configuration.
- Latex [url]
- TexStudio: sudo apt-get install texstudio
- Extra packages: sudo apt-get install texlive-latex-extra
- R. [url]
- RStudio.
- Download: [url]
- Install command: sudo dpkg -i rstudio-0.98.945-i386.deb
- Teamviewer [url]
- Download: wget http://download.teamviewer.com/download/teamviewer_linux.deb
- sudo dpkg -i teamviewer_linux.deb
- Copy [url]
- wget https://copy.com/install/linux/Copy.tgz
- sudo tar -xvpzf Copy.tgz -C /etc
- cd /etc/copy/
- ./x86/CopyAgent
- BitTorrent
- Install command: sudo apt-get install bittorrent
- MySQL:
- sudo apt-get install mysql-server
- sudo apt-get install mysql-workbench
- SQLite:
- sudo apt-get install sqlite
- sudo apt-get install sqlitebrowser
Tips
- Shortcut to open a terminal: Ctrl + Alt + T
- Shortcut to copy-paste in Terminal: Ctrl+Shift+C / Ctrl+Shift+V
- Shortcut to toggle desktop: Ctrl+Super+D
- Keyboard shortcuts: [url]
- Install/Uninstall *.deb files. [url]
- sudo dpkg -i package_file.deb
- Fix dependency error / installation error:
- sudo apt-get -f install
- Disable certificate authentication for wifi connection:
- Open /etc/NetworkManager/system-connections/YOUR-CONNECTION file.
- Edit one configuration: system-ca-certs=false
- If you cannot edit a text file, you might need administrator permission. You can do so by opening gedit (text editor) in admin mode, and then edit the file.
- command: sudo gedit
- Close an unresponsive program: [url]
- Write a command in terminal: xkill
- Your mouse pointer will turn into an 'x'; click on the unresponsive program window.
June 24, 2014
An Online Bioinformatics Curriculum
Bioinformatics is no longer a brand-new research area; it has been around for more than a decade. Now, you need to be an expert in diverse disciplines to be a successful bioinformatician. Knowledge of biology, biochemistry, genetics, computer science, mathematics, statistics, etc. is essential. Usually, new researchers do not have expertise in all these fields. And it is often impractical to take formal courses in all of them during undergraduate or graduate studies. A free online course curriculum can help a great deal here. David B. Searls compiled a comprehensive online bioinformatics curriculum [1]. He listed free online courses offered by different universities and organizations. You may like to take the courses from this list that you think would improve your skills.
Reference:
[1] Searls DB (2012) An Online Bioinformatics Curriculum. PLoS Comput Biol 8(9): e1002632. doi:10.1371/journal.pcbi.1002632 [link]
May 13, 2014
Sequence Alignments and Seed Design
Sequence alignment is an important task in genome analysis. It searches for similarity in DNA, RNA or protein sequences that may be a result of functional, structural or evolutionary relationships. Although the scientific community has been researching this topic for decades, the problem is not yet solved, because sequences are extremely large: millions or billions of bases or amino acids. Due to run-time complexity, researchers typically use heuristics to search for similarity. There is no way to measure whether the current heuristics find all or most of the true alignments. Frith & Noé showed that improved search heuristics can find new alignments [1]. What heuristics do they use?
Before going to the heuristics proposed by Frith and Noé, let us briefly look at the existing approaches. Existing search methods generally follow a seed-and-extend approach. They first find short matches (seeds), and then search for high-scoring alignments near each seed. Here, the sensitivity of alignment (the ability to find all alignments) depends on seed design. Better seed design can result in better sensitivity.
Several kinds of seeds exist in the current literature. The simplest seed is an exact match of a given length. Spaced seeds allow mismatches at certain positions. For example, 1110111101 is a seed that looks for a match of length 10, with possible mismatches at positions 4 and 9. Transition-constrained seeds allow constrained mismatches at certain positions. Unlike the mismatch positions of spaced seeds, these positions do not accept just any mismatch; they accept only transitions (A↔G or T↔C), not transversions. For example, 11T011110T is a seed that looks for a match of length 10, with possible mismatches at positions 4 and 9 and possible transitions at positions 3 and 10. There is a biological explanation behind allowing transitions. Transitions do not change the chemical structure drastically (the number of rings stays the same), and are less likely to cause amino acid substitutions [2]. Transitions thus tend not to change functionality drastically, while transversions more often do. As a result, transitions often persist in the sequence as silent substitutions. In fact, transitions are more frequent than transversions, even though there are twice as many possible transversions as transitions.
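To make the seed notation concrete, here is a small Python sketch of how a seed pattern could be checked against two equally long windows. It assumes the alphabet described above ('1' = must match, '0' = any base, 'T' = match or transition). The function names seed_hit and find_hits are mine, and the brute-force scan is only for illustration; real aligners use indexes instead.

```python
# Transitions swap purine with purine (A<->G) or pyrimidine with pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_hit(seed, query, target):
    """Return True if the two windows satisfy the seed pattern.

    '1' = bases must match, '0' = any base allowed,
    'T' = bases must match or differ by a transition.
    """
    if not (len(seed) == len(query) == len(target)):
        raise ValueError("seed and both windows must have equal length")
    for s, q, t in zip(seed, query, target):
        if s == "1" and q != t:
            return False
        if s == "T" and q != t and (q, t) not in TRANSITIONS:
            return False
    return True


def find_hits(seed, query, target):
    """List (query_pos, target_pos) pairs, 0-based, where the seed hits."""
    k = len(seed)
    return [
        (i, j)
        for i in range(len(query) - k + 1)
        for j in range(len(target) - k + 1)
        if seed_hit(seed, query[i:i + k], target[j:j + k])
    ]


query = "ACGTACGTAC"
# Mismatches at positions 4 and 9 (1-based): tolerated by the spaced seed.
print(seed_hit("1110111101", query, "ACGAACGTTC"))
# Transitions at positions 3 and 10: tolerated by the transition-constrained seed.
print(seed_hit("11T011110T", query, "ACATACGTAT"))
# A transversion (G->C) at position 3 is rejected.
print(seed_hit("11T011110T", query, "ACCTACGTAC"))
```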
Both spaced and transition-constrained seeds look for a fixed-length match. In contrast, adaptive seeds can have variable length. Adaptive seeds are lengthened until the frequency of matches in the target sequence becomes less than or equal to a threshold [3]. Suppose a fixed-length seed matches at 1 million positions of the target sequence. Then we have to perform the expensive alignment step 1 million times. Adaptive seeds lengthen the seed to reduce the match frequency, and thus improve runtime. Sparse seeds can also be used to reduce match frequency. Sparse seeds do not look for a match at every position; instead, they put a regular interval between starting points (e.g., search at every second or third position).
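The idea of adaptive and sparse seeds can be sketched in a few lines of Python. This is a naive illustration that counts matches by scanning the target string, while the method in [3] relies on an index of the target. The names adaptive_seed, max_hits and sparse_starts are hypothetical.

```python
def count_occurrences(pattern, target):
    """Count (possibly overlapping) exact occurrences of pattern in target."""
    count, start = 0, target.find(pattern)
    while start != -1:
        count += 1
        start = target.find(pattern, start + 1)
    return count


def adaptive_seed(query, start, target, max_hits):
    """Lengthen a seed starting at query[start] until it is rare enough.

    Returns the shortest prefix of query[start:] that occurs at most
    max_hits times in target, or None if the query ends first.
    """
    for end in range(start + 1, len(query) + 1):
        seed = query[start:end]
        if count_occurrences(seed, target) <= max_hits:
            return seed
    return None


def sparse_starts(query_length, step):
    """Seed start positions when only every step-th position is tried."""
    return list(range(0, query_length, step))


target = "ACGTACGAACGT"
print(adaptive_seed("ACGT", 0, target, max_hits=3))  # stops growing early
print(adaptive_seed("ACGT", 0, target, max_hits=2))  # needs a longer seed
print(sparse_starts(10, 3))
```

A lower threshold forces longer seeds, which means fewer positions to extend but also a higher chance of missing a diverged alignment.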
In general, each parameter of seed design has opposing effects on sensitivity and runtime. For example, short seeds have higher sensitivity but longer runtime than long seeds. Thus, it can be a good idea to observe the effect of different parameters. Frith & Noé did exactly that in [1]; they analyzed the performance of different parameters using sensitivity and runtime. The following figure summarizes their results.
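A back-of-the-envelope model shows why seed length pulls sensitivity and runtime in opposite directions. The sketch below is my own simplification, not the analysis in [1]: it assumes every position matches independently, that unrelated DNA is uniformly random, and it ignores edge effects.

```python
def hit_probability(identity, weight):
    """Chance that all `weight` required positions match at a true
    alignment position, if each position matches with probability
    `identity` independently (a simplifying assumption)."""
    return identity ** weight


def expected_random_hits(query_length, target_length, weight):
    """Rough expected number of chance seed hits between two unrelated,
    uniformly random DNA sequences (edge effects ignored)."""
    return query_length * target_length * 0.25 ** weight


# Longer seeds: fewer chance hits to extend (faster), but a lower
# probability of hitting inside a true 80%-identity alignment.
for weight in (8, 12, 16):
    print(weight,
          round(hit_probability(0.8, weight), 4),
          expected_random_hits(1_000_000, 1_000_000, weight))
```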
Frith & Noé found that transition-constrained seeds can be useful to find new alignments (see part-C in the above figure). They carefully designed seeds using different transition-transversion ratios. The transition-transversion ratio between human and dog is 3:2. While previous research mostly used a 1:1 ratio, Frith and Noé used 3:2. This generally reduces the error while keeping the runtime the same. With this approach, the authors found about 20,000 new alignments between human and mouse. Not all of these necessarily signify functional or evolutionary relationships. The authors speculate that there is a high probability of finding new significant alignments. As most of the newly aligned sequences were previously unaligned, they speculate a high rate of orthologs. This claim needs more support.
References:
[1] Frith,M.C. and Noé,L. (2014) Improved search heuristics find 20 000 new alignments between human and mouse genomes. Nucleic Acids Res., 42, e59.
[2] Carr, S.M. (2013), Transition versus Transversion mutations, https://www.mun.ca/biology/scarr/Transitions_vs_Transversions.html, Accessed on May 12, 2014.
[3] Kiełbasa,S.M. et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome Res., 21, 487–93.
May 1, 2014
Homologs, Orthologs, and Paralogs
Homologs, Orthologs, and Paralogs - these 3 terms are conceptually related. It is necessary to understand the distinction among them.
Homology means that two genes are related by descent i.e. they have a common ancestral DNA sequence. Homology can be divided into two parts - Orthology and Paralogy. Orthologs are results of speciation, while Paralogs are results of gene duplication.
Orthology and Paralogy can easily be determined from the ancestral tree. You have to track along the vertical line of descent and find the place where the pair of genes join. If they join at an upside-down 'Y' node, then they are orthologs. In contrast, if they join at a horizontally connected node, then they are paralogs.
Here is an example [1]. In Fig. (a):
- A1 has 5 Orthologs - B1, B2, C1, C2, C3, as all five orthologs join with A1 at an inverted 'Y' node, where speciation occurred.
- B1 & B2 are paralogs, as they meet horizontally, where gene duplication took place.
- B1 & C1 are orthologs.
- C1, C2 & C3 are paralogs to each other.
Fig. (b) and Fig. (c) are actually the same; (c) is a detailed illustration of (b). Here:
- A1 & A2 (and also B1 & B2) are orthologs.
- A1 & B1; A1 & B2; A2 & B1; A2 & B2 are all paralogs.
Identification of orthologs can play a significant role in determining evolutionary history. Usually, orthologs have the same function as their common ancestor, while paralogs do not. Bioinformaticians often take advantage of this behavior to differentiate orthologs from paralogs. But functional similarity (or dissimilarity) does not necessarily imply orthology (or paralogy).
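The tree rule described above (look at the node where two genes join) is easy to express in code. Below is a minimal Python sketch. The nested-tuple tree format and the two example topologies are my own assumptions, chosen to be consistent with the bullet points for Fig. (a) and Fig. (b); they are not taken from [1].

```python
def leaves(node):
    """Set of gene names below a node; a leaf is just a string."""
    if isinstance(node, str):
        return {node}
    _event, *children = node
    result = set()
    for child in children:
        result |= leaves(child)
    return result


def relation(node, a, b):
    """Classify two genes by the event at the node where they join."""
    if isinstance(node, str):
        raise ValueError("need two different genes from the tree")
    event, *children = node
    for child in children:
        below = leaves(child)
        if a in below and b in below:
            return relation(child, a, b)  # they join further down
    if not {a, b} <= leaves(node):
        raise ValueError("gene not found in tree")
    return "orthologs" if event == "speciation" else "paralogs"


# Hypothetical topology matching the bullets for Fig. (a).
fig_a = ("speciation", "A1",
         ("speciation",
          ("duplication", "B1", "B2"),
          ("duplication", "C1", ("duplication", "C2", "C3"))))

# Hypothetical topology matching the bullets for Fig. (b).
fig_b = ("duplication",
         ("speciation", "A1", "A2"),
         ("speciation", "B1", "B2"))

print(relation(fig_a, "A1", "C3"))
print(relation(fig_a, "B1", "B2"))
print(relation(fig_b, "A1", "B2"))
```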
Reference:
- Jensen,R.A. (2001) Orthologs and paralogs - we need to get it right. Genome Biol., 2(8). [link]
Other Sources:
April 28, 2014
Deciphering Nature's Alphabet
A few days ago, GenomeTV uploaded an inspiring documentary on major discoveries of molecular biology. Each of the five parts contains the stories behind a set of major discoveries. Great scientists themselves tell their own history, which is amazing. Here is the playlist on YouTube: http://www.youtube.com/playlist?list=PL1ay9ko4A8smA2OjHMSeJPqrhJwYaH-Aa.
April 25, 2014
Increase the readability of your writing
I came across an online application, Hemingway, to check scientific writing. It highlights the sentences that are hard to understand, and also suggests improvements. The suggestions are, in fact, quite simple. Yet, we often do not get these things right. Here is the link: http://www.hemingwayapp.com/.
I found that this application does not show the grammatical errors. Yet, it does provide good suggestions to improve the readability. I tested the readability of my last blog, and then made necessary modifications to get rid of the errors. I am satisfied with the quality of service.
Thanks to Stephen Turner for suggesting this tool.
Bioinformatics Algorithms
Finally, I have completed the Bioinformatics Algorithms (Part-1) course at Coursera. I enrolled in this course in November 2013, but could not finish it then due to workload both in daily life and in the course. The course design is favorable for a newcomer to bioinformatics. The instructors described important problems in genomics, and gave the intuition behind the solutions. They explained the solutions step by step so that a student can learn how a bioinformatician should approach a problem. They stated the most naive solution, then identified its drawbacks, and improved it. They also showed how biology-informed models can improve the solutions.
A good thing to note is that the course stated some of the open problems in bioinformatics at the end of each chapter. Those problems are hard. Yet, they give a glimpse of the difficulty level of current research topics in bioinformatics. After completing this course, I am now confident that I know at least a little about bioinformatics. I realized that I have to study hard, especially biology, to become competent in dealing with biological problems. I realized once again that the key to success in this area is the ability to connect biology with computing.
The assignments are a vital part of the course. These assignments forced the students to DO the job, instead of just knowing it. I solved each assignment using either R or Python. I started with R, as I was familiar with it. Later, I found that R is slow, particularly if you cannot avoid looping (using for). I often struggled to find a solution within the 5-minute limit given for a problem. Then I started learning and using Python. To my surprise, I found that Python is pretty fast and its syntax is quite powerful. I especially liked list comprehensions and lambda expressions in Python.
I also started using github to store my code on a regular basis. Although I had a github account before, I seldom used it. This time, I stored all the solutions at github. Here is the repository.
April 2, 2014
Short read sequencing or Long read sequencing?
Genome sequencing is a hot topic these days. Currently, the popular method of sequencing is to generate millions of short reads, typically 50 to 150 nucleotides long, and then assemble the reads computationally. Illumina, which has almost a monopoly in the sequencing business, follows this strategy. However, this strategy has some drawbacks. For example, it reads genomes from multiple cells, and the biological signals in those cells are averaged to generate a consensus sequence. Consequently, it cannot identify molecular-level biological differences. Moreover, this strategy does not work well with repetitive or heterozygous sequences.
In contrast, long reads can be used for sequencing. These reads can be 100 times longer than short reads, so they carry fundamentally more information. Long reads can help map reads uniquely in complex regions, including repetitive elements. However, long reads currently suffer from an elevated error rate of about 15%; that is, roughly one in every 7 bases is incorrect. Due to this limitation, long reads alone are not yet suitable for sequencing. However, a combination of short and long reads can perform substantially better than either method alone.
Pacific Biosciences, a biotechnology company, focuses on long reads. They are trying to improve the error correction algorithm so that sequencing can be performed from the long reads alone, without using short reads. That would be a great achievement, as it would reduce the cost, and also enable identification of heterozygous and repetitive elements. Thus, we may expect that the monopoly of Illumina would be reduced.
Another biotech company, Oxford Nanopore, is also in the race. They follow a different technology. They use the characteristic conductance change when single-stranded DNA passes through or near a nanopore, a small hole of the order of 1 nanometer in internal diameter. This strategy also produces long reads from a single cell. Although this approach suffers from a high error rate, it has been shown in an experiment that more than 80% of the reads had perfect 50-nucleotide sections. This is impressive. If a proper error correction algorithm can be devised, Oxford Nanopore could challenge the dominance of Pacific Biosciences in the long-read field.
To make the scenario more interesting, GenapSys, another biotech company, aims at developing a small all-electronic instrument, like an iPad, that will perform all the sequencing steps, and thus reduce the sequencing time and cost.
Let's see which technology (or company) dominates the rest.
References:
1) Greenleaf,W.J. and Sidow,A. (2014) The future of sequencing: convergence of intelligent design and market Darwinism. Genome Biol., 15, 303.
2) http://www.genengnews.com/insight-and-intelligenceand153/the-long-and-the-short-of-dna-sequencing/77899725/
3) http://www.fluidigm.com/december-31-2013.html
4) http://allseq.com/knowledgebank/emerging-technologies/genapsys
April 1, 2014
Sequencing GWAS
A nice article on the difference between GWAS with SNPs and that with Sequencing.
http://massgenomics.org/2014/03/gwas-sequencing-realities.html
February 24, 2014
Configure core.editor in git to enter tag messages
Some git commands (e.g., git tag) need a message from the user, and git automatically opens a text editor to write the message. However, if your editor (core.editor) is not configured, then you may end up with an error message like below.
fatal: no tag message?
Here are the git commands to configure a text editor depending on OS.
In Linux: git config --global core.editor "vim"
In Windows: git config --global core.editor "'C:/Program Files (x86)/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"
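To check that the setting took effect, something like the following sketch may help. It works in a throwaway repository and sets core.editor locally (without --global) so that your real configuration is left alone. The identity, tag name, and messages are placeholders I made up for the demo. The last part shows that passing -m to git tag supplies the message directly, so no editor is needed at all.

```shell
# Sketch: use a temporary repository so the real global config is untouched.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"   # placeholder identity for the demo
git config core.editor "vim"               # repository-local, not --global
git commit -q --allow-empty -m "initial commit"

# Read back the editor that is configured for this repository.
editor=$(git config --get core.editor)
echo "core.editor = $editor"

# With -m the tag message is given on the command line; no editor is opened.
git tag -a v0.1 -m "first tagged version"
tag_subject=$(git tag -l --format='%(contents:subject)' v0.1)
echo "tag message = $tag_subject"
```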
References: ref1 ref2
February 14, 2014
How to find an Euler cycle in a balanced directed graph in R
I was trying to find an Euler cycle in a balanced directed graph (for every node, indegree = outdegree) in R. First, I tried the PairViz package. However, it works only for undirected even graphs (every node has even degree). I also posted on StackOverflow, but found that a solution was not readily available. So, I implemented the algorithm as below.
Code:
library(graph)

eulerCycle <- function(g, start=NULL){
  eulerCycle <- c()
  curNode <- ifelse(is.null(start), nodes(g)[1], start)
  while(!is.na(curNode)){
    cycle <- curNode
    while(!is.na(nextNode <- randomWalkNext(g, curNode))){
      g <- removeEdge(graph=g, from=curNode, to=nextNode)
      cycle <- append(cycle, nextNode)
      curNode <- nextNode
    }
    if(length(eulerCycle)==0){
      eulerCycle <- cycle
    } else{
      insertIndex <- which(eulerCycle==cycle[1])[1]
      eulerCycle <- append(eulerCycle, after=insertIndex, values=cycle[-1])
    }
    curNode <- getAnUnexploredNode(g, nodes=eulerCycle)
  }
  return(eulerCycle)
}

getAnUnexploredNode <- function(g, nodes){
  degrees <- degree(g, Nodes=nodes)
  nodeIndexes <- which(degrees$outDegree+degrees$inDegree>0)
  node <- NA
  if(length(nodeIndexes)>0){
    node <- nodes[nodeIndexes[1]]
  }
  return(node)
}

randomWalkNext <- function(g, from){
  outEdges <- edges(object=g, which=from)[[1]]
  nextNode <- NA
  if(length(outEdges)>0){
    nextNode <- outEdges[1]
  }
  return(nextNode)
}
Verification Code:
> g <- new("graphNEL", nodes=as.character(1:10), edgemode="directed")
> g <- addEdge(graph=g, from="1", to="10")
> g <- addEdge(graph=g, from="2", to="1")
> g <- addEdge(graph=g, from="2", to="6")
> g <- addEdge(graph=g, from="3", to="2")
> g <- addEdge(graph=g, from="4", to="2")
> g <- addEdge(graph=g, from="5", to="4")
> g <- addEdge(graph=g, from="6", to="5")
> g <- addEdge(graph=g, from="6", to="8")
> g <- addEdge(graph=g, from="7", to="9")
> g <- addEdge(graph=g, from="8", to="7")
> g <- addEdge(graph=g, from="9", to="6")
> g <- addEdge(graph=g, from="10", to="3")
>
> ec <- eulerCycle(g, start="6")
> print(ec)
 [1] "6"  "5"  "4"  "2"  "1"  "10" "3"  "2"  "6"  "8"  "7"  "9"  "6"
Update:
I have published an R package on CRAN, euler, to find Eulerian paths in graphs.
January 24, 2014
Interface between R and PHP or other languages
Rserve (link1, link2) is a useful R package for interfacing R with other languages such as Java and PHP. It is basically a TCP/IP server: it creates TCP socket connections through which the outside world can talk to R.
The good thing is that it manages multiple connections in a clean way. It creates a separate workspace and a directory for every connection, so each connection is independent of the others. Another good thing is that several client-side implementations are available, including C/C++, Java, and PHP.
Recently, I used it with PHP, and it works great. You may start Rserve as a daemon (using the function Rserve()) or from your R session, with your current session available to all connections (using the function run.Rserve()).
While integrating R with PHP via Rserve, I struggled to get write permission in the Rserve folder. I found that Rserve and the Apache server have to be run by the same user and group, and both have to be configured for this purpose.
Apache config:
1. Edit /etc/apache2/envvars
export APACHE_RUN_USER=ashis
export APACHE_RUN_GROUP=ashis
2. You may have to change the ownership of the /var/lock/apache2 folder.
sudo chown -R ashis:ashis /var/lock/apache2/
Rserve config:
1. Find the uid and gid of the user (here ashis:ashis)
id ashis
2. Edit the /etc/Rserv.conf file
uid UID_OF_USER
gid GID_OF_USER
Finally, you have to restart both Apache and Rserve.
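As a small convenience, the uid and gid lines for the Rserve configuration can be generated rather than typed by hand. The sketch below is only an illustration: rserve_ids is a hypothetical helper name of my own, not part of Rserve, and it only prints the two lines. Copying them into the configuration file is still a manual step.

```shell
# Hypothetical helper (not part of Rserve): print the uid/gid lines
# for the given user in the format used by the Rserve config file.
rserve_ids() {
    # With no argument, `id` reports on the current user.
    printf 'uid %s\ngid %s\n' "$(id -u "$@")" "$(id -g "$@")"
}

rserve_ids            # lines for the current user
# rserve_ids ashis    # lines for the user named in the post
```

The same numeric ids should match the user and group that Apache is configured to run as.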