May 13, 2014

Sequence Alignments and Seed Design

Sequence alignment is an important task in genome analysis. It searches for similarity in DNA, RNA or protein sequences that may result from functional, structural or evolutionary relationships. Although the scientific community has been researching this topic for decades, the problem is not yet solved, because sequences are extremely large: millions or billions of bases or amino acids. Due to the run-time complexity, researchers typically use heuristics to search for similarity. There is no way to measure whether the current heuristics find all or most of the true alignments. Frith & Noé showed that improved search heuristics can find new alignments [1]. What heuristics do they use?

Before going to the heuristics proposed by Frith and Noé, let us briefly look at the existing approaches. Existing search methods generally follow a seed-and-extend approach. They first find short matches (seeds), and then search for high-scoring alignments near each seed. Here, the sensitivity of alignment (the ability to find all alignments) depends on seed design. Better seed design can result in better sensitivity.
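To make the seed-and-extend idea concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions, not how any particular aligner works: seeds are exact k-mers, extension is ungapped, and the scoring values and the names index_kmers, extend_ungapped, seed_and_extend and x_drop are made up for this example.

```python
from collections import defaultdict


def index_kmers(target, k):
    """Map every length-k substring of target to its start positions."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index


def extend_ungapped(query, target, q, t, k, match=1, mismatch=-1, x_drop=3):
    """Extend an exact seed hit at query[q:q+k] / target[t:t+k] in both
    directions without gaps. Extension in one direction stops once the
    running score falls x_drop or more below the best score seen so far."""

    def extend(qi, ti, step):
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= ti < len(target):
            score += match if query[qi] == target[ti] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score >= x_drop:
                break
            qi += step
            ti += step
        return best, best_len

    right_score, right_len = extend(q + k, t + k, 1)
    left_score, left_len = extend(q - 1, t - 1, -1)
    total = k * match + left_score + right_score
    # (score, query start, target start, alignment length)
    return total, q - left_len, t - left_len, k + left_len + right_len


def seed_and_extend(query, target, k):
    """Find exact k-mer seeds, extend each one, and return the distinct
    extended alignments, best score first."""
    index = index_kmers(target, k)
    alignments = set()
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            alignments.add(extend_ungapped(query, target, q, t, k))
    return sorted(alignments, reverse=True)
```

Note that every seed hit triggers one extension, so the number of seed hits directly drives the runtime. Several seeds on the same diagonal extend to the same alignment, which is why the results are collected in a set.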

Several kinds of seeds exist in the current literature. The simplest seed is an exact match of a given length. Spaced seeds allow mismatches at certain positions. For example, 1110111101 is a seed that looks for a match of length 10, with possible mismatches at positions 4 and 9. Transition-constrained seeds allow constrained mismatches at certain positions. Unlike spaced seeds, they do not allow any kind of mismatch at these positions; they allow only transitions (A↔G or T↔C), not transversions. For example, 11T011110T is a seed that looks for a 10-length match with possible mismatches at positions 4 & 9 and possible transitions at positions 3 & 10. There is a biological explanation behind allowing transitions. Transitions do not change the chemical structure drastically (the number of rings stays the same), and are less likely to cause amino acid substitutions [2]. Transitions thus tend not to change the functionality drastically, while transversions usually do. As a result, transitions often stay in the sequence as silent substitutions. In fact, transitions are more frequent than transversions, even though there are twice as many possible transversions as transitions.
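The seed patterns above can be checked with a few lines of Python. This is a sketch under the notation used in this post ('1' = must match, '0' = any mismatch allowed, 'T' = match or transition); the function names is_transition and seed_matches and the example sequences are made up for illustration.

```python
# The four ordered transition substitutions: purine<->purine (A, G)
# and pyrimidine<->pyrimidine (C, T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def is_transition(x, y):
    """True if substituting base x by base y is a transition."""
    return (x, y) in TRANSITIONS


def seed_matches(pattern, a, b):
    """Return True if the equal-length windows a and b satisfy the seed.
    '1' requires identical bases, '0' accepts anything, and
    'T' accepts identical bases or a transition."""
    if not (len(pattern) == len(a) == len(b)):
        raise ValueError("pattern and windows must have equal length")
    for symbol, x, y in zip(pattern, a, b):
        if symbol == "1" and x != y:
            return False
        if symbol == "T" and x != y and not is_transition(x, y):
            return False
    return True


a = "ACGTACGTAC"
print(seed_matches("1110111101", a, "ACGAACGTCC"))  # mismatches at 4 and 9 -> True
print(seed_matches("11T011110T", a, "ACATACGTAT"))  # transitions at 3 and 10 -> True
print(seed_matches("11T011110T", a, "ACCTACGTAC"))  # transversion at 3 -> False
```

Counting the pairs also confirms the last sentence above: of the 12 ordered substitutions between distinct bases, 4 are transitions and 8 are transversions.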

Both spaced and transition-constrained seeds look for a fixed-length match. In contrast, adaptive seeds can have variable length. Adaptive seeds are lengthened until the frequency of matches in the target sequence becomes less than or equal to a threshold [3]. Suppose a fixed-length seed matches at 1 million positions of the target sequence. Then we have to perform the expensive alignment step 1 million times. Adaptive seeds lengthen the seed to reduce match frequency, and thus improve runtime. Sparse seeds can also be used to reduce match frequency. Sparse seeds do not look for a match at every position; instead they leave a regular interval between starting points (e.g., searching at every second or third position).
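Here is a small Python sketch of the adaptive and sparse ideas. It is an assumption-laden toy: occurrences are counted by a plain string scan rather than the index structures a real tool would use, and the names count_occurrences, adaptive_seed, sparse_starts and max_freq are hypothetical.

```python
def count_occurrences(target, word):
    """Count occurrences of word in target, including overlapping ones."""
    count = 0
    start = target.find(word)
    while start != -1:
        count += 1
        start = target.find(word, start + 1)
    return count


def adaptive_seed(query, start, target, max_freq):
    """Lengthen the seed beginning at query[start] one base at a time
    until it occurs at most max_freq times in target.
    Returns (seed, count). A count of 0 means the seed has no hit left;
    if the query runs out first, the longest possible seed is returned."""
    seed, count = "", None
    for end in range(start + 1, len(query) + 1):
        seed = query[start:end]
        count = count_occurrences(target, seed)
        if count <= max_freq:
            break
    return seed, count


def sparse_starts(query_length, step):
    """Start positions for sparse seeding: every step-th position."""
    return list(range(0, query_length, step))


target = "ACGACGACTACGT"
print(adaptive_seed("ACGT", 0, target, max_freq=3))  # ('ACG', 3)
print(adaptive_seed("ACGT", 0, target, max_freq=1))  # ('ACGT', 1)
print(sparse_starts(10, 3))                          # [0, 3, 6, 9]
```

A lower threshold forces longer, rarer seeds and therefore fewer extensions, while a larger step reduces the number of starting points that are tried at all.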

In general, each parameter of seed design trades sensitivity against runtime. For example, short seeds have higher sensitivity but longer runtime than long seeds. Thus, it is a good idea to observe the effect of different parameters. Frith & Noé did exactly that in [1]; they analyzed the performance of different parameter settings in terms of sensitivity and runtime. The following figure summarizes their results.



Frith & Noé found that transition-constrained seeds can be useful for finding new alignments (see part C in the above figure). They carefully designed seeds using different transition-transversion ratios. The transition-transversion ratio between human and dog is 3:2. While previous research mostly used a 1:1 ratio, Frith and Noé used 3:2. This generally reduces the error while keeping the runtime the same. With this approach, the authors found about 20,000 new alignments between human and mouse. Not all of these necessarily signify functional or evolutionary relationships. The authors speculate that there is a high probability of finding new significant alignments. As most of the newly aligned regions were previously unaligned, they speculate that a high proportion are orthologs. This claim needs more support.


References:

[1] Frith,M.C. and Noé,L. (2014) Improved search heuristics find 20 000 new alignments between human and mouse genomes. Nucleic Acids Res., 42, e59.

[2] Carr, S.M. (2013), Transition versus Transversion mutations, https://www.mun.ca/biology/scarr/Transitions_vs_Transversions.html, Accessed on May 12, 2014.

[3] Kiełbasa,S.M. et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome Res., 21, 487–93.






May 1, 2014

Homologs, Orthologs, and Paralogs

Homologs, orthologs, and paralogs - these three terms are conceptually related, and it is necessary to understand the distinctions among them.

Homology means that two genes are related by descent, i.e., they share a common ancestral DNA sequence. Homology can be divided into two kinds: orthology and paralogy. Orthologs result from speciation, while paralogs result from gene duplication.

Orthology and paralogy can easily be determined from the ancestral tree. Track back along the vertical lines of descent and find the place where the pair of genes join. If they join at an upside-down 'Y' node (a speciation event), then they are orthologs. In contrast, if they join at a horizontally connected node (a gene duplication event), then they are paralogs.





Here is an example [1]. In Fig. (a):

  • A1 has 5 Orthologs - B1, B2, C1, C2, C3, as all five orthologs join with A1 at an inverted 'Y' node, where speciation occurred.
  • B1 & B2 are paralogs, as they meet horizontally, where gene duplication took place.
  • B1 & C1 are orthologs.
  • C1, C2 & C3 are paralogs to each other.

Fig. (b) and Fig. (c) are actually the same; (c) is a detailed illustration of (b). Here:

  • A1 & A2 (and also B1 & B2) are orthologs.
  • A1 & B1; A1 & B2; A2 & B1; A2 & B2 are all paralogs.

Identifying orthologs can play a significant role in determining evolutionary history. Usually, orthologs have the same function as their common ancestor, while paralogs do not. Bioinformaticians often take advantage of this behavior to differentiate orthologs from paralogs. However, functional similarity (or dissimilarity) does not necessarily imply orthology (or paralogy).
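The tree-reading rule described earlier (find where two genes join, then look at the kind of node) can be sketched in Python. The Node class, the event labels and the function names are made up for this illustration, and the tree below is my assumed encoding of Fig. (b)/(c): one gene duplication at the root, followed by a speciation in each copy.

```python
class Node:
    def __init__(self, name, event=None, children=()):
        self.name = name
        self.event = event            # "speciation" or "duplication"; None for leaves
        self.children = list(children)


def path_to(node, gene, path=()):
    """Return the root-to-leaf path (tuple of nodes) ending at the leaf
    named gene, or None if that leaf is not below node."""
    path = path + (node,)
    if not node.children:
        return path if node.name == gene else None
    for child in node.children:
        found = path_to(child, gene, path)
        if found:
            return found
    return None


def relationship(root, gene_a, gene_b):
    """Classify two distinct genes by the event at the node where they join."""
    if gene_a == gene_b:
        raise ValueError("need two different genes")
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    if path_a is None or path_b is None:
        raise KeyError("gene not found in tree")
    join = None
    for x, y in zip(path_a, path_b):
        if x is not y:
            break
        join = x                      # deepest node shared by both paths
    return "orthologs" if join.event == "speciation" else "paralogs"


# Assumed encoding of Fig. (b)/(c): duplication first, then speciation.
tree = Node("ancestor", "duplication", [
    Node("A", "speciation", [Node("A1"), Node("A2")]),
    Node("B", "speciation", [Node("B1"), Node("B2")]),
])
print(relationship(tree, "A1", "A2"))  # orthologs
print(relationship(tree, "A1", "B2"))  # paralogs
```

This classifies purely by tree topology, so it inherits any errors in the tree itself; it says nothing about function.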

Reference:
  1. Jensen,R.A. (2001) Orthologs and paralogs - we need to get it right. Genome Biol., 2(8).
Other Sources: