March 21, 2012

Multiple testing correction in genomics

In hypothesis testing, the P-value is a widely used metric. The aim is a very low P-value (below the significance level), meaning that an observation at least as extreme as the one seen would be unlikely to arise by chance if the null hypothesis were true. A few demonstrations are available on the web.



An important point about the P-value is that it is statistically valid only when a single score is computed. In genomics, however, thousands of genes, millions of SNPs, or other scores are usually tested at once, so some scores will pass a fixed P-value threshold purely by chance. The P-value threshold therefore has to be adjusted, because the unadjusted threshold is valid only for a single score.
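To see why a single-test threshold breaks down, here is a small sketch of how many false positives to expect when many tests share one threshold. The function names and the numbers are illustrative only, and the second function assumes the tests are independent, which real genomic scores often are not.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of truly null scores that pass the threshold by chance."""
    return alpha * n_tests


def prob_at_least_one_false_positive(alpha, n_tests):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Illustrative numbers: 1000 tests, all truly null, threshold 0.01.
alpha, n_tests = 0.01, 1000
print(expected_false_positives(alpha, n_tests))          # about 10
print(prob_at_least_one_false_positive(alpha, n_tests))  # very close to 1
```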

The most widely used method for multiple testing correction is the Bonferroni correction, which divides the significance threshold (α) by the number of tests (n). With a Bonferroni-adjusted significance threshold of α = 0.01, we can be 99% sure that none of the scores passing the threshold arose by chance under the null hypothesis. This adjustment is usually too strict.
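The Bonferroni rule described above can be sketched in a few lines. The P-values are made-up numbers and the function names are hypothetical, chosen only for this illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold: alpha divided by the number of tests."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.01):
    """Return one boolean per test: True where p is below alpha / n."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Hypothetical P-values for five tests (made-up numbers).
p_values = [0.0001, 0.004, 0.02, 0.3, 0.0015]
flags = bonferroni_significant(p_values, alpha=0.01)
print(bonferroni_threshold(0.01, len(p_values)))
print(flags)
```

Note that 0.004 would pass the unadjusted threshold of 0.01 but fails the adjusted one, which is the strictness the text refers to.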

Rather than requiring 99% certainty that none of the observed scores is drawn from the null hypothesis, it is frequently sufficient to obtain a set of scores of which a small percentage may be drawn from the null hypothesis. This is the basis of False Discovery Rate (FDR) estimation. For some score threshold t, let Sobs be the number of observed scores >= t and Snull the number of null scores >= t. Then the FDR is defined as

FDR = Snull / Sobs.
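A minimal sketch of this estimate: the FDR at threshold t is the number of null scores at or above t divided by the number of observed scores at or above t. It assumes the null scores come from some null model (for example shuffled data) and that there are as many null scores as observed scores. The score lists are made up, and returning 0 when nothing passes the threshold is a convention chosen here.

```python
def empirical_fdr(observed, null, t):
    """Estimate the FDR at threshold t as (null scores >= t) / (observed scores >= t)."""
    s_obs = sum(1 for s in observed if s >= t)
    s_null = sum(1 for s in null if s >= t)
    if s_obs == 0:
        return 0.0  # no discoveries at this threshold (convention chosen here)
    return min(1.0, s_null / s_obs)


# Made-up scores; both lists have the same length.
observed = [9.1, 8.4, 7.7, 6.9, 5.2, 4.8, 3.3, 2.1]
null = [6.5, 4.9, 4.1, 3.8, 3.0, 2.6, 1.9, 1.2]

print(empirical_fdr(observed, null, 4.5))  # 2 null vs 6 observed scores pass
print(empirical_fdr(observed, null, 7.0))  # no null score passes
```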

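One limitation of the FDR is that it is not monotonic in the score threshold: a stricter threshold can sometimes give a higher estimated FDR than a looser one. This is addressed by another metric, the q-value, which is defined as the minimum FDR attained at or above a given score.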
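A small sketch of the q-value, under one reading of the definition that is an assumption here: the q-value of a score is the smallest estimated FDR over every threshold that still calls that score significant (with higher scores being better, every threshold at or below the score). The scores are made up and the helper names are hypothetical.

```python
def empirical_fdr(observed, null, t):
    """Estimated FDR at threshold t: (null scores >= t) / (observed scores >= t)."""
    s_obs = sum(1 for s in observed if s >= t)
    s_null = sum(1 for s in null if s >= t)
    return 0.0 if s_obs == 0 else min(1.0, s_null / s_obs)


def q_values(observed, null):
    """q-value per observed score: smallest FDR over thresholds t <= that score."""
    running_min = float("inf")
    q_by_score = {}
    # Walk thresholds from loosest to strictest, keeping the smallest FDR so far.
    for t in sorted(set(observed)):
        running_min = min(running_min, empirical_fdr(observed, null, t))
        q_by_score[t] = running_min
    return [q_by_score[s] for s in observed]


# Made-up scores; the estimated FDR at 7.0 (1/3) exceeds the FDR at 6.0 (1/4).
observed = [9.0, 8.0, 7.0, 6.0, 5.0]
null = [7.5, 5.5, 4.0, 3.0, 2.0]
print(q_values(observed, null))
```

Unlike the raw FDR estimates, the resulting q-values never increase as the score increases.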

The question then arises: is the Bonferroni correction, although the most widely used, actually useful in any circumstance? The answer depends on the trade-off between the costs and benefits associated with false positives and false negatives. The guideline is: if follow-up analyses depend on groups of scores and a small fixed percentage of errors is tolerable, then FDR analysis is appropriate. Otherwise, if the follow-up focuses on a single example, the Bonferroni adjustment is more appropriate.

Reference: Noble, W.S. How does multiple testing correction work? Nature Biotechnology 27, 1135-1137 (2009).

March 20, 2012

Biological Pathway Standard Format

  • Systems Biology Markup Language (SBML)
    • Used mainly for representation of pathways and mathematical models. As of Feb 2008, the best-suited format for structured representation of mathematical models and simulations.
  • Proteomics Standards Initiative - Molecular Interactions (PSI-MI)
    • Designed for structured representation of experimental evidence, such as molecular interaction data.
  • Biological Pathway eXchange (BioPAX)
    • Integrates PSI-MI within a pathway representation format and provides a mechanism to store additional information.

Bioinformatics research links

List of "getting started" resources:
http://www.liacs.nl/~hoogeboo/mcb/nature_primer.html

List of biological pathway resources:
http://www.pathguide.org/

March 16, 2012

z-statistics vs t-statistics

The z-statistic uses the z-score, which measures how many standard deviations a value x lies from the mean.

z-score = (x-µ)/σ
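As a quick worked example, the z-score is the signed distance of a value from the mean, measured in standard deviations. The numbers below are made up for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


# Made-up example: population mean 100, standard deviation 15.
print(z_score(130, mu=100, sigma=15))  # two standard deviations above the mean
print(z_score(85, mu=100, sigma=15))   # one standard deviation below the mean
```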

However, µ and σ are the mean and standard deviation of the population, not of the sample, so they have to be estimated. If the sample size is large enough (usually at least 30), µ is estimated by the sample mean, while σ is estimated using the denominator (n-1) instead of n in the definition of the standard deviation, to reduce bias. As I cannot write mathematical equations here, I am giving the Wikipedia link: http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation.

According to the central limit theorem, the standard deviation of the sampling distribution of the sample mean is estimated as s/√n, where s is the sample standard deviation. The z-score is then calculated using this value, and the 68-95-99.7 rule is applied to calculate the desired probability.
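The two ingredients above can be checked with the Python standard library: the standard error s/√n (where statistics.stdev already uses the n-1 denominator) and the fraction of a normal distribution lying within 1, 2 and 3 standard deviations of the mean. The sample is a made-up list of 36 values and the function names are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def standard_error(sample):
    """Estimate s / sqrt(n); statistics.stdev uses the (n - 1) denominator."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)


sample = list(range(1, 37))  # made-up sample of 36 values
print(standard_error(sample))
for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))  # roughly 0.68, 0.95, 0.997
```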

But when the sample size is not large enough (< 30), σ tends to be underestimated. Consequently, the z-score (or z-statistic) does not work well. In this case, the t-statistic is used.

In summary, if the sample size is large enough (>= 30), use the z-statistic; otherwise use the t-statistic.

Here is the video from Khan Academy.


Central Limit Theorem

A little formally, the probability distribution of the sum (or average) of i.i.d. variables with finite variance approaches a normal distribution as the number of variables grows.

Informally, say we have a probability distribution over integers with mean µ and standard deviation σ. If we randomly take N numbers and calculate the mean of those numbers (call it the "sample mean"), then the distribution of the "sample mean" will be approximately normal (with the same mean µ and standard deviation σ/√N).

As N increases, the standard deviation will decrease and the distribution will look more like a true normal distribution.
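A small simulation can make this concrete. The sketch below assumes a fair six-sided die as the integer distribution, draws many samples of size N = 30, and compares the mean and spread of the sample means with µ and σ/√N. The die, the sample size, the number of repetitions and the seed are arbitrary choices for illustration.

```python
import random
import statistics


def sample_means(population, sample_size, repetitions, seed=0):
    """Draw `repetitions` samples (with replacement) and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(repetitions)
    ]


die = [1, 2, 3, 4, 5, 6]
mu = statistics.fmean(die)      # population mean, 3.5
sigma = statistics.pstdev(die)  # population standard deviation, about 1.71

means = sample_means(die, sample_size=30, repetitions=5000)
print(statistics.fmean(means), mu)                 # the two should be close
print(statistics.stdev(means), sigma / 30 ** 0.5)  # the two should be close
```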

A nice explanation from Khan Academy.

March 15, 2012

Common disease - Rare Variant

The Common Disease - Rare Variant (CD-RV) hypothesis is an alternative to the Common Disease - Common Variant (CD-CV) hypothesis. It says that a common disease is caused by multiple strong-effect variants, each of which is found in only a few people. Instead of a common signpost pointing to a common weak-effect variant, it might be pointing to many strong-effect variants. According to CD-RV, a few people have one strong-effect variant (which causes the disease), a few have another, and so on.

This hypothesis does not necessarily claim that a common disease cannot have common variants. Rather, it suggests that rare variants require careful consideration. Here, whole-genome sequencing, instead of SNP genotyping, can help a lot.

Two important links:
  1. http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1000294
  2. http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1000293

Common Disease - Common Variant

The Common Disease - Common Variant (CD-CV) hypothesis says that a common disease-causing allele (or variant) will be found in all human populations in which that disease is common.

Common variants (not necessarily disease-causing) are known to exist in the coding and regulatory sequences of genes. CD-CV says that some of these variants cause the disease.

A common example is the SNP (Single Nucleotide Polymorphism), a single-nucleotide base change in DNA. SNP variants tend to be common across different human populations. These polymorphisms have been valuable as "markers" in the search for common variants causing a common disease.

In a complex disease, the effect (additive or multiplicative) of a variant at any one gene will be very small, and the variant will be close to evolutionarily neutral, because so many genes influence a complex disease.

Glossary

Etiology : the study of causes (of a disease)

Odds Ratio

Odds : the ratio of the probability that an event occurs to the probability that it does not occur

odds = p / (1-p)

Odds Ratio (OR): The odds ratio is a measure of effect size [1], describing the strength of association or non-independence between two binary data values.

Mathematical definition in terms of group-wise odds : Odds ratio is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group. If the probabilities of the event in each of the groups are p1 (first group) and p2 (second group), then

OR = {p1/(1-p1)} / {p2/(1-p2)}


OR = 1 : The event is equally likely to occur in both groups.
OR > 1 : The event is more likely to occur in first group.
OR < 1 : The event is more likely to occur in second group.
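The definitions of odds and odds ratio above translate directly into code. The probabilities below are made up, and the function names are hypothetical.

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_ratio(p1, p2):
    """Odds of the event in the first group divided by its odds in the second (needs p2 > 0)."""
    return odds(p1) / odds(p2)


# Made-up probabilities of the event in two groups.
print(odds(0.5))             # even odds
print(odds_ratio(0.6, 0.3))  # above 1: more likely in the first group
print(odds_ratio(0.1, 0.4))  # below 1: more likely in the second group
```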

There are other types of mathematical definition in different contexts which are found here.

[1] Effect Size : In statistics, an effect size is a measure of the strength of the relationship between two variables in a statistical population.

March 14, 2012

Talk - Introduction to Population Genetics (2010)

 

Topics:
  • Human genetic variation / diversity (including continental variation)
  •  Ancestry & race
  • Linkage disequilibrium (with mutation and recombination)

Recombination Hotspot

Recombination hotspots are regions in a genome that exhibit elevated rates of recombination, relative to a neutral expectation. The peak recombination rate within hotspots can be hundreds or thousands of times that of the surrounding region.

60% of crossovers occur only in 10% of the genome. [http://www.youtube.com/watch?v=ZPnLTmJfUu0&feature=g-hist&context=G2ac1660AHT179vAAAAA Time:1:17:40]

March 6, 2012

Basic Terms in population genetics

Population Genetics: Study of naturally occurring genetic differences among organisms.

Genetic Polymorphism: Genetic differences that are common among organisms of the same species.
Genetic Divergence: Genetic differences that accumulate between species.

So, Population Genetics is the study of genetic polymorphism and divergence.

Gene: Roughly, a gene is a physical entity that is transmitted from parent to offspring during reproduction and that influences heredity.

Genotype: Set of genes present in an individual.
Phenotype: Physical or biochemical expression of genotype.

The same genotype can result in different phenotypes depending on environmental factors, and the same phenotype can result from two or more genotypes. Although genes do not fully determine complex phenotypes, owing to interacting genes and environmental factors, genes do determine molecular phenotypes.

Allele: Genes can exist in different forms or states. The alternative forms of a gene are called alleles.


A gene corresponds to a specific sequence of constituents (called nucleotides) along DNA. The different sequences of nucleotides that may occur in a gene represent its alleles.

Transcription is the process in which the sequence of nucleotides present in one DNA strand of a gene is faithfully copied into the nucleotides of an RNA molecule. RNA has the nucleotides A, U (instead of T in DNA), G and C.
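As a toy illustration of the copying step, the sketch below assumes the input is the coding (sense) strand written 5' to 3', so the RNA has the same sequence with U in place of T. Working from the template strand would require taking the complement instead. The function name and the example sequence are made up.

```python
def transcribe(coding_strand):
    """Return the RNA sequence for a DNA coding strand (T replaced by U)."""
    dna = coding_strand.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"not DNA nucleotides: {sorted(unexpected)}")
    return dna.replace("T", "U")


print(transcribe("ATGGCCTTTAAA"))  # same sequence, with U instead of T
```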

After transcription, certain segments of the RNA transcript are removed by splicing. The eliminated segments are known as introns. The regions between the introns that remain in the fully processed RNA are called exons.

In addition to splicing, RNA processing also includes modifications to both ends of the RNA transcript. The fully processed RNA constitutes the messenger RNA (mRNA).


mRNA undergoes translation on ribosomes in the cytoplasm to produce a polypeptide. In mRNA, each adjacent group of 3 nucleotides constitutes a codon. Each codon specifies a corresponding amino acid, the subunit of the polypeptide chain.
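Reading an mRNA codon by codon can be sketched as follows. The codon table is deliberately partial (only a handful of entries from the standard genetic code), reading simply starts at the first nucleotide, and the names are hypothetical.

```python
# Partial codon table for illustration only; None marks a stop codon.
PARTIAL_CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GCC": "Ala", "AAA": "Lys", "GGA": "Gly",
    "UAA": None, "UAG": None, "UGA": None,
}


def split_codons(mrna):
    """Split an mRNA string into consecutive 3-nucleotide codons, dropping any remainder."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


def translate(mrna):
    """Translate codons to amino acids until a stop codon (partial table only)."""
    peptide = []
    for codon in split_codons(mrna):
        amino_acid = PARTIAL_CODON_TABLE[codon]  # KeyError if codon is not in the partial table
        if amino_acid is None:
            break
        peptide.append(amino_acid)
    return peptide


print(split_codons("AUGGCCAAAUAA"))
print(translate("AUGGCCAAAUAA"))
```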

Genome: The totality of DNA in a cell is the genome.

Within a cell, genes are arranged in linear order along the chromosomes. The position of a gene along the chromosome is called its locus. In a diploid eukaryote, every individual carries 2 alleles at any locus, one from the mother and the other from the father. If both alleles are the same, the individual is called homozygous at that locus. If they are different, the individual is called heterozygous.

Each human reproductive cell contains a complete set of 23 chromosomes. The human genome contains roughly 20,000 protein-coding genes, so a chromosome carries somewhat under 1,000 of them on average. The haploid genome is approximately 3x10^9 base pairs in size.