One important point about P-values is that they are statistically valid when a single score is computed. In genomics, however, thousands of genes or millions of SNPs or other scores are typically tested at once, so a P-value computed for a single score no longer reflects the probability of observing such a score by chance among that large number of scores. The P-value threshold therefore has to be adjusted, as it is valid only for one score.
The most widely used method for multiple testing correction is the Bonferroni correction, which divides the significance threshold (α) by the number of tests (n). With a Bonferroni-adjusted significance threshold of α = 0.01, we can be 99% confident that none of the significant scores would be observed by chance under the null hypothesis. This is usually too strict an adjustment.
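As a minimal sketch of the procedure (the p-values below are made-up illustration data, not from the post):

```python
# Bonferroni correction: divide the significance threshold alpha
# by the number of tests n. Hypothetical p-values for illustration.
alpha = 0.01
pvalues = [0.00001, 0.0004, 0.0019, 0.012, 0.2]
n = len(pvalues)

threshold = alpha / n  # adjusted per-test threshold: 0.01 / 5 = 0.002
significant = [p for p in pvalues if p < threshold]
print(significant)  # [1e-05, 0.0004, 0.0019]
```

Equivalently, each P-value can be multiplied by n and compared against the original α; the two forms give the same decisions.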
Rather than insisting that we be 99% sure that none of the observed scores is drawn from the null hypothesis, it is frequently sufficient to accept a set of scores of which a small percentage may be drawn from the null. This is the basis of False Discovery Rate (FDR) estimation. For some score threshold t, let Sobs be the number of observed scores >= t and Snull the number of null scores >= t; the FDR is then estimated as
FDR = Snull / Sobs.
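A toy sketch of this estimate, assuming we have an empirical null (e.g., decoy or permutation) score list of the same size as the observed one; the function name and all score values are hypothetical:

```python
def fdr_at_threshold(observed, null, t):
    """Estimate FDR at score threshold t as
    (# null scores >= t) / (# observed scores >= t),
    assuming the observed and null score lists have equal size."""
    s_obs = sum(1 for s in observed if s >= t)
    s_null = sum(1 for s in null if s >= t)
    return s_null / s_obs if s_obs else 0.0

observed = [5.1, 4.2, 3.9, 3.1, 2.5, 1.8, 1.2, 0.9]  # hypothetical test scores
null = [3.0, 2.4, 1.9, 1.6, 1.1, 0.8, 0.5, 0.2]      # hypothetical null scores
print(fdr_at_threshold(observed, null, 3.0))  # 1 null / 4 observed = 0.25
```

In other words, at threshold t = 3.0 we accept four observed scores while the null distribution suggests about one of them could have arisen by chance.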
One limitation of the FDR is that it is not necessarily monotonic with respect to the score threshold; this is addressed by another metric, the q-value, defined as the minimum FDR at which a given score is deemed significant.
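Continuing the same toy setup, q-values can be sketched by computing the FDR with the threshold set at each observed score and then enforcing monotonicity with a running minimum taken from the worst score upward. The function name and all data here are hypothetical:

```python
def q_values(observed, null):
    """Assign each observed score a q-value: the minimum FDR over all
    thresholds at which that score is still called significant.
    Assumes observed and null score lists have equal size."""
    obs_sorted = sorted(observed, reverse=True)
    fdrs = []
    for rank, t in enumerate(obs_sorted, start=1):
        n_null = sum(1 for s in null if s >= t)
        fdrs.append(n_null / rank)  # FDR with the threshold at this score
    # running minimum from the lowest score upward makes q monotonic
    qs, best = [], float("inf")
    for f in reversed(fdrs):
        best = min(best, f)
        qs.append(best)
    qs.reverse()
    return list(zip(obs_sorted, qs))

observed = [5.0, 4.0, 3.0, 2.0]  # hypothetical test scores
null = [3.5, 2.5, 1.5, 0.5]      # hypothetical null scores
for score, q in q_values(observed, null):
    print(score, q)
```

Unlike raw FDR estimates, the resulting q-values never decrease as the scores get worse, so reporting all scores with q <= 0.05 yields a well-defined significant set.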
The question then arises: is the Bonferroni correction, the most widely used method, preferable in any circumstance? The answer depends on the trade-off between the costs and benefits associated with false positives and false negatives. As a guideline: if follow-up analyses depend on a group of scores and a small, fixed percentage of errors is tolerable, then FDR analysis is appropriate; if the follow-up focuses on a single example, the Bonferroni adjustment is more appropriate.
Reference: Noble, W.S. How does multiple testing correction work? Nature Biotechnology 27, 1135-1137 (2009).