A Significance Threshold Criterion for Large-scale Multiple Tests

Cheng Cheng, Ph.D.
St. Jude Children's Research Hospital
Memphis, TN

Abstract

 Many contemporary statistical applications, such as data mining and the analysis
of microarray gene expression data, require performing thousands or tens of
thousands of hypothesis tests in a single data analysis project. There now
appears to be a consensus that controlling the false discovery rate (FDR) is
preferable to controlling the family-wise type-I error rate in such
applications. It is natural and practical to reject all the null hypotheses
with corresponding p-values less than a certain significance threshold; the
key is to determine such a threshold given the p-values. Benjamini and Hochberg
developed a simple procedure to generate the threshold by controlling the FDR
at a pre-specified level. Storey considered the problem from an estimation point
of view by providing an estimator of the FDR at any pre-specified significance
threshold. In practice, however, it can be difficult to strike a meaningful
balance between the significance threshold and the FDR level, so further
statistical guidance may be desirable. This research develops a practical
significance threshold determination criterion, the profile information criterion
(PIC), to complement the existing FDR methods. Genovese and Wasserman considered
the total misclassification risk and an FDR-penalized minimization of the false
non-discovery rate (FNR). In contrast to these criteria, the driving term of PIC
is a functional of the p-value uniform quantile process reflecting the "stochastic
smallness" of the p-values relative to U(0,1), instead of the FNR, and the FDR penalty is
applied in a different way. A simulation study of PIC, along with some theoretical
rationale connecting PIC to asymptotic minimax estimation, will be presented.
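
 For illustration, the two existing approaches mentioned above can be sketched in
a few lines of Python. This is a minimal sketch and not the PIC criterion itself.
The function names bh_threshold and storey_fdr_estimate, the tuning parameter lam,
and the example p-values are assumptions introduced here. The first function
applies the usual Benjamini-Hochberg step-up rule, which rejects at "less than or
equal to" the threshold. The second uses the commonly cited form of Storey's
estimator, with the proportion of true nulls estimated from the p-values above lam.

```python
def bh_threshold(pvalues, q):
    """Benjamini-Hochberg step-up threshold at FDR level q.

    Returns the largest ordered p-value p_(k) with p_(k) <= k * q / m,
    or 0.0 when no p-value qualifies (nothing is rejected).
    """
    m = len(pvalues)
    threshold = 0.0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k * q / m:
            threshold = p  # keep the largest qualifying ordered p-value
    return threshold


def storey_fdr_estimate(pvalues, t, lam=0.5):
    """Estimated FDR when rejecting every p-value <= t.

    lam is an assumed tuning parameter: p-values above lam are treated
    as coming mostly from true null hypotheses.
    """
    m = len(pvalues)
    # Estimated proportion of true nulls, capped at 1.
    pi0 = min(sum(1 for p in pvalues if p > lam) / (m * (1.0 - lam)), 1.0)
    # Number of rejections at threshold t (at least 1 to avoid dividing by 0).
    rejected = max(sum(1 for p in pvalues if p <= t), 1)
    return min(pi0 * m * t / rejected, 1.0)


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    t = bh_threshold(pvals, q=0.05)
    print("BH threshold:", t)
    print("Estimated FDR at that threshold:", storey_fdr_estimate(pvals, t))
```

 The first function fixes the FDR level and returns a threshold; the second fixes
the threshold and returns an FDR estimate. These are the two directions between
which the abstract says a balance must be struck.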