Regression and variable selection in large p, small n problems

Wei-Yin Loh

Department of Statistics
University of Wisconsin, Madison, WI 53706


Data sets with very many predictor variables and few samples (such as microarray data) often present special difficulties for model fitting. If some of the variables are noise, then as their number grows the prediction accuracy of any sensible model-fitting algorithm will eventually decay. One way to delay the onset of this decay is to reduce the number of variables by first eliminating those that do not significantly affect the response. Although variable selection techniques for linear models have long been available, corresponding techniques for nonparametric models have been considered only very recently. One technique uses variable importance scores from Random Forest (Tuv, Borisov, and Torkkola, 2006); another is EARTH, a local polynomial-based tube-hunting method (Doksum, Tang, and Tsui, 2007). We introduce a third approach, based on the GUIDE (Loh, 2002) regression tree algorithm, and compare the computational and statistical effectiveness of the methods on real and simulated data sets.
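The screening idea described above can be sketched as follows. This is a minimal illustration using scikit-learn's Random Forest importance scores on simulated "large p, small n" data, not the actual procedure of Tuv, Borisov, and Torkkola (2006); the mean-importance cutoff and all simulation settings are ad hoc choices for illustration only.

```python
# Sketch: screen out apparently irrelevant predictors via Random Forest
# variable importance before fitting a final model.
# Simulated data: n = 50 samples, p = 200 predictors, only the first
# three predictors affect the response (assumed setup, for illustration).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - 3 * X[:, 1] + X[:, 2] + 0.5 * rng.standard_normal(n)

rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X, y)

# Keep predictors whose importance exceeds the mean importance
# (a simple ad hoc threshold; the cited methods use more
# principled cutoffs and significance assessments).
importances = rf.feature_importances_
keep = np.flatnonzero(importances > importances.mean())
print(f"kept {len(keep)} of {p} predictors:", keep[:10])
```

A subsequent model would then be fit to the retained columns `X[:, keep]` only, so that its accuracy is no longer dragged down by the bulk of the noise variables.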