Regression and variable selection in large p, small n problems

Wei-Yin Loh
Department of Statistics
University of Wisconsin, Madison, WI 53706

Abstract
Data sets with very many predictor variables and small
numbers of samples (such as occur in microarray data) often present
special difficulties for model fitting. If some of the variables are
noise and their number grows, the prediction accuracy of any sensible
model-fitting algorithm will eventually decay. One way to delay the
onset of decay is to reduce the number of variables by first
eliminating the variables that do not significantly affect the response
variable. Although variable selection techniques for linear models have
been available for a long time, it is only very recently that
corresponding techniques for nonparametric models have been considered.
One technique uses the variable importance scores from Random Forest
(Tuv, Borisov, and Torkkola, 2006) and another is a local
polynomial-based tube-hunting method called EARTH (Doksum, Tang, and
Tsui, 2007). We introduce yet another approach based on the GUIDE (Loh,
2002) regression tree algorithm and compare the computational and
statistical effectiveness of the methods on real and simulated data sets.
