Sponsoring Section/Society: ASA-SPES
Session Slot: 10:30-12:20 Wednesday
Estimated Audience Size: 60-80
AudioVisual Request: Two Overheads
Session Title: Chemometrics
Theme Session: No
Applied Session: Yes
Session Organizer: Remund, Kirk Pacific Northwest National Laboratory
Address: Research Scientist - Statistics Group Pacific Northwest National Laboratory - Battelle
Session Timing: 110 minutes total:
Opening Remarks by Chair - 5 or 0 minutes
First Speaker - 40 minutes
Second Speaker - 40 minutes
Floor Discussion - 25 minutes
Session Chair: Remund, Kirk Pacific Northwest National Laboratory
Address: Senior Research Scientist - Statistics Group Pacific Northwest National Laboratory - Battelle
1. Variable Selection in Chemometrics
Spiegelman, Cliff, Texas A&M University
Address: Dept. Statistics Texas A&M University College Station, TX
McShane, Mike, Texas A&M University
Coté, Gerry, Texas A&M University
Abstract: Modern chemical instruments frequently generate data with thousands to tens of thousands of variables. In many situations, known inputs influence only a few of the many variables. Two excellent MATLAB toolboxes each provide a library of techniques for selecting variables from typical chemometric data. The best known is the PLS_Toolbox; the other is the Calibration Toolbox. This talk focuses on the methods used in the Calibration Toolbox and the theory underpinning those methods. The methods used in this toolbox are easy to understand and work reasonably well for several types of chemometric data. We compare our results to those obtained from the highly reputed PLS_Toolbox.
2. Variable Selection for PLS Calibration Models by Genetic Algorithms
Wise, Barry M., Eigenvector Research, Inc.
Address: 830 Wapato Lake Road, Manson, WA, 98831 USA
Gallagher, Neal B., Eigenvector Research, Inc.
Abstract: When developing predictive models, there are often many predictor variables available on which the models may be based. This is particularly true in spectroscopy, where it is common to have thousands of variables from which to choose. Not all variables are equally useful for prediction. Modern regression methods, such as Partial Least Squares (PLS, also known as Projection to Latent Structures), reduce the effect of extraneous variables by emphasizing variables which are correlated with each other and with the variable to be predicted. In practice, however, extraneous variables seldom end up with regression coefficients that are identically zero. Because of this, these variables contribute to the prediction error on new samples. Thus, it would be desirable to eliminate these variables from the regression models entirely.
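The point about nonzero coefficients can be illustrated with a minimal sketch. It assumes a one-component PLS1 model on centered, synthetic data; the function names, sample sizes, and noise levels are hypothetical choices for illustration and are not taken from the talk.

```python
import random

def center(values):
    """Subtract the mean from a list of numbers."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def pls1_one_component(X, y):
    """Regression coefficients of a one-component PLS1 model on centered data."""
    n, p = len(X), len(X[0])
    cols = [center([row[j] for row in X]) for j in range(p)]
    yc = center(y)
    # Weight vector: proportional to the covariance of each variable with y.
    w = [sum(col[i] * yc[i] for i in range(n)) for col in cols]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores t = Xw, then regress y on t to get the inner coefficient q.
    t = [sum(w[j] * cols[j][i] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    # With a single component the coefficient vector is simply w * q.
    return [wj * q for wj in w]

# Synthetic data (assumption): variables 0-2 carry the signal, 3-9 are pure noise.
rng = random.Random(0)
X, y = [], []
for _ in range(50):
    c = rng.gauss(0, 1)
    X.append([c + rng.gauss(0, 0.5) for _ in range(3)] +
             [rng.gauss(0, 1) for _ in range(7)])
    y.append(c + rng.gauss(0, 0.1))

coefs = pls1_one_component(X, y)
for j, b in enumerate(coefs):
    print(j, "signal" if j < 3 else "noise ", round(b, 4))
```

Because each weight is proportional to a sample covariance with y, the noise variables receive small but nonzero coefficients, so they still enter every prediction.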
Several approaches to variable selection have been demonstrated. These include stepwise regression methods and other methods based on statistical significance tests. Genetic algorithms (GAs) have also been applied to the variable selection problem. In general, the goal of these methods is to determine a set of predictor variables which can be used with Ordinary Least Squares (OLS) to form predictive models. When OLS is used on a small selected subset, however, the multivariate advantage, i.e., the averaging effect of combining many variables, is lost.
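The averaging effect can be sketched with a small simulation. It assumes k variables that each measure the same quantity with independent noise of equal variance, which is a simplification of real spectra; the function name and the constants are hypothetical.

```python
import random
import statistics

def estimate_spread(k, trials=2000, noise_sd=1.0, seed=0):
    """Standard deviation of an estimate formed by averaging k noisy readings."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Each reading is the true value 5.0 plus independent Gaussian noise.
        readings = [5.0 + rng.gauss(0, noise_sd) for _ in range(k)]
        estimates.append(sum(readings) / k)
    return statistics.pstdev(estimates)

for k in (1, 5, 25):
    print(k, round(estimate_spread(k), 3))
```

Under these assumptions the spread shrinks roughly as one over the square root of k, which is the benefit a model gives up when it keeps only a handful of variables.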
Other variable selection methods have been developed for use with PLS models, such as Global Optimization of Linear Prediction Error (GOLPE) and Interactive Variable Selection (IVS). Both of these methods rely to some extent on cross-validation to determine which variables will be retained. Cross-validation, however, can "misbehave" when an optimization is done with respect to it, particularly when there are many degrees of freedom available to the optimization.
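A minimal sketch of how cross-validation can "misbehave" under optimization follows. It assumes a deliberately extreme setting in which every candidate variable is pure noise and the search simply picks the single variable with the lowest leave-one-out error; the function names and sizes are hypothetical and this is not the GOLPE or IVS procedure.

```python
import random
import statistics

def loo_cv_mse(x, y):
    """Leave-one-out CV mean squared error of a one-variable least-squares fit."""
    n = len(y)
    total = 0.0
    for k in range(n):
        xs, ys = x[:k] + x[k + 1:], y[:k] + y[k + 1:]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((a - mx) ** 2 for a in xs)
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        total += (y[k] - (my + slope * (x[k] - mx))) ** 2
    return total / n

def selection_demo(n=20, p=200, seed=0):
    """Pick the noise variable with the lowest CV error among p candidates."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n)]
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    cv = [loo_cv_mse(col, y) for col in columns]
    best = min(range(p), key=cv.__getitem__)
    return best, cv

best, cv = selection_demo()
print("lowest CV error:", round(cv[best], 3))
print("median CV error:", round(statistics.median(cv), 3))
```

Here y is independent of every candidate, so no model of this form can have an expected squared error on new samples below the noise variance of y (1.0 in this setup). The lowest cross-validated error found by the search is therefore an optimistic figure produced by the search itself, and the effect grows with the number of candidates examined.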
In this work we present a GA-based method for selecting predictor variables for PLS models, resulting in models which retain the multivariate advantage. In addition, the method attempts to avoid the pitfalls of over-fitting data when prediction error is optimized based on cross-validation. This is accomplished by randomization of test sets and, in spectroscopic applications, by dividing the spectra into windows several variables wide which must be kept or deleted together. The method is demonstrated using near-infrared (NIR) spectra of diesel fuels. The GA method is compared to several other variable selection techniques.
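The two safeguards described above, window-level keep/delete decisions and freshly randomized test sets, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the GA operators and parameter values are hypothetical, and the model inside the fitness function is a one-variable least-squares fit on the mean of the kept variables, used here as a lightweight stand-in for PLS.

```python
import random

def expand_windows(mask, width, n_vars):
    """Map a window-level keep/delete mask to the variable indices it keeps."""
    keep = []
    for w, bit in enumerate(mask):
        if bit:
            keep.extend(range(w * width, min((w + 1) * width, n_vars)))
    return keep

def split_rmse(X, y, cols, rng, n_splits=5, test_frac=0.3):
    """Average test RMSE over freshly randomized train/test splits.

    Stand-in model (assumption, not PLS): regress y on the mean of the
    selected columns with a one-variable least-squares fit.
    """
    if not cols:
        return float("inf")  # a mask that keeps nothing is never acceptable
    score = [sum(row[c] for c in cols) / len(cols) for row in X]
    n = len(y)
    n_test = max(1, int(n * test_frac))
    total = 0.0
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)  # new random test set for every evaluation
        test, train = idx[:n_test], idx[n_test:]
        mx = sum(score[i] for i in train) / len(train)
        my = sum(y[i] for i in train) / len(train)
        sxx = sum((score[i] - mx) ** 2 for i in train)
        sxy = sum((score[i] - mx) * (y[i] - my) for i in train)
        slope = sxy / sxx if sxx > 0 else 0.0
        sse = sum((y[i] - (my + slope * (score[i] - mx))) ** 2 for i in test)
        total += (sse / n_test) ** 0.5
    return total / n_splits

def ga_select(X, y, width, rng, pop_size=20, generations=15, p_mut=0.05):
    """Evolve window masks; returns the best mask in the final population."""
    n_vars = len(X[0])
    n_win = -(-n_vars // width)  # ceiling division
    pop = [[rng.randint(0, 1) for _ in range(n_win)] for _ in range(pop_size)]

    def fit(mask):
        return split_rmse(X, y, expand_windows(mask, width, n_vars), rng)

    for _ in range(generations):
        ranked = sorted(pop, key=fit)
        parents = ranked[: pop_size // 2]
        children = [ranked[0][:]]  # keep the current best mask unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
    return min(pop, key=fit)

def make_data(rng, n=60, n_vars=40):
    """Synthetic 'spectra': only variables 10-14 respond to the property y."""
    X, y = [], []
    for _ in range(n):
        c = rng.gauss(0, 1)
        row = [rng.gauss(0, 1) for _ in range(n_vars)]
        for j in range(10, 15):
            row[j] += 3 * c
        X.append(row)
        y.append(c + rng.gauss(0, 0.1))
    return X, y

rng = random.Random(0)
X, y = make_data(rng)
best = ga_select(X, y, width=5, rng=rng)
print("kept windows:", [w for w, bit in enumerate(best) if bit])
```

Working on windows shrinks the search space from one bit per variable to one bit per window, and redrawing the test sets at every evaluation makes it harder for a mask to survive by fitting one particular split. Both choices reduce the degrees of freedom available to the optimization.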
Discussant: Hardin, James W. Stata Corporation
Address: Senior Statistician Stata Corporation 702 University Dr. East College Station, TX 77845
List of speakers who are nonmembers: None