
asa.phys.es.04

Sponsoring Section/Society: ASA-SPES

Session Slot: 10:30-12:20 Wednesday

Estimated Audience Size: 60-80

AudioVisual Request: Two Overheads

Session Title: Chemometrics

Theme Session: No

Applied Session: Yes

Session Organizer: Remund, Kirk, Pacific Northwest National Laboratory

Address: Research Scientist - Statistics Group Pacific Northwest National Laboratory - Battelle

Phone: 509-372-4729

Fax: 509-375-3614

Email: km_remund@pnl.gov

Session Timing: 110 minutes total (Sorry about format):

Opening Remarks by Chair - 5 or 0 minutes
First Speaker - 40 minutes
Second Speaker - 40 minutes
Floor Discussion - 25 minutes

Session Chair: Remund, Kirk, Pacific Northwest National Laboratory

Address: Senior Research Scientist - Statistics Group Pacific Northwest National Laboratory - Battelle

Phone: 509-372-4729

Fax: 509-375-2604

Email: km_remund@pnl.gov

1. Variable Selection in Chemometrics

Spiegelman, Cliff, Texas A&M University

Address: Dept. Statistics Texas A&M University College Station, TX

Phone: 409-845-8887

Fax: 409-845-3144

Email: cliff@stat.tamu.edu

McShane, Mike, Texas A&M University

Coté, Gerry, Texas A&M University

Abstract: Modern chemical instruments frequently generate data with thousands to tens of thousands of variables. In many situations, known inputs influence only a few of these variables. Two excellent MATLAB toolboxes provide libraries of techniques for selecting variables from typical chemometric data. The best known is the PLS_Toolbox; the other is the Calibration Toolbox. This talk focuses on the methods used in the Calibration Toolbox and the theory underpinning them. These methods are easy to understand and work reasonably well for several types of chemometric data. We compare our results to those obtained from the highly reputed PLS_Toolbox.

2. Variable Selection for PLS Calibration Models by Genetic Algorithms

Wise, Barry M., Eigenvector Research, Inc.

Address: 830 Wapato Lake Road, Manson, WA, 98831 USA

Phone: 509-687-2022

Fax: 509-687-7033

Email: bmw@eigenvector.com

Gallagher, Neal B., Eigenvector Research, Inc.

Abstract: When developing predictive models, there are often many predictor variables available on which the models may be based. This is particularly true in spectroscopy, where it is common to have thousands of variables from which to choose. Not all variables are equally useful for prediction. Modern regression methods, such as Partial Least Squares (PLS, also known as Projection to Latent Structures), reduce the effect of extraneous variables by emphasizing variables that are correlated with each other and with the variable to be predicted. In practice, however, extraneous variables seldom end up with regression coefficients that are identically zero. As a result, these variables contribute to the prediction error on new samples. It would therefore be desirable to eliminate them from the regression models entirely.

Several approaches to variable selection have been demonstrated. These include stepwise regression methods and other methods based on statistical significance tests. Genetic algorithms (GAs) have also been applied to the variable selection problem. In general, the goal of these methods is to determine a set of predictor variables that can be used with Ordinary Least Squares (OLS) to form predictive models. When OLS is used in this way, however, the multivariate advantage, i.e. the averaging effect of combining many variables, is lost.

Other variable selection methods have been developed for use with PLS models, such as Global Optimization of Linear Prediction Error (GOLPE) and Interactive Variable Selection (IVS). Both of these methods rely to some extent on cross-validation to determine which variables will be retained. Cross-validation, however, can "misbehave" when an optimization is done with respect to it, particularly when there are many degrees of freedom available to the optimization.
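The way cross-validation can "misbehave" under optimization can be illustrated with a small simulation. The sketch below is not GOLPE or IVS; it is a deliberately simple stand-in that assumes a pure-noise response and picks, out of many pure-noise predictors, the single one with the lowest leave-one-out error. The function names, sample sizes, and seed are all hypothetical choices made for illustration.

```python
import random

def loo_cv_mse(x, y):
    """Leave-one-out CV mean squared error of a one-predictor least-squares line."""
    n = len(y)
    total = 0.0
    for i in range(n):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        mx, my = sum(xs) / (n - 1), sum(ys) / (n - 1)
        sxx = sum((a - mx) ** 2 for a in xs)
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        slope = sxy / sxx if sxx else 0.0  # constant x falls back to the mean
        total += (y[i] - (my + slope * (x[i] - mx))) ** 2
    return total / n

def selection_bias_demo(n_samples=20, n_vars=500, seed=0):
    """Pick the noise predictor with the lowest CV error (assumed toy setup)."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_samples)]
    cols = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_vars)]
    cv_best = min(loo_cv_mse(c, y) for c in cols)
    # Baseline: the same CV applied to a mean-only model (constant predictor).
    baseline = loo_cv_mse([0.0] * n_samples, y)
    return cv_best, baseline

if __name__ == "__main__":
    cv_best, baseline = selection_bias_demo()
    print("CV error of selected noise variable:", cv_best)
    print("CV error of mean-only model:        ", baseline)
```

Because no predictor here carries any information about the response, a selected variable whose cross-validated error falls below the mean-only baseline reflects the search over many candidates, not genuine predictive value.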

In this work we present a GA-based method for selecting predictor variables for PLS models, resulting in models that retain the multivariate advantage. The method also attempts to avoid the pitfalls of over-fitting the data when prediction error is optimized based on cross-validation. This is accomplished by randomizing the test sets and, in spectroscopic applications, by dividing the spectra into windows several variables wide that must be kept or deleted together. The method is demonstrated using near-infrared (NIR) spectra of diesel fuels. The GA method is compared to several other variable selection techniques.
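The general shape of such a search can be sketched as follows. This is not the authors' implementation: it assumes a one-component PLS model, truncation selection with one-point crossover, and synthetic data in place of NIR spectra. All function names and parameter values are hypothetical. What it does take from the abstract are the two safeguards: each fitness evaluation draws fresh random test sets, and variables are kept or deleted in whole windows.

```python
import random

def pls1_fit(X, y):
    """One-component PLS1 on mean-centered data; returns a prediction function."""
    n, p = len(X), len(X[0])
    xm = [sum(r[j] for r in X) / n for j in range(p)]
    ym = sum(y) / n
    Xc = [[r[j] - xm[j] for j in range(p)] for r in X]
    yc = [v - ym for v in y]
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]  # w ~ X'y
    norm = sum(v * v for v in w) ** 0.5 or 1.0
    w = [v / norm for v in w]
    t = [sum(a * b for a, b in zip(r, w)) for r in Xc]  # latent scores
    q = sum(a * b for a, b in zip(t, yc)) / (sum(v * v for v in t) or 1.0)
    return lambda x: ym + q * sum((x[j] - xm[j]) * w[j] for j in range(p))

def expand(mask, width):
    """Column indices kept by a window mask (each window is `width` columns)."""
    return [w * width + k for w, keep in enumerate(mask) if keep for k in range(width)]

def fitness(mask, X, y, width, rng, n_splits=5):
    """Mean test RMSE over freshly randomized train/test splits (lower is better)."""
    cols = expand(mask, width)
    if not cols:
        return float("inf")
    Xs = [[r[j] for j in cols] for r in X]
    idx = list(range(len(y)))  # assumes at least 3 samples
    total = 0.0
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = len(idx) // 3
        test, train = idx[:cut], idx[cut:]
        f = pls1_fit([Xs[i] for i in train], [y[i] for i in train])
        total += (sum((f(Xs[i]) - y[i]) ** 2 for i in test) / len(test)) ** 0.5
    return total / n_splits

def ga_select(X, y, width, pop_size=20, generations=15, p_mut=0.05, seed=0):
    """Search over window masks; assumes column count is a multiple of `width`."""
    rng = random.Random(seed)
    n_win = len(X[0]) // width
    pop = [[rng.random() < 0.5 for _ in range(n_win)] for _ in range(pop_size)]
    best, best_fit = None, float("inf")
    for _ in range(generations):
        scored = sorted(((fitness(m, X, y, width, rng), m) for m in pop),
                        key=lambda s: s[0])
        if scored[0][0] < best_fit:
            best_fit, best = scored[0]
        parents = [m for _, m in scored[: pop_size // 2]]  # truncation selection
        pop = list(parents)
        while len(pop) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_win)  # one-point crossover
            child = a[:cut] + b[cut:]
            pop.append([(not g) if rng.random() < p_mut else g for g in child])
    return best, best_fit

if __name__ == "__main__":
    rng = random.Random(1)
    X = [[rng.gauss(0, 1) for _ in range(12)] for _ in range(40)]
    y = [r[3] + r[4] + r[5] + 0.1 * rng.gauss(0, 1) for r in X]  # window 1 only
    print(ga_select(X, y, width=3))
```

Because the splits are redrawn at every evaluation, fitness values are noisy and the reported best error is only a rough guide; a real application would confirm the chosen windows on samples held out of the search entirely.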

Discussant: Hardin, James W., Stata Corporation

Address: Senior Statistician Stata Corporation 702 University Dr. East College Station, TX 77845

Phone: 409-696-4600

Fax: 409-696-4601

Email: tech-support@stata.com

List of speakers who are nonmembers: None


David Scott
6/1/1998