Statistical challenges in analyzing mass
spectrometry proteomic data

Xihong Lin
Department of Biostatistics
School of Public Health
Harvard University


In high-throughput mass spectrometry (MS) proteomic experiments, a large number of peptides and proteins can be detected and quantified simultaneously. Such techniques have good potential for discovering new disease biomarkers. The resulting data (spectra) are large and can be treated as finely sampled functions. Most existing MS analyses apply multiple ad hoc methods in sequence to pre-process the data, such as baseline subtraction, truncation, normalization, peak detection, and peak alignment. We will discuss challenges in analyzing MS proteomic data and propose a unified statistical framework for pre-processing and post-processing mass spectra using advanced nonparametric regression and functional data analysis techniques in conjunction with statistical learning methods. We stress that pre-processing is critical in the analysis of MS proteomic data. We apply the methodology to a motivating data set from a study of lung cancer patients whose serum samples were collected and processed using a surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) MS instrument.