
Research
Dr. Vannucci's research focuses on the theory and practice of Bayesian variable selection techniques and on the development of wavelet-based statistical models and their application. Her work is often motivated by real problems that need to be addressed with suitable statistical methods. Methodologies developed by Dr. Vannucci for variable selection have found applications in chemometrics and in bioinformatics. Her work on the development of Bayesian wavelet-based models for functional data was the first contribution to the use of wavelet methods with multiple curves. Recent work of Dr. Vannucci has focused on structural bioinformatics and, in particular, on the important problem of protein structure prediction.
Research on Bayesian Methods for Variable Selection
Dr. Vannucci's initial work on Bayesian variable selection was in multivariate regression settings, extending stochastic variable selection methods that use prior distributions with a spike at zero, [1][3]. The proposed methodologies apply to cases with a large number of explanatory variables, typically many more than the number of observations. The motivating applications were chemometrics datasets involving near-infrared spectra, [2]. Later work developed new ways of performing variable selection in model-based classification with probit models, [4], and in sample clustering, [5][6]. In clustering, the problem can be formulated in terms of either finite mixture models or infinite mixtures of distributions via Dirichlet process mixtures (the latter avoiding reversible jump MCMC). Variable selection is performed through a binary latent vector that indicates discriminating and non-discriminating covariates and is updated via stochastic search techniques. Other developments extend the Bayesian variable selection techniques to the analysis of censored survival data via accelerated failure time models, [7].
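The stochastic search over a binary inclusion vector described above can be sketched in a few lines of Python. This is only an illustration for a single-response linear regression, not the Matlab code or the models of [1]-[7]: the Zellner g-prior marginal likelihood, the independent Bernoulli prior on each indicator, the single-flip Metropolis move, and all function names and default values (`g`, `w`, `n_iter`) are assumptions made to keep the example short.

```python
import numpy as np


def log_marginal(y, X, gamma, g=100.0):
    """Log marginal likelihood, up to a constant, of the model selected by gamma.

    Assumed setup (not from the source): no-intercept linear model, Zellner
    g-prior on the included coefficients, prior 1/sigma^2 on the error variance.
    """
    n = y.shape[0]
    idx = np.flatnonzero(gamma)
    yty = float(y @ y)
    fit = 0.0
    if idx.size > 0:
        Xg = X[:, idx]
        beta_hat, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        fit = float(y @ (Xg @ beta_hat))  # y' P y, P = projection onto span(Xg)
    shrunk_rss = yty - g / (1.0 + g) * fit
    return -0.5 * idx.size * np.log1p(g) - 0.5 * n * np.log(shrunk_rss)


def stochastic_search(y, X, n_iter=2000, w=0.1, g=100.0, seed=0):
    """Metropolis search over gamma; returns visit frequencies per covariate."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]

    def log_post(gm):
        k = int(gm.sum())
        # Independent Bernoulli(w) prior on each inclusion indicator.
        log_prior = k * np.log(w) + (p - k) * np.log1p(-w)
        return log_marginal(y, X, gm, g) + log_prior

    gamma = np.zeros(p, dtype=bool)  # start from the empty model
    current = log_post(gamma)
    counts = np.zeros(p)
    for _ in range(n_iter):
        # Propose flipping one randomly chosen indicator (add or delete move).
        j = rng.integers(p)
        proposal = gamma.copy()
        proposal[j] = not proposal[j]
        candidate = log_post(proposal)
        # The proposal is symmetric, so the acceptance ratio is the posterior ratio.
        if np.log(rng.random()) < candidate - current:
            gamma, current = proposal, candidate
        counts += gamma
    return counts / n_iter


if __name__ == "__main__":
    # Simulated data: 50 observations, 20 covariates, only the first 3 matter.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 20))
    beta = np.zeros(20)
    beta[:3] = 2.0
    y = X @ beta + rng.normal(size=50)
    print(np.round(stochastic_search(y, X), 2))
```

The returned frequencies act as rough estimates of marginal posterior inclusion probabilities. Because the search only ever fits the columns currently flagged by the indicator vector, the same idea remains usable when the number of covariates exceeds the number of observations, provided the prior favors small models.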
Research supported by NIH-NHGRI, R01-HG003319.
Relevant references:
Research on Wavelet-based Statistical Modeling
Research supported by NSF-CAREER award and NSF-DMS 0605001.
Relevant references:
Research on Biomarker Discovery in Large-Scale Genomic and Proteomic Data
An important aspect of bioinformatics is the integration of data of different forms. In [2] a Bayesian model that combines DNA microarray data with genome sequence information is used to refine the search for DNA regulatory motifs, that is, DNA fragments in the upstream regions of genes to which transcription factors bind to regulate transcription. Some known regulatory binding sites were identified, as well as new motifs that constitute promising sets for further assessment.
Research supported by NIH-NHGRI, R01-HG003319.
Relevant references:
Research on Bayesian Nonparametrics for Structural Bioinformatics
Research supported by NIH-NIGMS, R01-GM81631.
Relevant references:
[1] Brown, P.J., Vannucci, M. and Fearn, T. (1998). Multivariate Bayesian variable selection and prediction. Journal of the Royal Statistical Society, Series B, 60(3), 627-641. Matlab code used in this paper is available.
[2] Brown, P.J., Vannucci, M. and Fearn, T. (1998). Bayesian wavelength selection in multicomponent analysis. Journal of Chemometrics, 12(3), 173-182. Matlab code used in this paper is available.
[3] Brown, P.J., Vannucci, M. and Fearn, T. (2002). Bayes model averaging with selection of regressors. Journal of the Royal Statistical Society, Series B, 64(3), 519-536. The dataset used in this paper and the Matlab code are available.
[4] Sha, N., Vannucci, M., Tadesse, M.G., Brown, P.J., Dragoni, I., Davies, N., Roberts, T.C., Contestabile, A., Salmon, N., Buckley, C. and Falciani, F. (2004). Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics, 60, 812-819.
[5] Tadesse, M.G., Sha, N. and Vannucci, M. (2005). Bayesian variable selection in clustering high-dimensional data. Journal of the American Statistical Association, 100, 602-617.
[6] Kim, S., Tadesse, M.G. and Vannucci, M. (2006). Variable selection in clustering via Dirichlet process mixture models. Biometrika, 93(4), 877-893.
[7] Sha, N., Tadesse, M.G. and Vannucci, M. (2006). Bayesian variable selection for the analysis of microarray data with censored outcome. Bioinformatics, 22(18), 2262-2268. (supplementary material).
Dr. Vannucci's early work on wavelets was on Bayesian wavelet shrinkage, [1]. Later work focused on the development of Bayesian methods for wavelet-based modeling. Her work on functional data, in particular, was the first contribution to the use of wavelet methods for dimension reduction when multiple curves are under study. Contributions include curve regression models that relate a multivariate response to functional predictors, [2], nonparametric modeling of hierarchical functions with a novel application to experimental data arising from carcinogen-induced colon cancer in rodent models, [3], and curve classification settings with binary and multinomial responses, [5]. Other work on wavelet-based methods relates broadly to the modeling of time series data, with various applications. Contributions include Bayesian models for long memory data, [6][8], and methods for variance change point detection, [4], with recent applications to the denoising of mass spectrometry data, [7].
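To make the idea of wavelet shrinkage concrete, here is a minimal Python sketch of the general recipe: transform a curve to wavelet coefficients, shrink the small detail coefficients, and transform back. It is not the Bayesian shrinkage developed in [1] nor the thresholding method of [7]. The Haar wavelet, the hard threshold, the noise estimate from the finest level, and the function names are all assumptions chosen for brevity.

```python
import numpy as np


def haar_dwt(x):
    """Full orthonormal Haar decomposition of a signal of length 2^J.

    Returns (approx, details), where details[0] holds the finest-level
    coefficients and details[-1] the coarsest.
    """
    approx = np.asarray(x, dtype=float)
    n = approx.size
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    details = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details


def haar_idwt(approx, details):
    """Inverse of haar_dwt: rebuild the signal from coarse to fine."""
    approx = np.asarray(approx, dtype=float)
    for d in reversed(details):
        out = np.empty(2 * approx.size)
        out[0::2] = (approx + d) / np.sqrt(2.0)
        out[1::2] = (approx - d) / np.sqrt(2.0)
        approx = out
    return approx


def denoise(x):
    """Hard-threshold the detail coefficients (illustrative rule only)."""
    approx, details = haar_dwt(x)
    # Robust noise scale from the finest level: median absolute value / 0.6745.
    sigma = np.median(np.abs(details[0])) / 0.6745
    # Universal threshold sigma * sqrt(2 log n).
    thresh = sigma * np.sqrt(2.0 * np.log(np.size(x)))
    kept = [np.where(np.abs(d) > thresh, d, 0.0) for d in details]
    return haar_idwt(approx, kept)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = np.concatenate([np.zeros(128), 4.0 * np.ones(128)])
    noisy = signal + 0.5 * rng.normal(size=256)
    smooth = denoise(noisy)
    print("noisy MSE:   ", np.mean((noisy - signal) ** 2))
    print("denoised MSE:", np.mean((smooth - signal) ** 2))
```

The same transform also illustrates dimension reduction for curves: after thresholding, each curve is summarized by its few surviving coefficients, and those coefficients can serve as predictors in a regression or classification model.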
[1] Vannucci, M. and Corradi, F. (1999). Covariance structure of wavelet coefficients: Theory and models in a Bayesian perspective. Journal of the Royal Statistical Society, Series B, 61(4), 971-986. Matlab code used in this paper is available (WavBox toolbox required).
[2] Brown, P.J., Fearn, T. and Vannucci, M. (2001). Bayesian wavelet regression on curves with application to a spectroscopic calibration problem. Journal of the American Statistical Association, 96, 398-408. The dataset used in this paper is available (also as part of the R package ppls - Penalized Partial Least Squares), as is the Matlab code (WavBox toolbox required).
[3] Morris, J.S., Vannucci, M., Brown, P.J. and Carroll, R.J. (2003). Wavelet-Based Nonparametric Modeling of Hierarchical Functions in Colon Carcinogenesis (with discussion). Journal of the American Statistical Association, 98, 573-597.
[4] Gabbanini, F., Vannucci, M., Bartoli, G. and Moro, A. (2004). Wavelet Packet Methods for the Analysis of Variance of Time Series with Application to Crack Widths on the Brunelleschi Dome. Journal of Computational and Graphical Statistics, 13(3), 639-658.
[5] Vannucci, M., Sha, N. and Brown, P.J. (2005). NIR and mass spectra classification: Bayesian methods for wavelet-based feature selection. Chemometrics and Intelligent Laboratory Systems, 77, 139-148.
[6] Ko, K. and Vannucci, M. (2006). Bayesian wavelet analysis of autoregressive fractionally integrated moving-average processes. Journal of Statistical Planning and Inference, 136(10), 3415-3434.
[7] Kwon, D.W., Vannucci, M., Song, J.J., Jeong, J. and Pfeiffer, R. (2008). A novel wavelet-based thresholding method for the pre-processing of mass spectrometry data that accounts for heterogeneous noise. Proteomics, 8(15), 3019-3029.
[8] Ko, K., Qu, L. and Vannucci, M. (2009). Wavelet-based Bayesian estimation of partially linear regression models with long memory errors. Statistica Sinica, 19(4), 1463-1478.
High-throughput microarray and proteomics data represent a challenge to statistical analyses because of their high dimensionality. Dr. Vannucci's work on Bayesian methods for variable selection has found useful applications in studies involving microarray data. The methodologies were used, for example, in immunologic studies on arthritis, to identify biomarkers predictive of different stages of the disease, [1], and in cancer studies, to identify genes relevant to the survival of patients, [3]. The methods were also applied to mass spectrometry data, to identify proteins linked to disease status in a study conducted at the National Cancer Institute, [4]. More recently, attention has been devoted to the pre-processing of mass spectrometry data, [6], and in particular to novel wavelet-based methods for denoising, [5].
[1] Sha, N., Vannucci, M., Tadesse, M.G., Brown, P.J., Dragoni, I., Davies, N., Roberts, T.C., Contestabile, A., Salmon, N., Buckley, C. and Falciani, F. (2004). Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics, 60, 812-819.
[2] Tadesse, M., Vannucci, M. and Lio', P. (2004). Identification of DNA regulatory motifs using Bayesian variable selection. Bioinformatics, 20(16), 2553-2561.
[3] Sha, N., Tadesse, M.G. and Vannucci, M. (2006). Bayesian variable selection for the analysis of microarray data with censored outcome. Bioinformatics, 22(18), 2262-2268. (supplementary material).
[4] Kwon, D.W., Tadesse, M.G., Sha, N., Pfeiffer, R.M. and Vannucci, M. (2007). Identifying biomarkers from mass spectrometry data with ordinal outcome. Cancer Informatics, 3, 19-28.
[5] Kwon, D.W., Vannucci, M., Song, J.J., Jeong, J. and Pfeiffer, R. (2008). A novel wavelet-based thresholding method for the pre-processing of mass spectrometry data that accounts for heterogeneous noise. Proteomics, 8(15), 3019-3029.
[6] Cruz-Marcelo, A., Guerra, R., Vannucci, M., Li, Y., Lau, C. and Man, C. (2008). Comparison of algorithms for pre-processing of SELDI-TOF mass spectrometry data. Bioinformatics, 24(19), 2129-2136.
Recent work of Dr. Vannucci has focused on structural bioinformatics and, in particular, on the important problem of modeling the so-called torsion angle pairs, generally used to describe the three-dimensional structure of proteins. In [1] a Dirichlet process mixture (DPM) model is used to estimate the distribution of the torsion angles. In [2] bivariate von Mises distributions, which properly account for the wrapping of angular data, are incorporated into the model. This modeling approach makes it possible to address the question of which distributions should be used when sampling to generate new candidate models for a protein's structure. While these works model the torsion angles at a single sequence position, in [3] a new semiparametric model is proposed for the joint distributions of angle pairs at multiple sequence positions, permitting the sharing of information across sequence positions. Results show that this strategy successfully models the notoriously difficult loop and turn regions.
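The remark about wrapping can be illustrated with a small Python sketch. It evaluates the unnormalized log density of a bivariate von Mises distribution in its sine-model form, which is assumed here because the text above does not say which variant is used in [2]. It also contrasts that density with a naive Gaussian kernel that treats angles as points on a line. The function names and parameter values are hypothetical.

```python
import numpy as np


def bvm_sine_logkernel(phi, psi, mu, nu, kappa1, kappa2, lam):
    """Unnormalized log density of the bivariate von Mises sine model.

    (mu, nu) are the mean angles, kappa1 and kappa2 the concentrations,
    and lam the dependence parameter; all angles are in radians.
    """
    return (kappa1 * np.cos(phi - mu)
            + kappa2 * np.cos(psi - nu)
            + lam * np.sin(phi - mu) * np.sin(psi - nu))


def linear_gaussian_logkernel(phi, psi, mu, nu, sd=0.5):
    """Naive comparison: independent Gaussians that ignore wrapping."""
    return -0.5 * ((phi - mu) ** 2 + (psi - nu) ** 2) / sd ** 2


if __name__ == "__main__":
    mu, nu = np.pi, 0.0  # mean torsion angles of 180 and 0 degrees
    for deg in (179.0, -179.0):
        phi = np.deg2rad(deg)
        print(deg,
              bvm_sine_logkernel(phi, 0.0, mu, nu, 5.0, 3.0, 1.0),
              linear_gaussian_logkernel(phi, 0.0, mu, nu))
```

Angles of 179 and -179 degrees are each one degree away from a mean of 180 degrees on the circle. The von Mises kernel depends on the angles only through sines and cosines, so it assigns the two points the same value, whereas the linear Gaussian kernel treats -179 degrees as almost a full turn away from the mean.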
[1] Dahl, D.B., Bohannan, Z., Mo, Q., Vannucci, M. and Tsai, J.W. (2008). Assessing side-chain perturbations of the protein backbone: A knowledge based classification of residue Ramachandran space. Journal of Molecular Biology, 378, 749-758.
[2] Lennox, K.P., Dahl, D.B., Vannucci, M. and Tsai, J.W. (2009). Density estimation for protein conformation angles using a von Mises distribution and Bayesian nonparametrics. Journal of the American Statistical Association, 104, 586-596.
[3] Lennox, K.P., Dahl, D.B., Vannucci, M., Day, R. and Tsai, J.W. (2010). A Dirichlet process mixture of hidden Markov models for protein structure prediction. Annals of Applied Statistics, accepted.
Copyright 2007-2009, Marina Vannucci