Marina Vannucci's professional page

Marina Vannucci
Noah Harding Professor and Chair of Statistics, Rice University

Research Interests

Dr. Vannucci is generally interested in the development of statistical models for complex problems. Her methodological research has focused in particular on the theory and practice of Bayesian variable selection techniques and on the development of wavelet-based statistical models and their application. She has also done recent work in the area of graphical models. Dr. Vannucci's research has often been motivated by real problems that needed to be addressed with suitable statistical methods. Methodologies developed by Dr. Vannucci have found applications in chemometrics and, more recently, in high-throughput genomics and in neuroimaging. Dr. Vannucci also has an interest in structural bioinformatics and, in particular, in the important problem of protein structure prediction.

Theory and Methods: Bayesian variable selection, Graphical Models, Statistical Computing, Wavelets.

Applications: Chemometrics, Large-scale Genomic data, Neuroimaging, Structural Bioinformatics.

Research on Wavelet-based Statistical Modeling
I have been interested in the development of wavelet-based methods since my Ph.D. studies. In my early work I considered Bayesian shrinkage models with random wavelet coefficients, incorporating efficient algorithms for the computation of their covariances (Vannucci and Corradi, 1999 J. Royal Stat Soc, Series B). Later work focused on the development of Bayesian methods for wavelet-based modeling of functional data. My work on functional data, in particular, was the first contribution to the use of wavelet methods for dimension reduction when multiple curves are under study. Methodologies included curve regression models that relate a multivariate response to functional predictors (Brown et al., 2001 J. American Stat Assoc), nonparametric modeling of hierarchical functions with a novel application to experimental data arising from carcinogen-induced colon cancer in rodent models (Morris et al., 2003 J. American Stat Assoc), and curve classification settings with binary and multinomial responses (Vannucci et al., 2005 ChemoLab). Later on, I also extended methods to discriminant analysis settings with Markov tree priors that map scale-location connections among wavelet coefficients (Stingo et al., 2012 Sinica). My other work in wavelet-based methods has broadly related to the modeling of time series data, with various applications. Contributions have included early applications of discrete wavelet transforms to the structural analysis of biological sequences, like proteins and genomes (Lio and Vannucci, 2000a,b Bioinformatics), methods for wavelet variance change point detection (Gabbanini et al., 2004 J. Comp & Graph Stats), multi-resolution variance decomposition methods for the analysis of response time in children with ADHD and healthy controls (Di Martino et al., 2008 Biological Psychiatry) and Bayesian wavelet-based models for long memory data (Ko and Vannucci, 2006 J. Stat Plan & Inf; Ko et al., 2009 Sinica), which I later incorporated into novel models for fMRI data (Jeong et al., 2013 Biometrics) (see also below).
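
As a small illustration of why wavelets are useful for dimension reduction of curves, the sketch below applies an orthonormal Haar transform to a sampled curve and reconstructs it from a handful of coefficients. It is a generic textbook example, not the Bayesian models of the papers cited above; the function names (haar_dwt, haar_idwt, keep_largest) and the toy curve are assumptions made for illustration only.

```python
import math

def haar_dwt(signal):
    """Orthonormal Haar transform of a list whose length is a power of two.

    Returns the single scaling coefficient and a list of detail levels,
    finest scale first.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / math.sqrt(2.0) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2.0) for a, b in pairs]
    return approx[0], details

def haar_idwt(scaling, details):
    """Invert haar_dwt: rebuild the signal from coarsest to finest scale."""
    approx = [scaling]
    for level in reversed(details):
        finer = []
        for s, d in zip(approx, level):
            finer.extend([(s + d) / math.sqrt(2.0), (s - d) / math.sqrt(2.0)])
        approx = finer
    return approx

def keep_largest(details, k):
    """Zero all but the k largest-magnitude detail coefficients.

    Ties at the cutoff are all kept, so slightly more than k may survive.
    """
    if k <= 0:
        return [[0.0] * len(level) for level in details]
    flat = sorted((abs(d) for level in details for d in level), reverse=True)
    cutoff = flat[min(k, len(flat)) - 1]
    return [[d if abs(d) >= cutoff else 0.0 for d in level] for level in details]

# Toy curve: a step function sampled at 8 points (illustrative data only).
curve = [4, 4, 4, 4, 1, 1, 1, 1]
scaling, details = haar_dwt(curve)
compressed = keep_largest(details, 1)
reconstructed = haar_idwt(scaling, compressed)
```

For this step-shaped curve a single detail coefficient plus the scaling coefficient is enough to recover all eight values, which is the kind of sparsity that makes the wavelet domain attractive for regression and classification with functional predictors.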

Research on Bayesian Methods for Variable Selection
My initial work on Bayesian methods for variable selection was in multivariate regression settings, extending stochastic search variable selection methods that use mixture prior distributions with a spike at zero (Brown et al., 1998, 2002 J. Royal Stat Soc, Series B). The proposed methodologies apply to cases with a large number of explanatory variables, typically many more than the number of observations. Motivating applications were to chemometrics datasets involving near-infrared spectra (Brown et al., 1998 J. Chemometrics). Later contributions were in extending Bayesian techniques for variable selection to probit models with multinomial responses (Sha et al., 2004 Biometrics) and to accelerated failure time models with censored outcomes (Sha et al., 2006 Bioinformatics). Next, I proposed innovative ways to perform variable selection in model-based sample clustering (Tadesse et al., 2005 J. American Stat Assoc; Kim et al., 2006 Biometrika). Here, the problem can be formulated either in terms of finite mixture models or infinite mixtures of distributions, via Dirichlet process mixtures (avoiding a reversible jump MCMC). Variable selection is performed through a binary latent vector that indicates discriminating and non-discriminating covariates and that is updated via stochastic search techniques. Methods were also extended to the modeling of data with a known substructure, such as the structure imposed by an experimental design (Swartz et al., 2008 JABES). Other contributions to Bayesian variable selection include early explorations of spiked Dirichlet process priors that incorporate point mass distributions in nonparametric mixture priors (Kim et al., 2009 Bayesian Analysis). Also, in Savitsky and Vannucci (2010 JPS) and Savitsky et al. (2011 Statistical Science) I extended variable selection methods to models that allow non-linear associations of a set of predictors to a response, in particular incorporating Gaussian processes in a generalized linear framework. Further developments of my work in Bayesian variable selection were motivated by interdisciplinary applications, particularly in high-throughput genomics and neuroimaging (see below).
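
The idea of a mixture prior with a spike at zero can be made concrete in the simplest possible setting. The sketch below assumes a single observation y ~ N(beta, noise_var), with beta equal to zero under the spike and beta ~ N(0, slab_var) under the slab, and computes the posterior probability that the coefficient is included. This toy model, the function names and the default hyperparameter values are all illustrative assumptions; the regression models in the cited papers are considerably richer.

```python
import math

def normal_pdf(x, var):
    """Density of a zero-mean normal with variance var, evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def inclusion_probability(y, prior_incl=0.1, noise_var=1.0, slab_var=25.0):
    """Posterior P(gamma = 1 | y) under a spike-and-slab prior.

    gamma = 0: beta = 0 (spike), so marginally y ~ N(0, noise_var).
    gamma = 1: beta ~ N(0, slab_var) (slab), so y ~ N(0, noise_var + slab_var).
    """
    slab = prior_incl * normal_pdf(y, noise_var + slab_var)
    spike = (1.0 - prior_incl) * normal_pdf(y, noise_var)
    return slab / (slab + spike)

# An observation close to zero is attributed to the spike,
# while a large observation is attributed to the slab.
p_small = inclusion_probability(0.2)
p_large = inclusion_probability(5.0)
```

In a stochastic search, the binary indicator gamma for each variable would be updated repeatedly using probabilities of this kind, so that the sampler spends most of its time on the subsets of variables best supported by the data.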

Research on Biomarker Discovery in Large-Scale Genomic Data
High-throughput data represent a challenge to statistical analyses because of their high dimensionality. Novel methodological questions are being generated and require the integration of different concepts, methods, tools and data types. Bayesian models are a very natural and flexible framework for such goals. My interest has been in the development of unified modeling frameworks that, though motivated by specific experimental studies and datasets, can be readily applied to high-throughput data of different types, and to data from different cancers and diseases. In doing this work I have benefitted from close collaborations with several teams of life-science investigators. Initially, I applied and extended my work in Bayesian variable selection to models for DNA microarray and mass spectrometry data, with applications to immunologic studies on arthritis and cancer (Sha et al., 2004 Biometrics, 2006 Bioinformatics; Kwon et al., 2007 Cancer Informatics). Later work focused on developing integrative models that also incorporated biological prior information specific to genomic data. In Stingo et al. (2010 Annals Appl Stats) a regulatory network was inferred by using a Bayesian graphical model that integrates expression levels of microRNAs, short RNA sequences that can down-regulate the levels of mRNAs, with their potential mRNA targets. The model also incorporates sequence/structure information, via the prior probability model. In Stingo et al. (2011 Annals Appl Stats) information on pathway-gene relationships and gene-gene networks was incorporated into a Bayesian model for the analysis of DNA microarray data. This model was subsequently used to identify pathways and genes implicated in the development of hypertension (Yang et al., 2013 Hypertension; Cowley et al., 2014 Physiological Genomics). Results were confirmed via biological validation. In Stingo and Vannucci (2011, Bioinformatics) network priors were used to incorporate information on gene-to-gene connections into a discriminant analysis setting for subject classification based on gene expression data. Finally, in Cassese et al. (2014a,b Annals Appl Stats and Cancer Informatics) integrative models that relate genotype data to mRNAs were developed, for the selection of copy number variants that affect gene expression. In the proposed modeling framework, a measurement error model relates the gene expression levels to latent copy number states which, in turn, are related to the observed genotype data via a hidden Markov model. Selection priors explicitly incorporate dependence information between adjacent copy number states. In Trevino et al. (2016 PLoS Comp Biology) the model was used to analyze data on prostate cancer for the identification of genomic regions that explain the expression of "polarized" genes that may exert an effect on adjacent cells.

Research on Graphical Models
My work in genomic applications motivated me to expand my research interests to the topic of graphical models, as these are often used to infer biological networks based on genes, proteins or metabolites (Peterson et al., 2013 Stats & Interface). Contributions so far have been in particular to Bayesian methods for the estimation of multiple graphs, a topic that has recently attracted considerable interest. Peterson et al. (2015 J. American Stat Assoc) address the problem of inferring multiple undirected networks in situations where some of the networks may be unrelated, while others share common features. In this approach the estimation of the graph structures is linked via a Markov random field (MRF) prior, which encourages common edges. This allows us to share information between sample groups, when appropriate, as well as to obtain a measure of relative network similarity across groups. The method has been applied to protein expression data from Alzheimer's patients (Rembach et al., 2015 J. Alzheimer's Disease) and to gene expression data from COPD patients at different disease stages (Shaddox et al., 2017 Stats in Bioscience).
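
To see how an MRF prior can encourage common edges, consider one candidate edge and K sample groups, with a binary vector g recording in which groups the edge is present. The sketch below assumes a prior proportional to exp(nu * sum(g) + g' Theta g), where nu controls sparsity and the symmetric matrix Theta, with zero diagonal, measures relatedness between pairs of groups. This parameterization, the symbol names and the function mrf_edge_prior are assumptions made for illustration and should not be read as the exact specification used in the cited paper.

```python
import math
from itertools import product

def mrf_edge_prior(nu, theta):
    """Prior over edge-inclusion patterns g in {0,1}^K for a single edge.

    p(g) is proportional to exp(nu * sum(g) + g' theta g), normalized by
    enumerating all 2^K patterns (feasible only for a small number of groups).
    """
    k = len(theta)
    weights = {}
    for g in product((0, 1), repeat=k):
        quad = sum(theta[a][b] * g[a] * g[b] for a in range(k) for b in range(k))
        weights[g] = math.exp(nu * sum(g) + quad)
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Two groups: a positive off-diagonal entry links them, a zero entry does not.
related = mrf_edge_prior(nu=-1.0, theta=[[0.0, 1.5], [1.5, 0.0]])
unrelated = mrf_edge_prior(nu=-1.0, theta=[[0.0, 0.0], [0.0, 0.0]])

def marginal(prior, group):
    """Prior probability that the edge is present in the given group."""
    return sum(p for g, p in prior.items() if g[group] == 1)
```

With a positive off-diagonal entry, the prior probability that the edge appears in both groups exceeds the product of the two marginal probabilities, so shared edges are favored. With a zero entry the two groups are independent a priori, which corresponds to networks that are treated as unrelated.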

Research in Neuroimaging
Functional magnetic resonance imaging (fMRI), a noninvasive neuroimaging method that provides an indirect measure of neuronal activity by detecting blood flow changes, has experienced explosive growth in recent years. Bayesian approaches have shown great promise for the analysis of such data, as they allow a flexible modeling of spatial and temporal correlations (Zhang et al., 2015 WIREs Comp Stats). In my work I first looked at spatio-temporal models for the detection of brain voxels that activate in response to a stimulus (Jeong et al., 2013 Biometrics; Zhang et al., 2014 NeuroImage). I then extended the methodologies to studies on multiple subjects, proposing a unified Bayesian nonparametric framework that detects activated brain regions while simultaneously clustering spatially remote voxels within and across subjects via a spiked Dirichlet process prior (Zhang et al., 2016 Annals Appl Stats). Inference is carried out via a variational Bayes algorithm, to allow scalability of the methods. Other work has been in the context of imaging genetics, where structural and functional neuroimaging is applied to study subjects carrying genetic risk variants that relate to a psychiatric disorder. In Stingo et al. (2013 J. American Stat Assoc) an integrative Bayesian hierarchical mixture modeling approach is introduced, with an application to fMRI and single nucleotide polymorphism (SNP) data on schizophrenic patients and healthy controls. My recent interests in neuroimaging include the development of models for inference on brain connectivity, i.e. the study of how brain regions interact and share information with each other, a topic that also relates to graphical models. In Chiang et al. (2017 Human Brain Mapping) a multi-subject vector autoregressive (VAR) model is applied to resting-state fMRI data on epileptic patients and healthy controls, to infer connectivity networks at both the subject and group levels. The model also integrates structural imaging information into the prior model, encouraging non-zero connectivity between structurally connected regions.

Research on Bayesian Nonparametrics for Structural Bioinformatics
Some of my work has focused on structural bioinformatics and, in particular, on the important problem of protein structure prediction. Torsion angle pairs are generally used to describe the three-dimensional structure of proteins. In Lennox et al. (2009 J. American Stat Assoc) a Dirichlet process mixture model was used to estimate the distribution of the torsion angle pairs. The model properly accounts for the wrapping of angular data by incorporating von Mises distributions. This innovative modeling approach made it possible to address the question of what distributions should be used when sampling to generate new candidate models for a protein's structure (Day et al., 2010 Bioinformatics). In Lennox et al. (2010 Annals Appl Stats) a new semiparametric model was proposed for the joint distributions of angle pairs at multiple sequence positions, permitting the sharing of information across sequence positions. This strategy made it possible to model the notoriously difficult loop and turn regions of protein structures (Chavan et al., 2011 PLoS Comp Biology). The latest work has been in constructing a suitable statistical model for a new construct, based on tertiary structure, to improve the prediction of protein contacts (Li et al., 2016 Bioinformatics). This body of work has been done in collaboration with David Dahl (BYU) and Jerry Tsai, a biochemist collaborator, now at U. Pacific, CA.
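
The need to account for wrapping can be seen with a one-dimensional toy example. The sketch below draws angles from a univariate von Mises distribution centered at pi using Python's standard library, maps them to the interval (-pi, pi], and compares the ordinary arithmetic mean with the circular mean. It is only an illustration of why angular data call for circular distributions: the models discussed above work with pairs of torsion angles, and the concentration value, sample size and function name here are arbitrary assumptions.

```python
import math
import random

def circular_mean(angles):
    """Mean direction of a sample of angles (radians), via the resultant vector."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)

rng = random.Random(0)  # fixed seed so the example is reproducible

# Angles concentrated around pi; vonmisesvariate(mu, kappa) returns values in [0, 2*pi).
draws = [rng.vonmisesvariate(math.pi, 8.0) for _ in range(2000)]

# Wrap to (-pi, pi], the range in which torsion angles are often reported.
# Values just above and just below pi now sit at opposite ends of the interval.
wrapped = [math.atan2(math.sin(a), math.cos(a)) for a in draws]

naive_mean = sum(wrapped) / len(wrapped)   # pulled toward 0, far from the data
proper_mean = circular_mean(wrapped)       # close to +pi or -pi, where the data lie
```

The arithmetic mean lands near zero, on the opposite side of the circle from the observations, while the circular mean stays near the true center. A distribution defined on the circle, such as the von Mises, avoids this artifact by construction.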

Contact Info:
Department of Statistics
Rice University
6100 Main Street, MS 138
Houston, TX 77005, USA
Office: 2083 Duncan Hall
Phone: (713) 348-6132
Fax: (713) 348-5476
E-mail: marina@rice.edu