Research Interests
Dr. Vannucci is generally interested in the development of statistical models
for complex problems. Her methodological research has focused in particular on the theory and practice of
Bayesian variable selection
techniques and on the development of
wavelet-based statistical models
and their application. Her recent work has also been in the area of
graphical models.
Dr. Vannucci's research has often been motivated by real problems
that needed to be addressed with suitable statistical methods. Methodologies developed by Dr. Vannucci have found applications in chemometrics and,
more recently, in high-throughput genomics and
in neuroimaging.
Dr. Vannucci also has an interest in structural bioinformatics and,
in particular, in the important problem of
protein structure prediction.
Theory and Methods: Bayesian variable selection,
Graphical Models, Statistical Computing,
Wavelets.
Applications: Chemometrics,
Large-scale Genomic data,
Neuroimaging,
Structural Bioinformatics.
Research on Wavelet-based Statistical Modeling
I have been interested in the development of wavelet-based methods since my Ph.D. studies. In my early work I considered Bayesian shrinkage models with random wavelet coefficients, incorporating efficient algorithms for the computation of their covariances (Vannucci and Corradi, 1999 J. Royal Stat Soc, Series B). Later work focused on the development of Bayesian methods for wavelet-based modeling of functional data. My work on functional data, in particular, was the first contribution to the use of wavelet methods for dimension reduction
when multiple curves are under study. Methodologies included curve regression models that relate
a multivariate response to functional predictors (Brown et al., 2001
J. American Stat Assoc), nonparametric modeling of
hierarchical functions with a novel application to experimental data arising from carcinogen-induced
colon cancer in rodent models (Morris et al., 2003 J. American Stat Assoc), curve classification settings with binary and
multinomial responses (Vannucci et al., 2005 ChemoLab). Later on, I also extended methods to discriminant
analysis settings with Markov tree priors that map scale-location connections among wavelet
coefficients (Stingo et al., 2012 Sinica). My other work in wavelet-based methods has broadly related to the
modeling of time series data, with various applications. Contributions have included early applications
of discrete wavelet transforms to the structural analysis of biological sequences, like proteins and
genomes (Lio and Vannucci, 2000a,b Bioinformatics), methods for wavelet variance change point detection
(Gabbanini et al., 2004 J. Comp & Graph Stats), multi-resolution variance decomposition methods for the analysis of response
time in children with ADHD and healthy controls (Di Martino et al., 2008
Biological Psychiatry) and Bayesian wavelet-based
models for long memory data (Ko and Vannucci, 2006 J. Stat Plan & Inf; Ko et al.,
2009 Sinica), which I later incorporated into
novel models for fMRI data (Jeong et al., 2013 Biometrics) (see also below).
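To make the idea of wavelet shrinkage and dimension reduction concrete, the following is a minimal sketch and not the Bayesian models of the papers cited above. It assumes a Haar wavelet, a known noise level, and simple soft thresholding with the universal threshold as a stand-in for a posterior shrinkage rule. The function names (haar_dwt, haar_idwt, soft_threshold) and the simulated curve are illustrative only.

```python
import numpy as np

def haar_dwt(x):
    """Full orthonormal Haar decomposition of a signal of length 2^J."""
    approx = np.asarray(x, dtype=float)
    details = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # detail coefficients
        approx = (even + odd) / np.sqrt(2.0)         # scaling coefficients
    return approx, details  # details are ordered from fine to coarse

def haar_idwt(approx, details):
    """Invert haar_dwt, rebuilding the signal from coarse to fine."""
    approx = np.asarray(approx, dtype=float)
    for detail in reversed(details):
        out = np.empty(2 * approx.size)
        out[0::2] = (approx + detail) / np.sqrt(2.0)
        out[1::2] = (approx - detail) / np.sqrt(2.0)
        approx = out
    return approx

def soft_threshold(c, t):
    """Shrink coefficients toward zero by t, setting small ones to zero."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

# Simulated piecewise-constant curve observed with noise (illustrative data).
rng = np.random.default_rng(0)
n, sigma = 256, 0.3
signal = np.repeat([2.0, -1.0, 3.0, 0.0], n // 4)
noisy = signal + sigma * rng.standard_normal(n)

approx, details = haar_dwt(noisy)
threshold = sigma * np.sqrt(2.0 * np.log(n))  # universal threshold, sigma assumed known
shrunk = [soft_threshold(d, threshold) for d in details]
denoised = haar_idwt(approx, shrunk)

kept = sum(int(np.count_nonzero(d)) for d in shrunk)
mse_noisy = float(np.mean((noisy - signal) ** 2))
mse_denoised = float(np.mean((denoised - signal) ** 2))
print(kept, mse_noisy, mse_denoised)
```

Only a handful of coefficients survive the threshold, which is the sense in which a wavelet representation reduces the dimension of a curve before it enters a regression or classification model.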
Research on Bayesian Methods for Variable Selection
My initial work on Bayesian methods for variable selection was in multivariate regression settings,
extending stochastic search variable selection methods that use mixture prior distributions with a
spike at zero (Brown et al., 1998, 2002 J. Royal Stat Soc, Series B). The proposed methodologies apply to cases with a large
number of explanatory variables, typically many more than the number of observations. Motivating
applications were to chemometrics datasets involving near-infrared spectra
(Brown et al., 1998 J. Chemometrics).
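The stochastic search idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the cited papers: it uses a single response rather than a multivariate one, a Zellner g-prior on the included coefficients (so that the marginal likelihood is available in closed form), independent Bernoulli(w) priors on the inclusion indicators, and a Metropolis step that flips one indicator at a time. The names log_marginal and stochastic_search, and the values of w and g, are hypothetical choices made for the example.

```python
import numpy as np

def log_marginal(y, X, gamma, g):
    """Log marginal likelihood of model gamma, up to a constant (g-prior)."""
    n = y.size
    yty = y @ y
    idx = np.flatnonzero(gamma)
    if idx.size == 0:
        return -0.5 * n * np.log(yty)
    Xg = X[:, idx]
    coef = np.linalg.solve(Xg.T @ Xg, Xg.T @ y)
    fit = y @ (Xg @ coef)  # y' P_gamma y, the fitted sum of squares
    return (-0.5 * idx.size * np.log1p(g)
            - 0.5 * n * np.log(yty - g / (1.0 + g) * fit))

def stochastic_search(y, X, n_iter=4000, burn_in=1000, w=0.1, g=None, seed=0):
    """Metropolis search over inclusion vectors; returns inclusion frequencies."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    g = float(n) if g is None else g
    gamma = np.zeros(p, dtype=bool)          # start from the empty model
    log_prior_odds = np.log(w) - np.log1p(-w)
    current = log_marginal(y, X, gamma, g)
    counts = np.zeros(p)
    for it in range(n_iter):
        j = rng.integers(p)                  # pick one indicator to flip
        proposal = gamma.copy()
        proposal[j] = not proposal[j]
        cand = log_marginal(y, X, proposal, g)
        # Adding a variable multiplies the prior by w / (1 - w); removing divides.
        log_ratio = cand - current + (log_prior_odds if proposal[j] else -log_prior_odds)
        if np.log(rng.random()) < log_ratio:
            gamma, current = proposal, cand
        if it >= burn_in:
            counts += gamma
    return counts / (n_iter - burn_in)

# Simulated data with more predictors than observations (illustrative only).
rng = np.random.default_rng(42)
n, p = 50, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -3.0, 3.0]                  # only the first three predictors matter
y = X @ beta + rng.standard_normal(n)
y = y - y.mean()

incl = stochastic_search(y, X)
print(np.flatnonzero(incl > 0.5))
```

The search never fits the full model with all p predictors, so it remains usable when p exceeds n, provided the visited models stay small.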
Later contributions were in extending Bayesian techniques for variable selection to probit models
with multinomial responses (Sha et al., 2004 Biometrics) and to accelerated failure time models with censored
outcomes (Sha et al., 2006 Bioinformatics). Next, I proposed innovative ways to perform variable selection in
model-based sample clustering (Tadesse et al., 2005 J. American Stat Assoc; Kim et al.,
2006 Biometrika). Here, the problem can be
formulated either in terms of finite mixture models or of infinite mixtures of distributions, via
Dirichlet process mixtures (avoiding a reversible jump MCMC). Variable selection is performed through
a binary latent vector that indicates discriminating and non-discriminating covariates and that gets
updated via stochastic search techniques. Methods were also extended to the modeling of data with a
known substructure, such as the structure imposed by an experimental design (Swartz et al., 2008 JABES).
Other contributions to Bayesian variable selection include early explorations of spiked Dirichlet
process priors that incorporate point mass distributions in nonparametric mixture priors
(Kim et al., 2009 Bayesian Analysis). Also, in Savitsky and Vannucci (2010 JPS) and Savitsky et al. (2011 Statistical Science) I extended
variable selection methods to models that allow non-linear associations of a set of predictors to a
response, in particular incorporating Gaussian processes in a generalized linear framework. Further
developments of my work in Bayesian variable selection were motivated by interdisciplinary
applications, particularly in high-throughput genomics and neuroimaging (see below).
Research on Biomarker Discovery in Large-Scale Genomic Data
High-throughput data represent a challenge to statistical analyses because of their high-dimensionality.
Novel methodological questions are being generated and require the integration of different concepts,
methods, tools and data types. Bayesian models are a very natural and flexible framework for such
goals. My interest has been in the development of unified modeling frameworks that, though motivated
by specific experimental studies and datasets, can be readily applied to high-throughput data of
different types, and to data from different cancers and diseases. In doing this work I have benefitted
from close collaborations with several teams of life-science investigators. Initially, I applied and
extended my work in Bayesian variable selection to models for DNA microarray and mass spectrometry
data, with applications to immunologic studies on arthritis and cancer
(Sha et al., 2004 Biometrics, 2006 Bioinformatics;
Kwon et al., 2007 Cancer Informatics). Later work focused on developing integrative models that also incorporated
biological prior information specific to genomic data. In Stingo et al.
(2010 Annals Appl Stats) a regulatory network
was inferred by using a Bayesian graphical model that integrates expression levels of microRNAs,
short RNA sequences that can down-regulate the levels of mRNAs, with their potential mRNA targets.
The model also incorporates sequence/structure information, via the prior probability model. In
Stingo et al. (2011 Annals Appl Stats) information on pathway-gene relationships and gene-gene networks was incorporated
into a Bayesian model for the analysis of DNA microarray data. This model was subsequently used to
identify pathways and genes implicated in the development of hypertension
(Yang et al., 2013 Hypertension;
Cowley et al., 2014 Physiological Genomics). Results were confirmed via
biological validation. In Stingo and Vannucci (2011, Bioinformatics)
network priors were used to incorporate information on gene-to-gene connections into a discriminant
analysis setting for subject classification based on gene expression data. Finally, in
Cassese et al. (2014a,b Annals Appl Stats and Cancer Informatics) integrative models that relate genotype data to mRNAs were developed, for
the selection of copy number variants that affect gene expression. In the proposed modeling
framework, a measurement error model relates the gene expression levels to latent copy number states
which, in turn, are related to the observed genotype data via a hidden Markov model. Selection priors
explicitly incorporate dependence information between adjacent copy number states. In Trevino et al.
(2016 PLoS Comp Biology) the model was used to analyze data on prostate cancer for the identification of genomic
regions that explain the expression of "polarized" genes that may exert an effect on adjacent cells.
Research on Graphical Models
My work in genomic applications motivated me to expand my research interests to the topic of graphical
models, as these are often used to infer biological networks based on genes, proteins or metabolites
(Peterson et al., 2013 Stats & Interface). Contributions so far have been in particular to Bayesian methods for the
estimation of multiple graphs, a topic that has recently attracted considerable interest. Peterson et
al. (2015 J. American Stat Assoc) specifically address the problem of inferring multiple undirected networks
in situations where some of the networks may be unrelated, while others share common features. In this
approach the estimation of the graph structures is linked via a Markov random field (MRF) prior, which
encourages common edges. This allows us to share information between sample groups, when appropriate,
as well as to obtain a measure of relative network similarity across groups. The method has been
applied to protein expression data from Alzheimer's patients
(Rembach et al., 2015 J. Alzheimer's Disease) and to gene
expression data from COPD patients at different disease stages
(Shaddox et al., 2017 Stats in Bioscience).
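The way an MRF prior can encourage common edges may be illustrated with a small sketch. It assumes, for illustration, that the inclusion pattern g of a single edge across K groups has prior probability proportional to exp(nu * sum(g) + g' Theta g), where nu controls overall sparsity and the symmetric matrix Theta (zero diagonal) measures pairwise relatedness of the groups. The function name mrf_edge_prior and the numerical values are hypothetical, and in a full model nu and Theta would themselves receive priors and be learned from the data.

```python
import itertools
import numpy as np

def mrf_edge_prior(nu, theta):
    """Prior over the 2^K inclusion patterns of one edge across K groups.

    Assumed form: p(g) proportional to exp(nu * sum(g) + g' Theta g).
    """
    theta = np.asarray(theta, dtype=float)
    K = theta.shape[0]
    configs = np.array(list(itertools.product([0, 1], repeat=K)))
    log_w = nu * configs.sum(axis=1) + np.einsum("ci,ij,cj->c", configs, theta, configs)
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return configs, w / w.sum()

# Three groups: groups 0 and 1 are related, group 2 is unrelated to both.
nu = -2.0
theta = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])
configs, probs = mrf_edge_prior(nu, theta)

def cond_prob(target, given):
    """P(edge present in group `target` | edge present in group `given`)."""
    both = probs[(configs[:, target] == 1) & (configs[:, given] == 1)].sum()
    return both / probs[configs[:, given] == 1].sum()

p_related = cond_prob(1, 0)    # groups linked through theta
p_unrelated = cond_prob(2, 0)  # groups with zero relatedness
print(p_related, p_unrelated)
```

With a zero entry in Theta the two groups are independent a priori, so knowing that an edge is present in one group says nothing about the other; a positive entry raises the probability that the edge is shared.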
Research in Neuroimaging
Functional magnetic resonance imaging (fMRI), a noninvasive neuroimaging method that provides an
indirect measure of neuronal activity by detecting blood flow changes, has experienced explosive
growth in recent years. Bayesian approaches have shown great promise for the analysis of such data,
as they allow a flexible modeling of spatial and temporal correlations
(Zhang et al., 2015 WIREs Comp Stats). In my
work I first looked at spatio-temporal models for the detection of brain voxels that activate in
response to a stimulus (Jeong et al., 2013 Biometrics;
Zhang et al., 2014 NeuroImage). I then extended the methodologies to
studies on multiple subjects, proposing a unified Bayesian nonparametric framework that detects
activated brain regions while simultaneously clustering spatially remote voxels within and across
subjects via a spiked Dirichlet process prior (Zhang et al., 2016
Annals Appl Stats). Inference is carried out via a
variational Bayes algorithm, to allow scalability of the methods. Other work has been in the context
of imaging genetics, where structural and functional neuroimaging is applied to study subjects
carrying genetic risk variants that relate to a psychiatric disorder. In
Stingo et al. (2013 J. American Stat Assoc) an
integrative Bayesian hierarchical mixture modeling approach is introduced, with an application to
fMRI and single nucleotide polymorphisms (SNPs) data on schizophrenic patients and healthy controls.
My recent interests in neuroimaging include the development of models for inference on brain
connectivity, i.e. the study of how brain regions interact and share information with each other,
a topic that also relates to graphical models. In Chiang et al. (2017
Human Brain Mapping) a multi-subject vector
autoregressive (VAR) model is applied to resting-state fMRI data on epileptic patients and healthy
controls, to infer connectivity networks at both the subject and group levels. The model also
integrates structural imaging information into the prior model, encouraging non-zero connectivity
between structurally connected regions.
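As a rough illustration of how a VAR model encodes connectivity, the sketch below simulates a VAR(1) process for three hypothetical regions and recovers the coefficient matrix by ordinary least squares. It is a single-subject, non-Bayesian simplification made for the example: the cited work uses a multi-subject Bayesian formulation with structurally informed priors, none of which is reproduced here. The matrix A and all numerical settings are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A[i, j] != 0 means that region j at time t-1 influences region i at time t.
A = np.array([[0.5, 0.2, 0.0],
              [0.0, 0.4, 0.0],
              [0.3, 0.0, 0.3]])

T = 2000
x = np.zeros((T, 3))
for t in range(1, T):
    x[t] = A @ x[t - 1] + rng.standard_normal(3)  # VAR(1) with unit-variance noise

# Regress each time point on the previous one: present = past @ A.T + noise.
past, present = x[:-1], x[1:]
A_hat = np.linalg.lstsq(past, present, rcond=None)[0].T

print(np.round(A_hat, 2))
```

Entries of the estimated matrix that are close to zero correspond to pairs of regions with no directed lagged influence, which is the pattern that a selection prior on the coefficients would aim to identify.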
Research on Bayesian Nonparametrics for Structural Bioinformatics
Some of my work has focused on structural bioinformatics and, in particular, on the important problem
of protein structure prediction. Torsion angle pairs are generally used to describe the
three-dimensional structure of proteins. In Lennox et al. (2009
J. American Stat Assoc) a Dirichlet process mixture model was used
to estimate the distribution of the torsion angle pairs. The model properly accounts for the wrapping
of angular data by incorporating von Mises distributions. This innovative modeling approach made it
possible to address the question of what distributions should be used when sampling to generate new
candidate models for a protein's structure (Day et al., 2010 Bioinformatics). In Lennox et al. (2010 Annals Appl Stats) a new
semiparametric model was proposed for the joint distributions of angle pairs at multiple sequence
positions, permitting the sharing of information across sequence positions. This strategy made it
possible to model the notoriously difficult loop and turn regions of protein structures
(Chavan et al., 2011 PLoS Comp Biology). The latest work has focused on constructing a suitable statistical model for a new construct, based on
tertiary structure, to improve the prediction of protein contacts
(Li et al., 2016 Bioinformatics). This body of work
has been done in collaboration with David Dahl (BYU) and Jerry Tsai, a biochemist collaborator, now at
U. Pacific, CA.
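The point about the wrapping of angular data can be seen in a small sketch. It uses the univariate von Mises density, a simplification of the distributions for angle pairs used in the work described above, and the function name von_mises_pdf and the numerical values are chosen only for illustration.

```python
import numpy as np

def von_mises_pdf(theta, mu, kappa):
    """Von Mises density on the circle with mean direction mu and concentration kappa."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * np.i0(kappa))

mu, kappa = np.pi, 4.0  # angles concentrated around 180 degrees

# The density is periodic: an angle and the same angle shifted by 2*pi coincide.
a = von_mises_pdf(-np.pi + 0.1, mu, kappa)
b = von_mises_pdf(np.pi + 0.1, mu, kappa)

# It integrates to one over a full turn (Riemann sum on a uniform grid).
grid = np.linspace(-np.pi, np.pi, 1000, endpoint=False)
total = float(np.mean(von_mises_pdf(grid, mu, kappa)) * 2.0 * np.pi)

# Two angles on either side of the cut at 180 degrees.
angles = np.deg2rad([179.0, -179.0])
linear_mean = float(np.mean(angles))  # ignores wrapping
circular_mean = float(np.arctan2(np.mean(np.sin(angles)), np.mean(np.cos(angles))))

print(a, b, total, linear_mean, circular_mean)
```

Treating the two angles as points on a line places their average near 0 degrees, on the opposite side of the circle, whereas the circular mean places it at 180 degrees. A density defined on the circle avoids this artifact.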