Statistical Methods for Integrated Analysis of High-Throughput Biomedical Data



Abstract:

Technological advances have led to a rapid proliferation of high-throughput "omics" data in medicine that hold the key to clinically effective personalized medicine. To realize this goal, statistical and computational tools to mine these data and discover biomarkers, drug targets, disrupted disease networks, and disease sub-types are urgently needed. There are, however, two primary factors that make the development of such statistical tools challenging. First, high-throughput genomic technologies produce heterogeneous data, including continuous data (microarrays, methylation arrays), count data (RNA-sequencing), and binary/categorical data (SNPs, CNV). These varied data sets do not always satisfy the distributional assumptions imposed by standard high-dimensional statistical models. Second, in order for scientists to leverage all of their data and understand the complete molecular basis of disease, these varied omics data sets need to be combined into a single multivariate statistical model. This proposal seeks to address these two issues with a new statistical framework for integrated analysis of multiple sets of high-dimensional data measured on the same group of subjects. The key statistical approach uses the theory of exponential family distributions to generalize two foundational high-dimensional statistical frameworks, principal components analysis (PCA) and graphical models, so as to jointly analyze transcriptional, epigenomic, and functional genomics data.

This research will be applied to high-throughput cancer genomics data and lead to new methods to (a) discover molecular cancer sub-types along with their genomic signatures and (b) build a holistic network model of disease. By leveraging information across all the different types of genomic biomarkers available, the proposed methods will have the potential to make scientific discoveries critical for personalized medicine. The proposed work will also be broadly applicable to integrating multiple sets of "omics" data, including genomics, proteomics, metabolomics, and imaging. Beyond medicine, the theoretical framework and statistical methods will make significant advances in the theory of exponential families, statistical learning, and the emerging field of integrative analysis, and will have broad applicability in other disciplines such as engineering and security. All results will be disseminated through publications, conferences, and open-source software; this research will also provide training and educational opportunities for doctoral and postdoctoral scholars.


PIs:

  • Genevera I. Allen (PI), Departments of Statistics and Electrical and Computer Engineering, Rice University, Department of Pediatrics-Neurology, Baylor College of Medicine, & Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital.
  • Zhandong Liu (Co-PI), Department of Pediatrics-Neurology, Baylor College of Medicine, & Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital.
  • Pradeep Ravikumar (Co-PI), Department of Computer Science, University of Texas, Austin.

Collaborators:

  • Matthew Anderson (Co-I), Departments of Obstetrics and Gynecology & Pathology and Immunology, Baylor College of Medicine.
  • Lionel Chow (Collaborator), Department of Pediatrics, University of Cincinnati.

Students & Postdocs:

  • Eunho Yang, Department of Computer Science, UT Austin.
  • Yulia Baker, Department of Statistics, Rice University.
  • John Nagorski, Department of Statistics, Rice University.
  • Ying-Wooi Wan, Neurological Research Institute, Baylor College of Medicine.

Publications:

  • E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu, "On Poisson Graphical Models", (To Appear) In Advances in Neural Information Processing Systems (NIPS), 2013.

  • E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu, "Conditional Random Fields via Univariate Exponential Families", (To Appear) In Advances in Neural Information Processing Systems (NIPS), 2013.

  • W. Y. Wan, J. Nagorski, G. I. Allen, Z. Li, and Z. Liu, "Identifying cancer biomarkers through a network regularized Cox model", (To Appear) In IEEE International Workshop on Genomic Signal Processing and Statistics, 2013.

  • W. Zhang, W. Y. Wan, G. I. Allen, K. Pang, M. L. Anderson, and Z. Liu, "Molecular pathway identification using biological network-regularized logistic models", (To Appear) BMC Genomics, 2013.

  • G. I. Allen and Zhandong Liu, "A Local Poisson Graphical Model for Inferring Genetic Networks from Next Generation Sequencing Data", IEEE Transactions on NanoBioscience, 12:3, 1-10, 2013. [pdf]

  • E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu, "On Graphical Models via Univariate Exponential Family Distributions", arXiv:1301.4183, 2013. [pdf]

  • G. I. Allen and Zhandong Liu, "A Log-Linear Graphical Model for Inferring Genetic Networks from High-Throughput Sequencing Data", In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2012. [pdf]

  • E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu, "Graphical Models via Generalized Linear Models", In Neural Information Processing Systems (NIPS), oral presentation, 2012. [pdf]

Funding:

  • National Science Foundation DMS-1264058.
  • This research is funded in part through support from the Ken Kennedy Institute for Information Technology at Rice University under the Collaborative Advances in Biomedical Computing 2011 seed funding program, supported by the John and Ann Doerr Fund for Computational Biomedicine.