Doctor of Philosophy (PhD)
Integrative genomic data analysis is a powerful tool for studying the complex biological processes behind a disease. Statistical methods can model the interrelationships among the gene activities involved by jointly analyzing multiple types of genomic data from different platforms (vertical integration), or improve the power of a study by aggregating the same type of genomic data across studies (horizontal integration). In this dissertation, we propose statistical methods and strategies for integrating multi-omics data in association analyses of disease phenotypes, with an emphasis on cancer applications.
We develop a new strategy based on horizontal integration that incorporates publicly available datasets into the study cohort to improve the statistical power of a large-p, small-n case-control epigenome-wide association study (EWAS) of pancreatic cancer (PanC). We demonstrate the effectiveness of this strategy through the detection and validation of methylation changes associated with PanC risk. We further identify functional consequences of these methylation changes on PanC risk and their causal relationships.
The rest of this dissertation focuses on methodological developments for vertical integration. It is common to perform association analysis of an outcome with each genomic data type separately and then combine the results ad hoc, which leads to a loss of statistical power and an uncontrolled false discovery rate (FDR). We introduce a multivariate mixture model approach, “IMIX”, that models the inter-data-type correlation structures in the joint analysis of multi-omics data associated with a disease phenotype. Applications to The Cancer Genome Atlas data provide novel insights into the genes and mechanisms associated with the luminal and basal subtypes of bladder cancer and with the prognosis of pancreatic cancer. We further incorporate spatial information into the multi-omics integration framework by proposing “spatial IMIX”, which uses a spatial mixed model to characterize spatial correlations between samples. An application to a geographically annotated tissue area of bladder cancer reveals cancer-initiating gene activities. Simulation studies demonstrate the statistical power gains of our methods over existing methods. Our methods feature model selection, FDR control, and computational efficiency.
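The general idea of mixture-model-based gene selection with FDR control can be illustrated with a minimal sketch. This is not the IMIX implementation: it assumes a generic two-component bivariate Gaussian mixture over per-gene z-scores from two data types, with all parameter values fixed by hand for illustration (a real analysis would estimate them from data), and it applies the standard local-FDR averaging rule to choose the rejection set. The function names `bvn_pdf`, `local_fdr`, and `select_at_fdr` are hypothetical.

```python
import math

def bvn_pdf(x, y, mean, sd, rho):
    """Density of a bivariate normal with per-axis mean/sd and correlation rho."""
    zx = (x - mean[0]) / sd[0]
    zy = (y - mean[1]) / sd[1]
    q = (zx * zx - 2 * rho * zx * zy + zy * zy) / (1 - rho * rho)
    norm = 2 * math.pi * sd[0] * sd[1] * math.sqrt(1 - rho * rho)
    return math.exp(-0.5 * q) / norm

def local_fdr(z, components, weights, null_index=0):
    """Posterior probability that each gene belongs to the null component."""
    out = []
    for x, y in z:
        dens = [w * bvn_pdf(x, y, *c) for w, c in zip(weights, components)]
        out.append(dens[null_index] / sum(dens))
    return out

def select_at_fdr(lfdr, alpha):
    """Reject the largest set of genes whose mean local FDR is <= alpha."""
    order = sorted(range(len(lfdr)), key=lambda i: lfdr[i])
    selected, running = [], 0.0
    for k, i in enumerate(order, start=1):
        running += lfdr[i]
        if running / k > alpha:
            break
        selected.append(i)
    return sorted(selected)

# Hypothetical mixture: (mean, sd, rho) per component; values are assumptions.
components = [
    ((0.0, 0.0), (1.0, 1.0), 0.0),  # null in both data types
    ((3.0, 3.0), (1.0, 1.0), 0.5),  # associated in both, correlated across types
]
weights = [0.9, 0.1]

# Made-up z-scores for four genes: (data type 1, data type 2).
z = [(0.1, -0.2), (3.2, 2.9), (0.5, 0.3), (4.0, 3.5)]

lfdr = local_fdr(z, components, weights)
hits = select_at_fdr(lfdr, alpha=0.05)
print(hits)  # indices of genes selected at the 5% level
```

In this toy input, the two genes with large z-scores in both data types have local FDR values near zero and are selected, while the two genes near the origin are not. Modeling the correlation `rho` in the non-null component is what lets evidence from one data type inform the other, rather than thresholding each data type separately.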
The methodological frameworks established in this dissertation bring novel insights into integrative genomics research and provide new directions for improving statistical power in the discovery of cancer-associated genes. These advances enhance understanding of the genetic mechanisms and biological processes behind cancer development and progression, which can ultimately lead to better preventive and therapeutic interventions.
Mixture Model, Integrative Genomics, Spatially Correlated Data, Cancer, EWAS