Author ORCID Identifier
0000-0002-3935-8865
Date of Graduation
5-2025
Document Type
Dissertation (PhD)
Program Affiliation
Quantitative Sciences
Degree Name
Doctor of Philosophy (PhD)
Advisor/Committee Chair
Peng Wei, Ph.D.
Committee Member
Ryan Sun, Ph.D.
Committee Member
Christine B. Peterson, Ph.D.
Committee Member
Ken Chen, Ph.D.
Committee Member
Gaiane M. Rauch, M.D., Ph.D.
Committee Member
Jingfei Ma, Ph.D.
Abstract
Mediation analysis is a widely used statistical method for examining how molecular traits, such as gene or protein expression, act as intermediaries linking an exposure to a health outcome. For example, it can help explain how smoking affects disease risk through molecular changes. Rapid progress in high-throughput omics profiling technologies and the growth of large-scale epidemiology consortia, such as the Trans-Omics for Precision Medicine (TOPMed) program from the National Heart, Lung, and Blood Institute (NHLBI) and the UK Biobank, have produced an extensive accumulation of genomic data for biomedical research. At the same time, data at this scale pose significant methodological challenges, including computational inefficiency, inter-study heterogeneity, and unmeasured confounding. This dissertation addresses these challenges through three methodological innovations designed to advance high-dimensional mediation analysis for omics mediators. First, a computationally efficient two-stage framework using cross-fitting is introduced for the variance-based R-squared total mediation effect measure, which is specifically developed for high-dimensional omics mediators. The method applies variable selection to identify true mediators and ordinary least squares regression for estimation. A Wald-type confidence interval is then constructed using a newly derived closed-form asymptotic distribution, eliminating the need for resampling techniques such as bootstrapping. The proposed method achieves coverage probability comparable to existing methods while significantly improving computational efficiency. Next, a novel meta-analysis framework is developed to estimate the R-squared-based total mediation effects of high-dimensional mediators while accounting for inter-study heterogeneity. This framework relies only on summary statistics from individual studies within large-scale consortia or biobanks.
We show that using summary statistics alone achieves promising coverage probability and bias comparable to that of individual-level data analysis, which requires more computational and financial resources. Finally, a multivariate Mendelian randomization (MR) framework is introduced for estimating R-squared-based causal mediation effects in high-dimensional multi-omics settings. This method leverages expression quantitative trait loci (eQTLs) as instrumental variables to reduce bias caused by unmeasured confounders. Extensive simulations show that the MR-based method outperforms the standard linear regression-based method in the presence of unmeasured confounders. Additionally, these methods are applied to major studies within the TOPMed program, including the Framingham Heart Study, the Multi-Ethnic Study of Atherosclerosis, and the Women's Health Initiative. These studies include over 7,000 participants from diverse populations with multi-omics data, including transcriptomics and proteomics. The proposed methods are used to identify gene and protein expression as mediators of age-, sex-, and obesity-related effects on cardiovascular traits (e.g., high-density lipoprotein (HDL) cholesterol and systolic blood pressure). To further investigate the biological mechanisms, downstream analyses are conducted, including pathway enrichment analysis, functional annotation, canonical correlation analysis, and causal direction analysis. These analyses provide deeper insight into the findings and help validate their biological plausibility and robustness. In summary, this dissertation provides a cohesive methodological framework for high-dimensional multi-omics mediation analysis, elucidating molecular mechanisms underlying complex diseases and offering foundational insights for precision medicine and therapeutic target discovery.
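The two-stage cross-fitting idea described in the abstract (select mediators on one part of the sample, estimate by ordinary least squares on the other, then swap) can be illustrated with a minimal sketch. Several pieces here are assumptions rather than the dissertation's procedure: the R-squared mediation measure is taken to be the form common in the literature, R²(Y~M) + R²(Y~X) - R²(Y~M+X); the selection step is a simple marginal-correlation screen standing in for whatever variable selection method the dissertation uses; and all function names and simulation settings are hypothetical. The closed-form Wald-type interval is not reproduced.

```python
import numpy as np

def r_squared(y, design):
    """R^2 from an OLS fit of y on the columns of `design` plus an intercept."""
    A = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def screen(M, y, keep):
    """Hypothetical selection step: keep the `keep` mediators with the
    largest absolute marginal correlation with y (a stand-in only)."""
    yc = y - y.mean()
    Mc = M - M.mean(axis=0)
    score = np.abs(Mc.T @ yc) / (np.linalg.norm(Mc, axis=0) * np.linalg.norm(yc))
    return np.argsort(score)[-keep:]

def r2_mediation(x, M, y):
    """Assumed measure: R^2(Y~M) + R^2(Y~X) - R^2(Y~M+X)."""
    return (r_squared(y, M) + r_squared(y, x)
            - r_squared(y, np.column_stack([M, x])))

def cross_fit_r2_med(x, M, y, keep=10, seed=0):
    """Select on one half, estimate by OLS on the other, swap, and average."""
    rng = np.random.default_rng(seed)
    halves = np.array_split(rng.permutation(len(y)), 2)
    estimates = []
    for a, b in [(0, 1), (1, 0)]:
        sel = screen(M[halves[a]], y[halves[a]], keep)   # stage 1: selection
        est = halves[b]                                  # stage 2: estimation
        estimates.append(r2_mediation(x[est], M[est][:, sel], y[est]))
    return float(np.mean(estimates))

# Simulated data (illustrative settings): 5 true mediators among 50 candidates.
rng = np.random.default_rng(1)
n, p, p_true = 400, 50, 5
x = rng.normal(size=n)
M = rng.normal(size=(n, p))
M[:, :p_true] += 0.5 * x[:, None]                 # exposure -> mediators
y = 0.3 * x + 0.4 * M[:, :p_true].sum(axis=1) + rng.normal(size=n)

est = cross_fit_r2_med(x, M, y)
print(f"cross-fitted R-squared mediation estimate: {est:.3f}")
```

Separating the selection and estimation samples keeps the OLS step from reusing the data that chose the mediators, which is the usual motivation for cross-fitting.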
Keywords
Mediation analysis, high-dimensional analysis, multi-omics, causal inference, meta-analysis, Mendelian randomization