Statistical Methods for Gene-Environment Interactions and High-Dimensional Mediation Analysis

Tianzhong Yang, The University of Texas School of Public Health


As whole-exome and whole-genome sequencing data become increasingly available in genetic epidemiology research consortia, there is growing interest in testing interactions between rare genetic variants and environmental exposures that modify the risk of complex diseases. However, testing rare-variant gene-by-environment interactions (GxE) is more challenging than testing genetic main effects, both because the genetic main effects are difficult to estimate correctly under the null hypothesis of no GxE effects and because neutral variants are present. In response, we have developed a family of powerful, data-adaptive GxE tests, called “aGE” tests, within the framework of the adaptive powered score test, which was originally proposed for testing genetic main effects. We show that aGE tests control the type I error rate in the presence of a large number of neutral variants or a nonlinear environmental main effect, and that their power is more resilient to the inclusion of neutral variants than that of existing methods. To further increase the power of GxE testing and improve the understanding of the underlying molecular mechanisms, we propose incorporating genome functional information into the aGE test; the resulting weighted test can accommodate multiple sets of external weights, for example, weights derived from gene expression in different tissues. We show that this test controls the type I error rate, and that its power is resilient to the inclusion of non-causal variants and non-informative external weights. We demonstrate the performance of the proposed aGE and weighted aGE tests using Pancreatic Cancer Case-Control Consortium data.
Environmental exposures can regulate intermediate molecular phenotypes through different mechanisms and thus lead to different health outcomes. It is of significant scientific interest to explore the relationship between environmental exposures and traits beyond association and to unravel the role of potentially high-dimensional intermediate phenotypes. Mediation analysis is an important tool for investigating such relationships. However, a good overall measure of the mediation effect is lacking, especially in the high-dimensional setting. Here we propose extending an R-squared (Rsq) effect size measure, originally proposed for the single-mediator setting, to the multiple-mediator and high-dimensional settings. We show that our new measure outperforms several frequently used mediation measures, including the product, proportion, and ratio measures, in terms of bias and variance. To mitigate potential bias induced by non-mediators, we further examine two variable selection procedures, iterative sure independence screening and false discovery rate control, to exclude non-mediators, and we evaluate the consistency of the estimation procedures. Lastly, we apply our Rsq measure to quantify the amount of variation in systolic blood pressure and lung function explained by gene expression in the Framingham Heart Study, and we introduce a resampling-based confidence interval for this Rsq measure.
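The abstract does not state the formula for the Rsq measure. As a rough illustration only, the sketch below assumes one commonly cited form of the single-mediator R-squared measure, Rsq_med = Rsq(Y~M) + Rsq(Y~X) - Rsq(Y~X+M), and applies it to a small set of mediators using ordinary least squares. The function names, the simulated data, and the use of plain OLS are all assumptions made for this example; they are not the estimation procedure of the dissertation, which targets the high-dimensional setting.

```python
import numpy as np


def r_squared(y, X):
    """R-squared from an OLS fit of y on X, with an intercept added."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)


def rsq_mediated(y, x, M):
    """Assumed form: Rsq(Y~M) + Rsq(Y~X) - Rsq(Y~X+M).

    Hypothetical helper for illustration; low-dimensional OLS only.
    """
    M = np.asarray(M, dtype=float)
    if M.ndim == 1:
        M = M[:, None]
    r2_ym = r_squared(y, M)
    r2_yx = r_squared(y, x)
    r2_yxm = r_squared(y, np.column_stack([x, M]))
    return r2_ym + r2_yx - r2_yxm


# Simulated example (illustrative values, not real data):
# x affects y only through the first two of three candidate mediators.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
alpha = np.array([0.5, 0.5, 0.0])   # exposure -> mediator effects
beta = np.array([0.5, 0.5, 0.0])    # mediator -> outcome effects
M = x[:, None] * alpha + rng.normal(size=(n, 3))
y = M @ beta + rng.normal(size=n)   # no direct effect of x on y

r2_yx = r_squared(y, x)
estimate = rsq_mediated(y, x, M)
print(f"Rsq(Y~X) = {r2_yx:.3f}, Rsq mediated = {estimate:.3f}")
```

In this simulation the exposure acts on the outcome entirely through the mediators, so the mediated Rsq should be close to the total variance in the outcome explained by the exposure. With many candidate mediators relative to the sample size, OLS is no longer adequate, which is the setting the variable selection step in the abstract is meant to address.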

Recommended Citation

Yang, Tianzhong, "Statistical Methods for Gene-Environment Interactions and High-Dimensional Mediation Analysis" (2018). Texas Medical Center Dissertations (via ProQuest). AAI10931143.