General statistical framework for disease risk prediction by genetic variants, gene expression and image
Abstract
Faster and more economical next-generation sequencing technologies will generate unprecedentedly massive, high-dimensional data on genomic and epigenomic variation. Medical records will include information on sequenced genomes in the near future. Efficiently extracting biomarkers for risk prediction and treatment selection from millions or tens of millions of genomic variants poses a significant challenge. Traditional paradigms for identifying variants of clinical validity involve testing the variants for association with the phenotype. However, even genetic variants with statistically significant associations may not be useful for diagnosis or for predicting the response of a disease to treatment. An alternative to association studies for finding genetic variants of predictive utility is to search systematically for variants that contain sufficient information to predict phenotype. To achieve this, we introduce the concepts of sufficient dimension reduction (SDR) and the coordinate hypothesis, which project the original high-dimensional data onto very low-dimensional spaces while preserving all information on response phenotypes. We then formulate the problem of discovering clinically significant genetic variants as a sparse SDR problem and develop algorithms that can select significant genetic variants from millions of predictors with the aid of a split-and-conquer approach. The sparse SDR problem is in turn formulated as a sparse optimal scoring problem, but with penalties that can remove entire row vectors from the basis matrix. To speed up computation, we solve the sparse optimal scoring problem with the alternating direction method of multipliers (ADMM), which can easily be implemented in parallel. To illustrate the proposed method, we applied it to genome-wide association study (GWAS) datasets on rheumatoid arthritis from the North American Rheumatoid Arthritis Consortium (NARAC) and on psoriasis from the Genetic Association Information Network (GAIN).
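As a rough illustration of how a row-removing penalty and the alternating direction method of multipliers fit together, the following minimal sketch solves a simplified stand-in problem: a multi-response least-squares fit with a group-lasso penalty on the rows of the coefficient matrix. It is not the dissertation's algorithm. It assumes the scored responses Y are already fixed (full optimal scoring also updates the scores), and the function names, the 1/(2n) scaling, and the parameters lam, rho and n_iter are hypothetical choices made for this example.

```python
import numpy as np

def row_soft_threshold(A, t):
    """Shrink each row of A by t in Euclidean norm; rows with norm <= t become exactly zero."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return A * scale

def admm_row_sparse(X, Y, lam, rho=1.0, n_iter=200):
    """ADMM sketch for: min_B (1/(2n)) * ||Y - X B||_F^2 + lam * sum_j ||B[j, :]||_2.

    Assumption for this example: Y holds fixed scored responses.
    A zero row in the result means predictor j is dropped entirely.
    """
    n, p = X.shape
    k = Y.shape[1]
    Z = np.zeros((p, k))  # sparse copy of B
    U = np.zeros((p, k))  # scaled dual variable
    A = X.T @ X / n + rho * np.eye(p)
    XtY = X.T @ Y / n
    for _ in range(n_iter):
        # B-update: ridge-like linear system (a cached factorization would be faster)
        B = np.linalg.solve(A, XtY + rho * (Z - U))
        # Z-update: proximal step of the row-wise group penalty
        Z = row_soft_threshold(B + U, lam / rho)
        # Dual update on the constraint B = Z
        U += B - Z
    return Z
```

In this sketch the Z-update acts on each row independently, which is one reason such schemes are commonly described as easy to parallelize.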
During the past decade, RNA sequencing (RNA-Seq), which uses deep-sequencing technologies, has become a popular platform for gene expression profiling in whole-genome studies. RNA-Seq data pose the same challenge as genome-wide association studies, because each sample has more than 10 million columns of reads. We convert the RNA-Seq reads to functional principal component analysis (FPCA) scores, which reduces the dimension to 10,000-50,000 columns, and then use the sparse SDR to search for significant genomic variants for disease. We applied our method to kidney renal clear-cell carcinoma (KIRC) RNA-Seq data from The Cancer Genome Atlas (TCGA) project.
Analysis of histologic images is a powerful new approach for revealing variability among individuals and mechanisms of disease development. A histology image is usually large, containing about 10^9 pixels. To reduce the dimension and computational complexity, we extended one-dimensional FPCA to two-dimensional FPCA to extract the significant components. We applied this algorithm to KIRC histology image data from TCGA.
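The dimension-reduction step based on FPCA scores can be sketched, for the one-dimensional case only, as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: curves are assumed to be observed on a common, equally spaced grid, eigenfunctions are approximated from a singular value decomposition of the centered data, and the function name fpca_scores and its arguments are hypothetical.

```python
import numpy as np

def fpca_scores(curves, grid, n_components):
    """Discretized FPCA sketch.

    curves: array of shape (n_samples, n_points), one curve per row,
            sampled on the equally spaced points in `grid` (an assumption).
    Returns (scores, eigfuns, mean), where scores has shape
    (n_samples, n_components).
    """
    h = grid[1] - grid[0]                  # grid spacing for Riemann sums
    mean = curves.mean(axis=0)
    centered = curves - mean
    # Right singular vectors approximate the eigenfunctions on the grid
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Rescale so that sum(eigfun**2) * h == 1 (unit L2 norm)
    eigfuns = vt[:n_components] / np.sqrt(h)
    # Score = approximate integral of (centered curve * eigenfunction)
    scores = centered @ eigfuns.T * h
    return scores, eigfuns, mean
```

Each curve is then represented by its few scores in place of all of its grid values, and `mean + scores @ eigfuns` gives the corresponding low-rank reconstruction.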
Subject Area
Biostatistics
Recommended Citation
Ma, Long, "General statistical framework for disease risk prediction by genetic variants, gene expression and image" (2015). Texas Medical Center Dissertations (via ProQuest). AAI3720088.
https://digitalcommons.library.tmc.edu/dissertations/AAI3720088