Statistical methods for genetic phenotype prediction and DNA sequence sample contamination detection

Yue-Ming Chen, The University of Texas School of Public Health


High-throughput genomic technologies offer powerful ways to identify genetic determinants of complex traits and diseases. The caveat is that when genotype data are of poor quality, the results of genetic analysis can be misleading. In this dissertation, we address two statistical issues that arise in genetic epidemiology using high-throughput genomic data: genetic prediction of complex phenotypes and data quality control.

One of the ultimate public health applications of genetic epidemiology is to better predict health outcomes so that prevention or intervention can be provided before serious disease develops. With evolving technologies in computing and biology, biological knowledge at the molecular level is accumulating rapidly. We treat biological knowledge from other domains as important prior knowledge in constructing prediction models. We propose and apply a weighting scheme to existing polygenic modeling techniques to leverage external biological knowledge.

Although much work has shown the potential of genotypes as risk predictors for complex diseases, assessing disease risk on an absolute scale with combined molecular and clinical data is more meaningful to clinicians, because absolute risk conveys the actual size of an individual's risk. Most existing studies that assess the added predictive value of genotypes on an absolute risk scale for a specific disease focus on single nucleotide polymorphisms (SNPs) with significant SNP-trait associations reported in previous genome-wide association studies (GWAS). Unlike those studies, we propose and implement a polygenic modeling-based algorithm for selecting predictive SNPs from genome-wide genotypes to achieve optimal discriminatory accuracy. The value of adding SNPs to an absolute risk model is assessed in terms of discriminatory ability and classification.

Detecting DNA sample contamination is a crucial quality control step in studies using next-generation sequencing (NGS). We present an alternative statistical approach to this problem that takes patterns of correlation between genetic variants into account. Our model is essentially an application of low-coverage linkage disequilibrium (LD)-based imputation to the particularly challenging task of identifying a rare source of DNA sequence in a sample containing sequences from two human genomes.
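
As a rough illustration of the weighting idea mentioned in the abstract (applying external biological knowledge as weights within a polygenic model), the sketch below computes a weighted additive polygenic score. It is not the dissertation's algorithm: the function name `weighted_polygenic_score`, the multiplicative form of the weighting, and all numeric values are assumptions chosen only to make the idea concrete.

```python
def weighted_polygenic_score(genotypes, effect_sizes, prior_weights):
    """Return one additive score per individual.

    genotypes:     list of rows, one per individual, each holding allele
                   counts (0, 1, or 2) for every SNP
    effect_sizes:  per-SNP effect estimates (e.g. from a training set)
    prior_weights: hypothetical per-SNP weights standing in for external
                   biological knowledge (assumed multiplicative here)
    """
    if len(effect_sizes) != len(prior_weights):
        raise ValueError("effect_sizes and prior_weights must have equal length")

    # Scale each SNP effect by its prior weight before summing.
    weighted_effects = [b * w for b, w in zip(effect_sizes, prior_weights)]

    scores = []
    for row in genotypes:
        if len(row) != len(weighted_effects):
            raise ValueError("each genotype row must have one entry per SNP")
        scores.append(sum(g * e for g, e in zip(row, weighted_effects)))
    return scores


# Toy data: two individuals, three SNPs (values are made up).
example_genotypes = [[0, 1, 2], [2, 0, 1]]
example_effects = [0.1, -0.2, 0.3]
example_weights = [1.0, 0.5, 2.0]
example_scores = weighted_polygenic_score(
    example_genotypes, example_effects, example_weights
)
```

Setting every prior weight to 1.0 recovers an ordinary unweighted polygenic score, which makes the contribution of the external knowledge easy to isolate when comparing models.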

Recommended Citation

Chen, Yue-Ming, "Statistical methods for genetic phenotype prediction and DNA sequence sample contamination detection" (2016). Texas Medical Center Dissertations (via ProQuest). AAI10183292.