The University of Texas MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences Dissertations and Theses (Open Access)
Statistical Methods for Two Problems in Cancer Research: Analysis of RNA-seq Data from Archival Samples and Characterization of Onset of Multiple Primary Cancers
Biostatistics, Bioinformatics and Systems Biology
Doctor of Philosophy (PhD)
Jeffrey S. Morris
Paul A. Scheet
My dissertation is focused on quantitative methodology development and application for two important topics in translational and clinical cancer research.
The first topic was motivated by the challenge of applying transcriptome sequencing (RNA-seq) to formalin-fixed, paraffin-embedded (FFPE) tumor samples for reliable diagnostic development. We designed a biospecimen study to directly compare gene expression results from different protocols for preparing RNA-seq libraries from human breast cancer tissues, with randomization to fresh-frozen (FF) or FFPE conditions. To comprehensively evaluate the quality of FFPE RNA-seq data for expression profiling, we developed multiple computational methods that assess the uniformity and continuity of coverage, the variance and correlation of overall gene expression, patterns in the measurement of coding sequence expression, phenotypic patterns of gene expression, and measurements from representative multi-gene signatures. Our results showed that the principal determinant of variance across these protocols was the use of exon capture probes, followed by the preservation condition (FF versus FFPE), and then phenotypic differences between breast cancers. We also identified one protocol, based on RNase H ribosomal RNA (rRNA) depletion, that exhibited the least variability in gene expression measurements and the strongest correlation between FF and FFPE samples, and that was generally representative of the transcriptome.
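As a rough illustration of one of the assessments listed above, the correlation of overall gene expression between a matched FF and FFPE sample can be sketched as follows. This is a minimal sketch, not the dissertation's method: the function name `log2_pearson`, the pseudocount of 1, and the choice of Pearson correlation on log2-transformed values are all assumptions made for the example.

```python
import math


def log2_pearson(ff_counts, ffpe_counts, pseudocount=1.0):
    """Pearson correlation of log2(count + pseudocount) for one FF/FFPE pair.

    Both inputs are per-gene expression values in the same gene order.
    The pseudocount is an illustrative assumption to keep log2 defined at zero.
    """
    if len(ff_counts) != len(ffpe_counts) or len(ff_counts) < 2:
        raise ValueError("need two equal-length vectors with at least 2 genes")

    x = [math.log2(c + pseudocount) for c in ff_counts]
    y = [math.log2(c + pseudocount) for c in ffpe_counts]

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    # Centered sums of squares and cross-products.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)

    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)
```

Under these assumptions, a protocol whose FFPE libraries track the matched FF libraries closely would give values near 1 across sample pairs.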
In the second topic, we focused on TP53 penetrance estimation for multiple primary cancers (MPC). The study was motivated by the high proportion of MPC patients observed in Li-Fraumeni syndrome (LFS) families, for whom no MPC risk estimates have so far been available to support better clinical management of LFS. To this end, we proposed a Bayesian recurrent event model based on a non-homogeneous Poisson process to estimate a set of penetrances for MPC related to LFS. For the associated inference, we employed the familywise likelihood, which makes use of the genetic information inherited through the family. Ascertainment bias, which is inevitable in rare disease studies, was adjusted for with an inverse probability weighting scheme. We applied the proposed method to the LFS data, a family cohort collected through pediatric sarcoma patients at MD Anderson Cancer Center from 1944 to 1982. Both internal and external validation studies show that the proposed model provides reliable penetrance estimates for MPC in LFS, which, to the best of our knowledge, have not previously been reported in the LFS literature.
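To make the link between a non-homogeneous Poisson process and age-at-onset penetrance concrete, the sketch below uses the standard property that the number of events in an age interval (a, b] is Poisson with mean Λ(b) − Λ(a), where Λ is the cumulative intensity. It is an illustration under strong assumptions, not the dissertation's model: the Weibull-type cumulative intensity Λ(t) = (t / scale)^shape, the parameter names, and the function names are hypothetical, and the covariates, Bayesian inference, familywise likelihood, and ascertainment correction described above are all omitted.

```python
import math


def cumulative_intensity(age, shape, scale):
    """Hypothetical Weibull-type cumulative intensity: (age / scale) ** shape."""
    if age < 0 or shape <= 0 or scale <= 0:
        raise ValueError("age must be >= 0; shape and scale must be > 0")
    return (age / scale) ** shape


def prob_first_event_by(age, shape, scale):
    """P(at least one event by `age`) = 1 - exp(-Lambda(age)) under an NHPP."""
    return 1.0 - math.exp(-cumulative_intensity(age, shape, scale))


def prob_next_event_by(age, last_event_age, shape, scale):
    """P(a further event in (last_event_age, age]) under the same NHPP.

    Assumes the intensity is unchanged by the earlier event, which a
    realistic multiple-primary-cancer model need not assume.
    """
    if age < last_event_age:
        raise ValueError("age must be >= last_event_age")
    increment = (cumulative_intensity(age, shape, scale)
                 - cumulative_intensity(last_event_age, shape, scale))
    return 1.0 - math.exp(-increment)
```

For example, `prob_first_event_by(50, shape, scale)` plays the role of a first-primary penetrance by age 50, and `prob_next_event_by(60, 40, shape, scale)` that of a second-primary penetrance by age 60 given a first primary at age 40, for whatever illustrative parameter values are supplied.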
The research I conducted during my PhD study will be useful to translational scientists who want to obtain accurate gene expression measurements by applying RNA-seq technology to FFPE tumor tissue samples. It will also be helpful to genetic counselors and genetic epidemiologists who need high-resolution penetrance estimates for primary cancer risk assessment.
Formalin-fixation and paraffin-embedding tissue, Gene expression, Library preparation, Breast cancer tissue, Coding region enrichment, RNA sequencing, age-at-onset penetrance, familywise likelihood, multiple primary cancers, Li-Fraumeni syndrome, recurrent event model
Bioinformatics Commons, Biostatistics Commons, Computational Biology Commons, Genomics Commons, Medicine and Health Sciences Commons, Statistical Models Commons, Survival Analysis Commons