Document Type

Dissertation (PhD)

Program Affiliation

Biostatistics, Bioinformatics and Systems Biology

Degree Name

Doctor of Philosophy (PhD)

Advisor/Committee Chair

Wenyi Wang

Committee Member

Guillermina Lozano

Committee Member

Jeffrey S. Morris

Committee Member

Jing Ning

Committee Member

Han Liang

Committee Member

Paul A. Scheet


My dissertation focuses on the development and application of quantitative methods for two important topics in translational and clinical cancer research.

The first topic was motivated by the challenge of applying transcriptome sequencing (RNA-seq) to formalin-fixed, paraffin-embedded (FFPE) tumor samples for reliable diagnostic development. We designed a biospecimen study to directly compare gene expression results from different RNA-seq library preparation protocols applied to human breast cancer tissues, with randomization to fresh-frozen (FF) or FFPE conditions. To comprehensively evaluate the quality of FFPE RNA-seq data for expression profiling, we developed multiple computational assessment methods covering the uniformity and continuity of coverage, the variance and correlation of overall gene expression, patterns of measuring coding sequence expression, phenotypic patterns of gene expression, and measurements from representative multi-gene signatures. Our results showed that the principal determinant of variance across these protocols was the use of exon capture probes, followed by the preservation condition (FF versus FFPE), and then by phenotypic differences between breast cancers. We also identified one protocol, based on RNase H ribosomal RNA (rRNA) depletion, that exhibited the least variability in gene expression measurements and the strongest correlation between FF and FFPE samples, and that was generally representative of the transcriptome.
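As a rough illustration of what "variance and correlation of overall gene expression" between preservation conditions can mean in practice, the sketch below compares one FF/FFPE pair on the log2 scale. It is a minimal example under stated assumptions, not the assessment pipeline used in the dissertation: the function name `ff_ffpe_concordance`, the pseudocount, and the toy counts are all hypothetical.

```python
import numpy as np


def ff_ffpe_concordance(ff_counts, ffpe_counts, pseudocount=1.0):
    """Compare expression from a paired FF and FFPE sample.

    Returns the Pearson correlation of log2 expression and the sample
    variance of the per-gene log2 differences (FF minus FFPE).
    The pseudocount is an assumption made here to avoid log2(0).
    """
    ff = np.log2(np.asarray(ff_counts, dtype=float) + pseudocount)
    ffpe = np.log2(np.asarray(ffpe_counts, dtype=float) + pseudocount)
    corr = np.corrcoef(ff, ffpe)[0, 1]
    diff_var = np.var(ff - ffpe, ddof=1)
    return corr, diff_var


# Hypothetical toy counts for five genes; not data from the study.
ff_toy = [0, 15, 120, 980, 33]
ffpe_toy = [1, 10, 100, 1100, 40]
corr, diff_var = ff_ffpe_concordance(ff_toy, ffpe_toy)
print(f"correlation={corr:.3f}, variance of log2 differences={diff_var:.3f}")
```

Under this kind of summary, a protocol with higher correlation and lower variance of differences across pairs would be considered more concordant between FF and FFPE material.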

In the second topic, we focused on TP53 penetrance estimation for multiple primary cancers (MPC). The study was motivated by the high proportion of MPC patients observed in Li-Fraumeni syndrome (LFS) families, for whom no MPC risk estimates have so far been available to guide clinical management. To this end, we proposed a Bayesian recurrent event model based on a non-homogeneous Poisson process to estimate a set of penetrances for MPC related to LFS. For the associated inference, we employed the familywise likelihood, which makes use of the genetic information inherited through the family. Ascertainment bias, which is inevitable in rare disease studies, was adjusted for using an inverse probability weighting scheme. We applied the proposed method to the LFS data, a family cohort collected through pediatric sarcoma patients at MD Anderson Cancer Center from 1944 to 1982. Both internal and external validation studies show that the proposed model provides reliable penetrance estimates for MPC in LFS, which, to the best of our knowledge, have not previously been reported in the LFS literature.
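To make the link between a non-homogeneous Poisson process and age-at-onset penetrance concrete, the sketch below uses the standard property that the probability of at least one event in an age interval is one minus the exponential of the negative cumulative intensity over that interval. It is only an illustration: the Weibull-type baseline, the parameter values (`shape`, `scale`, `log_hr`), and the function names are assumptions made here, not the model specification or the estimates from the dissertation, and the Bayesian fitting, familywise likelihood, and ascertainment correction are not shown.

```python
import math


def cumulative_intensity(age, carrier, shape=2.5, scale=80.0, log_hr=1.5):
    """Cumulative intensity Lambda(age) of a hypothetical process.

    Assumes a Weibull-type baseline (age / scale) ** shape multiplied by
    exp(log_hr * carrier), where carrier is 1 for mutation carriers and
    0 otherwise. All parameter values are illustrative placeholders.
    """
    return (age / scale) ** shape * math.exp(log_hr * carrier)


def first_primary_penetrance(age, carrier):
    """P(at least one primary cancer by `age`) = 1 - exp(-Lambda(age))."""
    return 1.0 - math.exp(-cumulative_intensity(age, carrier))


def next_primary_penetrance(age, prior_age, carrier):
    """P(a further primary cancer by `age`, given the previous one at `prior_age`).

    For a Poisson process this depends only on the increment
    Lambda(age) - Lambda(prior_age).
    """
    increment = (cumulative_intensity(age, carrier)
                 - cumulative_intensity(prior_age, carrier))
    return 1.0 - math.exp(-increment)


for status, label in ((1, "carrier"), (0, "non-carrier")):
    p1 = first_primary_penetrance(50, status)
    p2 = next_primary_penetrance(50, 30, status)
    print(f"{label}: first primary by 50 = {p1:.3f}, "
          f"second by 50 given first at 30 = {p2:.3f}")
```

In this simplified form the intensity depends only on age and carrier status; a fuller model could let it also depend on covariates such as sex or on the history of previous cancers.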

The research I conducted during my PhD study will be useful to translational scientists who want to obtain accurate gene expression measurements by applying RNA-seq technology to FFPE tumor tissue samples. It will also be helpful to genetic counselors and genetic epidemiologists who need high-resolution penetrance estimates for primary cancer risk assessment.


Formalin-fixation and paraffin-embedding tissue, Gene expression, Library preparation, Breast cancer tissue, Coding region enrichment, RNA sequencing, age-at-onset penetrance, familywise likelihood, multiple primary cancers, Li-Fraumeni syndrome, recurrent event model


