Date of Graduation


Document Type

Dissertation (PhD)

Program Affiliation

Biostatistics, Bioinformatics and Systems Biology

Degree Name

Doctor of Philosophy (PhD)

Advisor/Committee Chair

Ken Chen

Committee Member

Keith Baggerly

Committee Member

Roel Verhaak

Committee Member

Han Liang

Committee Member

Marcos Estecio


Clinical sequencing has been recognized as an effective approach for enhancing the accuracy and efficiency of cancer patient management, and thereby for achieving the goals of personalized therapy. However, the accuracy of large-scale sequencing data in the clinic is constrained at several stages, including the detection, annotation, and interpretation of the variants observed in clinical sequencing data. In my Ph.D. thesis work, I investigated how to apply high-dimensional -omics data comprehensively and efficiently to enhance the capability of precision cancer medicine. Following this motivation, my dissertation focuses on two important topics in translational genomics.

1) Developing a computational approach to resolve ambiguities in existing clinical genomic annotations and to facilitate correct diagnostic and treatment decisions. I have developed a multi-level variant annotator, TransVar, to perform precise annotation at the genomic, mRNA, and protein levels. TransVar implements three main functions: (i) it performs an innovative “reverse annotation” function, which identifies the genomic variants that can be translated into a given protein variant through alternative splicing. This function significantly improves the accuracy of genomic testing in clinics and of functional validation in genomic laboratories; (ii) it performs “equivalence annotation”, which identifies the protein variants that share a genomic origin with a given protein variant. This function resolves annotation inconsistencies among variants imported from different data sources, and is crucial for precise mutation biomarker identification and functional prediction; (iii) it improves “forward annotation” (i.e., translation of genomic variants to protein variants) over existing annotators by implementing the Human Genome Variation Society (HGVS) nomenclature more rigorously. Our study also illustrates how annotation ambiguity varies among transcript databases and among mutation types. TransVar standardizes mutation annotation and enables precise characterization of genomic variants in the context of functional genomic studies and clinical decision support, and will significantly advance genomic medicine.
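The idea behind reverse annotation can be illustrated with a minimal sketch. It is not TransVar's implementation: it assumes a single reference codon is already known and only enumerates single-nucleotide substitutions under the standard genetic code, ignoring transcripts, strand, and splicing. The function name `reverse_annotate` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_annotate(ref_codon, alt_aa):
    """List single-nucleotide changes that turn ref_codon into a codon for alt_aa.

    Hypothetical helper: each hit is (position in codon, ref base, alt base,
    alt codon), with positions numbered from 1.
    """
    hits = []
    for pos in range(3):
        for base in BASES:
            if base == ref_codon[pos]:
                continue  # not a substitution
            alt_codon = ref_codon[:pos] + base + ref_codon[pos + 1:]
            if CODON_TABLE[alt_codon] == alt_aa:
                hits.append((pos + 1, ref_codon[pos], base, alt_codon))
    return hits


# One protein-level change (M to I) maps back to three distinct nucleotide changes.
for hit in reverse_annotate("ATG", "I"):
    print(hit)
```

Even in this simplified setting a single protein variant can correspond to several nucleotide variants, which is the ambiguity that a reverse annotator has to enumerate across all transcripts of a gene.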

2) Developing a statistical framework to precisely identify hotspot mutations and investigate their functional impact on tumorigenesis and therapeutic drug response using large-scale -omics data. I have proposed a statistical model that utilizes characteristics of genomic data to nominate 702 cancer type-specific hotspot mutations in 549 genes. It models background mutation rate variations among different genes, mutation subtypes, and di-nucleotide sequence contexts, and effectively identifies hotspots that have more recurrent mutations than expected. We then investigate the mutational signatures represented by the hotspot mutations and find that they vary from one tumor type to another, suggesting that distinct positive selection acts on mutations during the progression of different cancers. In addition, we build an integrative statistical framework that uses transcriptomics, proteomics, and pharmacogenomics data to investigate the diverse functions of each hotspot mutation under different disease and biological contexts and to associate mutations with their effects on RNA/protein expression, pathway activity, and drug sensitivity. We not only validate diverse functions of well-known hotspot mutations in different contexts, but also identify some novel hotspot mutations, such as MAP3K4 A1199 deletion, NR1H2 R175 insertion, and GATA3 P409 insertion, with different functional associations. Our study addresses the long-standing challenge of explicitly distinguishing driver mutations from passengers, and nominates a set of putative driver mutations that possess diverse functional potentials.
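The general principle of calling a hotspot when a site carries more recurrent mutations than the background predicts can be sketched as a one-sided binomial test. This is a simplified stand-in, not the model proposed in the dissertation: it assumes one background rate shared by all sites (the dissertation's model stratifies by gene, mutation subtype, and di-nucleotide context) and uses a Bonferroni correction. The function names, counts, and rate below are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1, summed in log space."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log1p(-p)
        )
        total += exp(log_pmf)
    return total


def call_hotspots(site_counts, n_samples, background_rate, alpha=0.05):
    """Return sites whose Bonferroni-adjusted p-value falls below alpha.

    site_counts maps a site label to the number of samples mutated there;
    background_rate is the assumed per-sample probability of a mutation
    at a site under neutrality.
    """
    n_tests = len(site_counts)
    hotspots = {}
    for site, count in site_counts.items():
        p_value = binomial_upper_tail(count, n_samples, background_rate)
        if p_value * n_tests < alpha:
            hotspots[site] = p_value
    return hotspots


# Hypothetical counts from 500 tumors with an assumed background rate of 0.001.
counts = {"site_A": 12, "site_B": 1, "site_C": 0}
print(sorted(call_hotspots(counts, n_samples=500, background_rate=1e-3)))
```

In this toy input only the site mutated in 12 of 500 samples exceeds what the assumed background rate allows; a single mutated sample is consistent with chance.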

The translational genomics research I conducted during my Ph.D. study will benefit the cancer research community. The tools I developed will help answer translational genomics questions, such as the identification of biomarkers for clinical diagnostics and treatment, and will advance our understanding of the biological function of driver mutations toward the realization of personalized medicine.


Cancer genomics, -omics, mutation annotation, functional prediction, precision medicine