Date of Graduation

8-2015

Document Type

Dissertation (PhD)

Program Affiliation

Biostatistics, Bioinformatics and Systems Biology

Degree Name

Doctor of Philosophy (PhD)

Advisor/Committee Chair

Wenyi Wang, Ph.D.

Committee Member

Keith A. Baggerly, Ph.D.

Committee Member

Han Liang, Ph.D.

Committee Member

Paul A. Scheet, Ph.D.

Committee Member

Louise C. Strong, M.D.

Abstract

Next generation sequencing technology has been widely used in genomic analysis, but its usefulness is limited by true variants that go undetected, especially when those variants are rare. We proposed a family-based variant calling method, FamSeq, that integrates Mendelian transmission information and de novo mutation with sequencing data to improve variant calling accuracy. We investigated the factors that affect the improvement from family-based variant calling in simulated data and validated the method in real sequencing data. In both simulated and real data, FamSeq performed better than the single-individual-based method.
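The idea of combining Mendelian transmission with per-individual sequencing evidence can be illustrated with a minimal sketch for a single father-mother-child trio at one biallelic site. This is not the FamSeq implementation: the function names, the Hardy-Weinberg prior, the allele frequency, and the mutation rate are all illustrative assumptions, and the posterior is obtained by brute-force enumeration rather than by any of the four methods described in the dissertation.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # genotype coded as the number of alternate alleles


def hardy_weinberg_prior(alt_freq):
    """Founder genotype prior under Hardy-Weinberg equilibrium (assumption)."""
    q = alt_freq
    p = 1.0 - q
    return {0: p * p, 1: 2.0 * p * q, 2: q * q}


def transmit_prob(parent_genotype, allele, mutation_rate):
    """P(parent passes `allele`), where 0 = reference and 1 = alternate.

    The Mendelian chance of passing the alternate allele is genotype / 2;
    the passed allele then flips with probability `mutation_rate`.
    """
    p_alt = parent_genotype / 2.0
    p_alt = p_alt * (1.0 - mutation_rate) + (1.0 - p_alt) * mutation_rate
    return p_alt if allele == 1 else 1.0 - p_alt


def child_given_parents(child, father, mother, mutation_rate):
    """P(child genotype | parental genotypes), allowing de novo mutation."""
    total = 0.0
    for from_father, from_mother in product((0, 1), repeat=2):
        if from_father + from_mother == child:
            total += (transmit_prob(father, from_father, mutation_rate)
                      * transmit_prob(mother, from_mother, mutation_rate))
    return total


def single_posterior(likelihood, alt_freq=0.01):
    """Posterior for one individual, ignoring relatives."""
    prior = hardy_weinberg_prior(alt_freq)
    weights = {g: prior[g] * likelihood[g] for g in GENOTYPES}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


def trio_posteriors(likelihoods, alt_freq=0.01, mutation_rate=1e-7):
    """Marginal genotype posteriors for each trio member.

    `likelihoods` maps "father", "mother", "child" to a dict of
    P(reads | genotype) for genotypes 0, 1, 2.
    """
    prior = hardy_weinberg_prior(alt_freq)
    members = ("father", "mother", "child")
    post = {m: {g: 0.0 for g in GENOTYPES} for m in members}
    for f, m, c in product(GENOTYPES, repeat=3):
        joint = (prior[f] * prior[m]
                 * child_given_parents(c, f, m, mutation_rate)
                 * likelihoods["father"][f]
                 * likelihoods["mother"][m]
                 * likelihoods["child"][c])
        post["father"][f] += joint
        post["mother"][m] += joint
        post["child"][c] += joint
    for member in members:
        norm = sum(post[member].values())
        post[member] = {g: v / norm for g, v in post[member].items()}
    return post


if __name__ == "__main__":
    # Hypothetical likelihoods: the child's reads are ambiguous between
    # homozygous reference and heterozygous; the mother is clearly heterozygous.
    example = {
        "father": {0: 1.0, 1: 1e-6, 2: 1e-12},
        "mother": {0: 1e-6, 1: 1.0, 2: 1e-6},
        "child": {0: 0.5, 1: 0.5, 2: 0.0},
    }
    print("single:", single_posterior(example["child"]))
    print("family:", trio_posteriors(example)["child"])
```

In this toy setting the child's ambiguous reads are dominated by the rare-variant prior when the child is analyzed alone, whereas the mother's clear heterozygous call raises the child's heterozygous posterior once transmission is taken into account.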

In FamSeq, we implemented four different methods for the Mendelian genetic model to accommodate variations in data complexity. We parallelized the Bayesian network algorithm on an NVIDIA graphics processing unit, making it 10 times faster for relatively large families. Our simulations show that the Elston-Stewart algorithm performs best when there is no loop in the pedigree. If there are loops, we recommend the Bayesian network method, which provides exact answers.

Next generation sequencing technology has been under development for more than ten years, and many different sequencing platforms have been created to generate sequencing data. Although each platform has its own strengths and weaknesses, analyses usually rely on data from only the latest platform. Here we propose a method based on a Bayesian hierarchical model to combine sequencing data from multiple platforms. We applied the method to both simulated and real data. The results showed that it reduced the variant calling error rate compared with the single-platform method.
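To illustrate why pooling evidence across platforms can help, the sketch below combines per-platform genotype likelihoods at a single site. It is a deliberately simplified stand-in, not the Bayesian hierarchical model proposed in the dissertation: it assumes reads from different platforms are conditionally independent given the genotype, uses a binomial read model with a fixed per-platform error rate, and all function names and numbers are hypothetical.

```python
import math


def binomial_genotype_likelihoods(alt_reads, total_reads, error_rate):
    """P(alt_reads | genotype) for genotypes with 0, 1, 2 alternate alleles.

    Assumes each read shows the alternate allele with probability
    g/2 when error-free, and is miscalled with probability `error_rate`.
    `error_rate` must lie strictly between 0 and 1.
    """
    likelihoods = []
    for g in (0, 1, 2):
        frac = g / 2.0
        p_alt = frac * (1.0 - error_rate) + (1.0 - frac) * error_rate
        likelihoods.append(
            math.comb(total_reads, alt_reads)
            * p_alt ** alt_reads
            * (1.0 - p_alt) ** (total_reads - alt_reads)
        )
    return likelihoods


def combine_platforms(observations, prior):
    """Genotype posterior from several platforms.

    `observations` is a list of (alt_reads, total_reads, error_rate) tuples,
    one per platform; `prior` is a list of three genotype prior probabilities.
    Likelihoods are multiplied in log space (conditional independence).
    """
    log_post = [math.log(p) for p in prior]
    for alt_reads, total_reads, error_rate in observations:
        likes = binomial_genotype_likelihoods(alt_reads, total_reads, error_rate)
        for g in range(3):
            log_post[g] += math.log(likes[g])
    shift = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(x - shift) for x in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]


if __name__ == "__main__":
    prior = [0.9801, 0.0198, 0.0001]  # Hardy-Weinberg, alt frequency 0.01
    platform_a = (3, 10, 0.01)        # hypothetical counts and error rate
    platform_b = (4, 12, 0.02)
    print("platform A only:", combine_platforms([platform_a], prior))
    print("both platforms: ", combine_platforms([platform_a, platform_b], prior))
```

A hierarchical model would go further than this product of likelihoods, for example by letting platform-specific error behavior be estimated from the data instead of fixed in advance.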

Besides applying Mendelian transmission to sequencing data analysis, we also used it to estimate the probability of carrying a TP53 mutation in Li-Fraumeni syndrome (LFS). LFS is an autosomal dominant hereditary disorder, and people with LFS have a high risk of developing early-onset cancers. We proposed LFSpro, which is built on a Mendelian model and estimates the TP53 mutation carrier probability while incorporating de novo mutation rates. Using independent validation data from 765 families, we compared LFSpro estimates with the classic LFS and Chompret clinical criteria. LFSpro outperformed the Chompret and classic criteria in the pediatric sarcoma cohort and was comparable to the Chompret criteria in the adult sarcoma cohort.

Keywords

Next Generation Sequencing, Variant Calling, Mendelian, Risk Prediction, Li-Fraumeni Syndrome, Germline Mutation, TP53, Multi-platform, Bayesian
