Author ORCID Identifier

Date of Graduation


Document Type

Dissertation (PhD)

Program Affiliation

Biostatistics, Bioinformatics and Systems Biology

Degree Name

Doctor of Philosophy (PhD)

Advisor/Committee Chair

Sanjay S. Shete

Committee Member

Xuelin Huang

Committee Member

Jeffrey T. Chang

Committee Member

Jian Wang

Committee Member

Marcos R. Estecio


Genetic sequencing has become an effective approach for addressing biological problems such as clinical detection of disease, mutation discovery, and identification of biomarkers associated with complex diseases. Compared with conventional Sanger sequencing, next-generation sequencing costs much less because of massively parallel, high-throughput sequencing. However, because it produces large numbers of short reads, next-generation sequencing data often have higher error rates, which can affect downstream genomic analysis: even when a downstream method performs well, the quality of its results is still limited by the quality of the data. Data quality assessment is therefore necessary before proceeding to downstream analysis. Error rate estimation studies can describe the quality of the sequencing reads obtained from a sample, but they may be limited when sequencing a new genome, or by the linearity assumption that the number of reads containing errors increases linearly with the number of error-free read counts, which may not hold for all types of sequencing data. More reliable estimation of sequencing error rates is therefore needed.

In this dissertation, we proposed an empirical error rate estimation approach that employs nonlinear statistical models, cubic smoothing splines and robust smoothing splines, to analyze the association between the number of shadow counts and the number of error-free read counts. In addition, because traditional simulation approaches for sequencing data may not reflect the real structure of the data, we also proposed a frequency-based simulation approach that mimics the real sequencing count framework and is more computationally efficient.
Based on all the simulation scenarios tested, our proposed empirical error rate estimation approach provided more accurate estimates than the shadow regression approach. We also redefined the per-read error rate so that it is more flexible and conveys more information about the sequencing reads. The proposed empirical error rate estimation approach was applied to assess the sequencing error rates of bacteriophage PhiX DNA samples, a MicroArray Quality Control project, a mutation screening study, and the Encyclopedia of DNA Elements project. The proposed approach is free from the linearity assumption between the number of shadow counts and the number of error-free read counts and yields more accurate error rate estimates for next-generation short-read sequencing data.
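The central idea above, smoothing the shadow-count versus error-free-count relationship without a linearity assumption, can be sketched as follows. This is a minimal illustration using SciPy's cubic smoothing spline on synthetic data; the data, the smoothing level, and the per-read error summary are all illustrative assumptions, not the dissertation's actual implementation.

```python
# Hedged sketch: smoothing-spline fit of shadow counts against error-free
# read counts, as a nonlinear alternative to a linear shadow regression.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)

# Hypothetical data: x = error-free read counts per read type,
# y = corresponding shadow counts (near-identical reads caused by errors).
x = np.linspace(100, 10000, 200)
y = 0.01 * x * (1 + 0.1 * rng.standard_normal(x.size))

# Cubic smoothing spline (k=3); the smoothing level s is a tuning choice.
spline = UnivariateSpline(x, y, k=3, s=x.size * y.var())
fitted = spline(x)

# One plausible per-read error summary at each count level: the fitted
# shadow counts relative to the total reads at that level.
err_rate = fitted / (fitted + x)
```

A robust smoothing spline would replace the least-squares criterion with a robust loss to downweight outlying shadow counts; SciPy does not provide that directly, so this sketch shows only the cubic-spline case.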

The errors in next-generation sequencing data discussed in the literature concern sequences that differ from error-free reads by up to two substituted bases. In this dissertation, we also extended the study of error rate estimation to different shadow scenarios: substitution shadows in which only one base differs, only two bases differ, or up to two bases differ from the error-free sequencing reads, as well as deletion and insertion error rates. The deletion and insertion error rates were calculated using multiple approaches. For the extended investigation, both simulation studies and real data analyses were performed using the empirical error rate estimation and shadow regression approaches. Under the simulation scenarios tested, the empirical error rate estimation approach proved more accurate and yielded less biased estimates for the deletion and insertion analyses.
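The shadow scenarios described above can be made concrete with a small classifier that labels a read relative to an error-free reference: one-mismatch or two-mismatch substitution shadows for equal-length reads, and single-base deletion or insertion shadows for reads one base shorter or longer. This is an illustrative sketch with hypothetical function names, not the dissertation's actual code.

```python
# Hedged sketch: classify a "shadow" read by how it differs from an
# error-free reference read. All names here are illustrative.

def substitution_distance(read, ref):
    """Number of mismatched bases between two equal-length reads."""
    if len(read) != len(ref):
        raise ValueError("substitution comparison requires equal lengths")
    return sum(a != b for a, b in zip(read, ref))

def classify_shadow(read, ref):
    """Label a read relative to an error-free reference read."""
    if len(read) == len(ref):
        d = substitution_distance(read, ref)
        if d == 0:
            return "error-free"
        if d <= 2:
            return f"substitution-{d}"  # one- or two-mismatch shadow
        return "other"
    # One base shorter: a single-base deletion shadow if removing some
    # base of the reference reproduces the read.
    if len(read) == len(ref) - 1:
        if any(ref[:i] + ref[i + 1:] == read for i in range(len(ref))):
            return "deletion-1"
    # One base longer: a single-base insertion shadow if removing some
    # base of the read reproduces the reference.
    if len(read) == len(ref) + 1:
        if any(read[:i] + read[i + 1:] == ref for i in range(len(read))):
            return "insertion-1"
    return "other"
```

Counting reads per label across a sample would then give the shadow counts used in the substitution-only, deletion, and insertion error rate analyses.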

In this dissertation, we also studied the human microbiome, which has been associated with complex diseases such as cardiovascular disease, diabetes, obesity, and specific cancers. To our knowledge, the existing literature rarely discusses how to process raw human microbiome data into the operational taxonomic unit format used in downstream analysis. We reviewed multiple statistical methods for processing, summarizing, and analyzing microbiome data, and provided detailed programming scripts showing how to convert human microbiome data into a downstream analysis format and assess alpha diversity, beta diversity, and the association between sample diversities and the outcome of interest. For illustration, these statistical approaches were applied to analyze the foregut microbiome in esophageal adenocarcinoma.
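The diversity summaries mentioned above can be sketched from an operational taxonomic unit (OTU) count table: Shannon entropy is one common alpha diversity index, and Bray-Curtis dissimilarity is one common beta diversity measure. The count table below is hypothetical, and these are only two of several indices the dissertation reviews.

```python
# Hedged sketch: Shannon alpha diversity per sample and Bray-Curtis beta
# diversity between samples, from an illustrative OTU count table.
import numpy as np

counts = np.array([
    [10, 0, 5, 20],   # sample A: counts for four OTUs (illustrative)
    [ 2, 8, 5, 10],   # sample B
])

def shannon(x):
    """Shannon alpha diversity of one sample's OTU counts."""
    p = x[x > 0] / x.sum()          # relative abundances of observed OTUs
    return float(-(p * np.log(p)).sum())

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples (0 = identical)."""
    return float(np.abs(a - b).sum() / (a + b).sum())

alpha = [shannon(sample) for sample in counts]
beta = bray_curtis(counts[0], counts[1])
```

In practice these diversities would then be related to the outcome of interest (for example, disease status) with an appropriate regression or testing framework.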

As a future direction of research, we provided a potentially effective strategy for analyzing longitudinal human microbiome data associated with complex diseases using a Bayesian network approach. The Bayesian graphical model accounts for interactions within the microbiome and captures the common structure of change within it, providing insight for predicting changes in microbiome composition over time and for identifying specific microbiome taxa that may be associated with complex diseases.


Genetic sequencing, error rates, human microbiome, spline, shadows, alpha diversity, beta diversity, Bayesian network

Available for download on Thursday, November 02, 2017