A hierarchical model of mutations with genotyping errors and maximum likelihood estimation of the male-to-female mutation rate ratio
In population genetics, short tandem repeats (STRs), which are highly prone to mutation, play a critical role in the estimation of the germline mutation rate. The paternal-to-maternal mutation rate ratio is of major importance to the male-driven evolution hypothesis. However, a statistical framework for investigating the factors that determine this ratio using STRs has not been established. Meanwhile, separating true STR mutations from genotyping errors in large-scale genotyping or sequencing data remains a challenge. In our study, we introduced a likelihood-based model for estimating the STR mutation rate and its association with influencing factors (such as parental allele length, parental age, irregular changes, mutation directions, and mutation steps). The approach provides unbiased and efficient parameter estimates as well as model selection methods. We also extended this method to a hierarchical model for situations in which genotyping errors exist. As both the mutation rate and the genotyping error rate increase with STR length, our extended model can assess mutation rates adjusted for the genotyping error rate. For parameter optimization, we employed a grid search technique specially designed for multidimensional parameters constrained to small values. Using extensive simulations, we evaluated the performance of our models under different scenarios. The proposed models were applied to the genome-wide linkage microsatellite data of the NHLBI Family Heart Study. Our results show that the estimates of the paternal-to-maternal mutation rate ratio range from 1.1 to 1.9 and that, under the assumption of a multistep mutation model, mutational direction maintains a stationary distribution of allele lengths. The genotyping error rate is not negligible, with an estimate of 5.5×10⁻³ per allele, and increases significantly with allele length.
Further applications of our methods include two aspects: first, the basic model without error terms can be directly applied to relatively error-free data from repeated genotyping, such as deCODE data or paternity test data sets; second, the extended model with error terms can account for the high sequencing error rate, e.g. PCR stutter noise, in STRs inferred from Next Generation Sequencing (NGS) data.
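As a rough illustration of the kind of estimation described above, the following minimal sketch fits a two-parameter version of the basic model (no error terms, no covariates) by grid search. It is not the dissertation's implementation: the binomial likelihood, the parameterization (maternal rate `mu_mat` and ratio `alpha`), the log-spaced grid, the function names, and the toy counts are all assumptions made for this example.

```python
import math


def binom_loglik(k, n, p):
    """Binomial log-likelihood of k mutations in n transmissions.

    The combinatorial constant is dropped because it does not depend on p.
    """
    return k * math.log(p) + (n - k) * math.log1p(-p)


def log_grid(lo, hi, num):
    """Return num points evenly spaced on a log scale between lo and hi."""
    step = (math.log(hi) - math.log(lo)) / (num - 1)
    return [math.exp(math.log(lo) + i * step) for i in range(num)]


def fit_rates(k_pat, n_pat, k_mat, n_mat, lo=1e-5, hi=1e-1, num=200):
    """Grid-search maximum likelihood estimate of (mu_mat, alpha).

    mu_mat is the maternal mutation rate per transmission and alpha is the
    paternal-to-maternal ratio, so the paternal rate is alpha * mu_mat.
    The log-spaced grid for mu_mat reflects the assumption that the rate
    is a small positive number.
    """
    mu_grid = log_grid(lo, hi, num)
    alpha_grid = [0.5 + 0.05 * i for i in range(91)]  # 0.5 to 5.0
    best_ll, best_mu, best_alpha = -math.inf, None, None
    for mu_mat in mu_grid:
        ll_mat = binom_loglik(k_mat, n_mat, mu_mat)
        for alpha in alpha_grid:
            mu_pat = alpha * mu_mat  # at most 0.5 with the default grids
            ll = ll_mat + binom_loglik(k_pat, n_pat, mu_pat)
            if ll > best_ll:
                best_ll, best_mu, best_alpha = ll, mu_mat, alpha
    return best_mu, best_alpha


# Toy counts for illustration only (not taken from the study).
mu_mat_hat, alpha_hat = fit_rates(k_pat=30, n_pat=10000, k_mat=20, n_mat=10000)
print(f"mu_mat ~ {mu_mat_hat:.4g}, alpha ~ {alpha_hat:.2f}")
```

In this simplified setting the estimates should land near the raw proportions (20/10000 for the maternal rate and a ratio near 30/20), up to the grid resolution. The hierarchical model described in the abstract adds genotyping error terms and covariates such as allele length and parental age, which this sketch omits.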
Sun, Jia, "A hierarchical model of mutations with genotyping errors and maximum likelihood estimation of the male-to-female mutation rate ratio" (2015). Texas Medical Center Dissertations (via ProQuest). AAI10027846.