# Models of DNA sequence evolution

#### Abstract

Models of DNA sequence evolution and methods for estimating evolutionary distances are needed for studying the rate and pattern of molecular evolution and for inferring the evolutionary relationships of organisms or genes. In this dissertation, several new models and methods are developed. The rate variation among nucleotide sites: To obtain unbiased estimates of evolutionary distances, the rate heterogeneity among nucleotide sites of a gene should be considered. Commonly, it is assumed that the substitution rate varies among sites according to a gamma distribution (gamma model) or, more generally, an invariant+gamma model which includes some invariable sites. A maximum likelihood (ML) approach was developed for estimating the shape parameter of the gamma distribution $(\alpha)$ and/or the proportion of invariable sites $(\theta).$ Computer simulation showed that (1) under the gamma model, $\alpha$ can be well estimated from 3 or 4 sequences if the sequence length is long; and (2) the distance estimate is unbiased and robust against violations of the assumptions of the invariant+gamma model. However, this ML method requires a huge amount of computational time and is useful only for less than 6 sequences. Therefore, I developed a fast method for estimating $\alpha,$ which is easy to implement and requires no knowledge of tree. A computer program was developed for estimating $\alpha$ and evolutionary distances, which can handle the number of sequences as large as 30. Evolutionary distances under the stationary, time-reversible (SR) model: The SR model is a general model of nucleotide substitution, which assumes (i) stationary nucleotide frequencies and (ii) time-reversibility. It can be extended to SRV model which allows rate variation among sites. I developed a method for estimating the distance under the SR or SRV model, as well as the variance-covariance matrix of distances. Computer simulation showed that the SR method is better than a simpler method when the sequence length $L>1,000$ bp and is robust against deviations from time-reversibility. As expected, when the rate varies among sites, the SRV method is much better than the SR method. The evolutionary distances under nonstationary nucleotide frequencies: The statistical properties of the paralinear and LogDet distances under nonstationary nucleotide frequencies were studied. First, I developed formulas for correcting the estimation biases of the paralinear and LogDet distances. The performances of these formulas and the formulas for sampling variances were examined by computer simulation. Second, I developed a method for estimating the variance-covariance matrix of the paralinear distance, so that statistical tests of phylogenies can be conducted when the nucleotide frequencies are nonstationary. Third, a new method for testing the molecular clock hypothesis was developed in the nonstationary case.

#### Subject Area

Genetics

#### Recommended Citation

Gu, Xun, "Models of DNA sequence evolution" (1996). *Texas Medical Center Dissertations (via ProQuest)*. AAI9626090.

https://digitalcommons.library.tmc.edu/dissertations/AAI9626090