Assessing and Improving the Reliability of Models of Molecular Evolution
The site models of codon substitution used to detect positive selection at amino acid sites first use a pre-screening likelihood ratio (LR) test for positive selection at the level of the protein. Due to statistical irregularity, the large-sample distributions of the LR statistic are often not justified, and thresholds determined from those distributions can give larger than expected type I error rates. Presented in Chapter 2 is a modified LR test for protein-level selection. The modified LR test is shown to restore statistical regularity and so give tractable LR statistic distributions. After the pre-screening LR test, most codon substitution models use an empirical Bayes approach to detect positive selection at individual amino acid sites. After model parameters are estimated via maximum likelihood, they are passed to Bayes' formula to compute the posterior probability that a site evolved under positive selection. A difficulty with the empirical Bayes approach is that estimates with large errors can negatively impact classification. Presented in Chapter 3 is a new technique called smoothed bootstrap aggregation (SBA) that uses bootstrapping and kernel smoothing to accommodate uncertainty in the estimates. Simulation results show that SBA balances accuracy and power at least as well as Bayes empirical Bayes (BEB), and when parameter estimates are unstable, the performance gap between BEB and SBA can widen in favour of SBA. Branch-site models of codon substitution, like the site models, can detect positive selection at a subset of amino acid sites. Unlike the site models, however, the branch-site pre-screening LR test limits positive selection to prespecified branches on the phylogeny. Chapter 4 includes new simulation studies, which reveal limitations of these widely used models. 
The branch-site LR distributions under the null hypothesis are sometimes poorly approximated by those predicted by theory and can vary substantially according to factors such as the branches considered for positive selection and irregularity of certain parameter estimates. Of particular concern is that uncontrolled false positives are shown to occur when positive selection has occurred in the tree but not along the prespecified branches.
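
As an illustration of the empirical Bayes step described above, the following is a minimal sketch of how plug-in estimates might be passed to Bayes' formula to obtain the posterior probability that a site belongs to the positive-selection class. It is not the thesis's implementation: the function name, the three-class layout and all numerical values are hypothetical, and the per-class site likelihoods are assumed to have been computed elsewhere (in practice they come from the phylogenetic likelihood calculation).

```python
def site_class_posteriors(class_likelihoods, class_proportions):
    """Apply Bayes' formula at one site.

    class_likelihoods[k]  : likelihood of the site's data given site class k
    class_proportions[k]  : estimated prior proportion of sites in class k
    Returns P(class k | data) for every k.
    """
    if len(class_likelihoods) != len(class_proportions):
        raise ValueError("need one likelihood per site class")
    # Numerator of Bayes' formula for each class: prior * likelihood.
    joint = [p * lik for p, lik in zip(class_proportions, class_likelihoods)]
    total = sum(joint)  # marginal likelihood of the site's data
    if total <= 0:
        raise ValueError("site likelihood must be positive")
    return [j / total for j in joint]


# Hypothetical maximum likelihood estimates of the class proportions for a
# three-class site model (omega < 1, omega = 1, omega > 1).
proportions = [0.7, 0.2, 0.1]
# Hypothetical likelihoods of one site's data under each class.
likelihoods = [1e-6, 2e-6, 8e-6]

posteriors = site_class_posteriors(likelihoods, proportions)
prob_positive = posteriors[-1]  # posterior for the omega > 1 class
```

With these made-up inputs the posterior for the positive-selection class is 0.8/1.9, or about 0.42. Because the proportions are treated as known, any error in them passes directly into the posterior, which is the difficulty that motivates accommodating uncertainty in the estimates.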