Efficient Computation and Application of Maximum Agreement Forests
Abstract
Rampant lateral gene transfer (LGT) among prokaryotes, hybridization in plants and
other reticulate evolutionary processes invalidate typical phylogenetic tree models by
violating the assumption that organisms only inherit genetic information from a single
parent species. Comparing the different evolutionary histories of multiple genes
is necessary to identify and assess these processes. In this work I develop efficient
approximation and fixed-parameter algorithms for computing rooted maximum agreement forests (MAFs) and maximum acyclic agreement forests (MAAFs) of pairs of
phylogenetic trees. Their sizes correspond to the subtree-prune-and-regraft (SPR)
distance and the hybridization number of these pairs of trees, which are important
measures of the dissimilarity of phylogenies used in studying reticulate evolution.
Although these MAFs and MAAFs are NP-hard to compute, my fixed-parameter
algorithms are practical because they scale exponentially with the computed distance
rather than the size of the trees. I contribute efficient fixed-parameter algorithms for
computing MAFs and MAAFs of two binary rooted trees and give the first efficient
fixed-parameter and approximation algorithms for computing MAFs of two multifurcating
rooted trees. My open-source implementation of the MAF algorithms is orders
of magnitude faster than previous approaches, reducing the time required to compute
SPR distances of 46 between trees of 144 species to fractions of a second whereas
previous approaches required hours to compute SPR distances of 25.
These fast MAF-based distance metrics enable the construction of supertrees to
reconcile a collection of gene trees and rapid inference of LGT. Simulations demonstrate that supertrees minimizing the SPR distance are more accurate than other
supertree methods under plausible rates of LGT. I constructed an SPR supertree
from a phylogenomic dataset of 40,631 gene trees covering 244 genomes from several
major bacterial phyla and inferred "highways" of gene transfer between these bacterial
classes and genera; a small number of these highways connect distantly related
genera and can highlight specific genes implicated in long-distance LGT. These fast
MAF algorithms are thus practical and enable new analyses of reticulate evolution.