INFERRING ORTHOLOGOUS RELATIONSHIPS AND GENE TRANSFER IN MICROBIAL GENOMES AND METAGENOMES
Wong, Dennis H.-J.
Interest in microbial life and progress in DNA sequencing technology have led to thousands of sequenced bacterial genomes. In this thesis I develop approaches to identify Lateral Gene Transfer (LGT) in metagenomes, develop fast sequence clustering approaches to create the clusters needed in comparative genomic analyses, and apply them to large data sets. In chapter two, I identify LGT in two of three metagenomes of phosphorus-removing bacteria from sewage-treatment plants, with none detected in a United States of America community, two in a Danish community and five in an Australian community. The analyses account for the limitations of metagenomic sequence data and focus on gene transfers in energy-related metabolic pathways. These transfers affect pathways associated with the different input carbon feeds for each community, suggesting recent adaptation among community members. This is the first published analysis focusing on the role and direction of transferred genes in a community using metagenomes. In chapter three, I develop two methods to define and refine clusters of homologous sequences from sequenced genomes: ProPhylClust to identify large protein families, and PhyloSubClust to subcluster large protein families based on phylogeny to recover orthologous relationships. ProPhylClust uses a species phylogeny as a guide tree, giving runtimes that scale approximately linearly with increasing numbers of genomes, whereas the runtimes of all-versus-all homology-search methods scale quadratically. Two different sets of genomes were used, one spanning 24 bacterial phyla and the other sampled from the phylum Proteobacteria. Although the sequence comparisons in ProPhylClust make it slower than competing approaches on small genome sets, the hierarchical approach of ProPhylClust yielded equal or faster runtimes on sets with 100 or more genomes. In chapter four, 558 incomplete and complete genomes from the class Clostridia were clustered using ProPhylClust and PhyloSubClust.
Of 18 clusters containing toxin proteins and their regulators from Peptoclostridium difficile (toxins A/B), Clostridium botulinum (botulinum toxin) and Clostridium tetani (tetanus toxin), two (one botulinum-tetani toxin cluster and one toxin A/B cluster) revealed homologous sequences considered non-toxic. Hierarchical clustering of phylogenetic profiles identified potentially toxin-related protein families of unknown function located on the same sequence contig or chromosome as the toxins, but not in toxin operons. The computational analysis of large genomic data sets to derive biologically relevant knowledge will continue to be a challenge for years to come. Here, I focused on computational methods relevant to identifying LGT in environmental sequence data, constructing clusters of homologous sequences from genomes, and obtaining functionally associated sequences based on phylogenetic distributions. Promising results were produced in each chapter: gene transfer events were found in phosphorus-removing sewage-treatment communities, runtimes for cluster construction were more manageable than those of other methods on larger data sets, and sequences were identified that are possibly functionally relevant to toxins in C. botulinum and P. difficile.
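
As a rough illustration of the scaling contrast described for chapter three, the following minimal sketch counts the comparison steps under two simplifying assumptions that are not taken from the thesis: an all-versus-all search compares every unordered pair of genomes once, and a guide-tree approach performs one merge step per internal node of a rooted binary species tree. The function names are hypothetical, and the sketch does not model the actual cost of each step in ProPhylClust, which is why the abstract reports it as slower on small genome sets.

```python
def all_versus_all_comparisons(n_genomes: int) -> int:
    """Unordered genome pairs compared in an all-versus-all search (assumed model)."""
    return n_genomes * (n_genomes - 1) // 2


def guide_tree_merges(n_genomes: int) -> int:
    """Merge steps along a rooted binary guide tree: one per internal node (assumed model)."""
    return max(n_genomes - 1, 0)


if __name__ == "__main__":
    # 558 is the number of Clostridia genomes mentioned in the abstract;
    # 10 and 100 are arbitrary illustrative sizes.
    for n in (10, 100, 558):
        print(n, all_versus_all_comparisons(n), guide_tree_merges(n))
```

Under these assumptions the pair count grows quadratically while the number of merge steps grows linearly, which is consistent with the hierarchical approach becoming competitive only once the genome set is large.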