Cross-Study Analyses of Microbial Abundance Using Generalized Common Factor Methods
Micro-organisms seem to flagellate about wherever they please, in our bodies and in the natural and built environments, but they are more cunning than their meandering behavior would suggest. By creating networks of biochemical pathways, communities of microbes are able to modulate the properties of their environment and even biochemical processes within their hosts. Next-generation high-throughput sequencing has opened a new frontier in microbiology and microbial ecology, one that promises the ability to leverage the microbiome for good in every facet of our lives, and the stakes are high as global society hurtles toward several apocalyptic ecological crises. However, along with the fascinating complexity of microbial community dynamics come equally complex data considerations for researchers: genomic data are high-dimensional, sparse, noisy, and refuse to cooperate with authorities. In fact, datasets from different studies will not even cooperate with each other, which precludes the sorts of consensus-based validation and meta-analysis that we rely on in science. In this thesis we propose an ensemble approach for cross-study exploratory analyses of microbial abundance data, in which we first estimate the variance-covariance matrix of each dataset assuming Poisson sampling, and subsequently model these covariances jointly so as to find a shared low-dimensional subspace of the feature space. By viewing the projection of the latent true abundances onto this common structure, the variation is pared down to that which is shared among all datasets, and which is likely to reflect more generalizable biological signal than can be inferred from an individual dataset. We investigate several ways of achieving this, and demonstrate that they work well on simulated and real metagenomic data in terms of signal retention and interpretability.
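As a rough illustration of the two-stage idea, the Python sketch below first corrects each study's sample covariance for Poisson sampling noise and then extracts a subspace shared across studies. It rests on several assumptions that are not taken from the thesis: counts are used without library-size normalization, the simulated latent abundances follow a linear factor model, and the joint modeling step is replaced by a deliberately simple stand-in (the leading eigenvectors of the averaged covariance). All function names are hypothetical.

```python
import numpy as np


def poisson_corrected_cov(counts):
    """Estimate the covariance of latent abundances from a samples x features count matrix.

    If X | lam ~ Poisson(lam), the law of total covariance gives
    Cov(X) = Cov(lam) + diag(E[lam]), so the feature means are
    subtracted from the diagonal of the sample covariance.
    """
    counts = np.asarray(counts, dtype=float)
    sample_cov = np.cov(counts, rowvar=False)
    return sample_cov - np.diag(counts.mean(axis=0))


def shared_subspace(covs, k):
    """Return an orthonormal basis (features x k) for a common subspace.

    Stand-in for a joint model: top-k eigenvectors of the averaged covariance.
    """
    mean_cov = np.mean(covs, axis=0)
    eigvals, eigvecs = np.linalg.eigh(mean_cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order]


def project(counts, basis):
    """Project centered counts onto the shared basis (samples x k scores)."""
    counts = np.asarray(counts, dtype=float)
    return (counts - counts.mean(axis=0)) @ basis


def simulate_study(rng, basis, n_samples, baseline=100.0, scale=10.0):
    """Simulate counts whose latent rates vary only within span(basis).

    Assumed toy model: rates = baseline + scores @ basis.T, clipped to stay positive.
    """
    scores = rng.normal(scale=scale, size=(n_samples, basis.shape[1]))
    rates = np.clip(baseline + scores @ basis.T, 0.1, None)
    return rng.poisson(rates)
```

A possible usage pattern is to simulate a few studies with `simulate_study`, pass each through `poisson_corrected_cov`, hand the resulting list to `shared_subspace`, and then call `project` on each study to view its samples in the common coordinates.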