Unsupervised Clustering of Time Series from Microbial Marker-Gene Data
Microorganisms interact with each other and the world around us, impacting every environment that they inhabit. DNA sequencing technology allows us to monitor entire communities of microorganisms. Using taxonomic marker genes, the abundance of thousands of microbial species can be tracked across time. Marker-gene data sets are often very large, requiring data reduction techniques for effective analysis. The typical approach involves clustering the DNA sequences by sequence identity, grouping similar sequences into operational taxonomic units. The emergence of marker-gene data sets with a temporal component offers opportunities to cluster genes based on temporal correlation rather than sequence identity; such an approach may be more effective in revealing ecologically meaningful associations. In this work, we describe an algorithm and software package for clustering marker-gene data based on time-series profiles. We present an efficient, interactive, and cross-platform solution that takes the user from raw sequence data to informative visualizations of the inferred clusters. We validate our method on simulated data and apply it to several longitudinal marker-gene data sets, including faecal communities from the human gut and communities from a freshwater lake sampled over eleven years. Within the gut, the segregation of the time series around a food poisoning event was immediately clear. In the freshwater lake, the seasonal dynamics of an annual summer bloom were isolated and highlighted by our method. We show that high sequence similarity between marker genes does not guarantee similar temporal dynamics. As a result, clustering based on sequence identity alone would hide many important patterns in these data sets. Our algorithm and visualization platform bring these patterns back to the surface.
Finally, we demonstrate that multiple time series can be clustered simultaneously, providing a unique way to visualize marker-gene data sets with both longitudinal and cross-sectional components.
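
To make the idea of grouping sequences by temporal profile rather than by sequence identity concrete, the following is a minimal sketch and not the algorithm or software described in this work. It assumes Pearson correlation distance and a simple threshold-based single-linkage grouping; the function names, the `max_distance` parameter, and the toy abundance counts are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no temporal signal
    return cov / sqrt(var_x * var_y)


def cluster_by_correlation(series, max_distance=0.2):
    """Group profiles whose correlation distance (1 - r) is within max_distance.

    Illustrative assumption: single-linkage grouping via union-find, so two
    profiles share a cluster if a chain of close pairs connects them.
    """
    names = list(series)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing paths.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if 1.0 - pearson(series[a], series[b]) <= max_distance:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical abundance counts over eight time points (not real data).
counts = {
    "seq_A": [1, 5, 20, 5, 1, 4, 22, 6],
    "seq_B": [2, 9, 41, 11, 2, 8, 40, 12],
    "seq_C": [20, 6, 1, 5, 21, 5, 1, 4],
}
clusters = cluster_by_correlation(counts)
print(clusters)
```

In this toy input, seq_A and seq_B rise and fall together and end up in one cluster, while seq_C follows the opposite pattern and is kept apart, regardless of how similar the underlying sequences might be.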