INFERENCE AND INVESTIGATION OF MARINE MICROBIAL COMMUNITY STRUCTURES IN THE GLOBAL OCEANS
Marine microbial communities are complex and represent a serious analytical challenge. The Bayesian model for inference of microbial community structure (BioMiCo) was used to characterize microbial populations using 16S rRNA within polar, tropical, and temperate environmental zones. Global-scale and local analyses were performed on 356 microbial samples and 72,853 OTUs within the ICOMM database. Global analysis showed that polar and tropical zones had distinct community structures with high predictive value and little seasonal variation, whereas seasonal variation was noticeable in the temperate zone. Local analysis of polar communities demonstrated that the Arctic and Antarctic zones had distinct community structures. Within the North Atlantic, temporal heterogeneity differed locally, and this impeded the predictive value of models for the entire North Atlantic. Training a model on a single, well-sampled North Atlantic site, L4 in the English Channel, substantially improved the predictive value of the model. Finally, the model for the L4 site had predictive value for other English Channel sites, but not for more distant sites within the western and eastern North Atlantic. This result appears to be due to differences among North Atlantic sites in the timing of their seasonal community transitions, and to the fact that most other sites have not been nearly as well sampled as the L4 site. The only other well-sampled site in the North Atlantic (Bedford Basin) also exhibits regular seasonal transitions from year to year. Taken together, these results suggest that environmental changes are the primary drivers of marine biogeographic patterns within the North Atlantic. Four methodological investigations were applied to the Arctic and Antarctic samples, and to the samples from the L4 station in the English Channel, to explore how the choices users make when running BioMiCo affect their inferences.
The first was an exploration of different ways of defining the predominant OTUs within an assemblage. The size of the assemblage was very sensitive to the method. I recommend defining predominant OTUs as those having >0.01 posterior probability, as this was the most conservative. The second was an exploration of the impact of “burn-in”. As expected, increasing burn-in yielded more stable assemblages; however, the burn-in did not need to exceed 1000 iterations. The third was an exploration of the effect of training and testing design on prediction of Arctic and Antarctic samples. The results showed that better predictions were obtained from larger training sets of data. However, training on more than 2/3 of the data did not generate significant improvement. Thus, designs such as leave-one-out cross-validation can be reserved for cases where the total sample size is very small. Otherwise, users should run several replicates on data randomly divided into 2/3 training sets and 1/3 test sets. The fourth explored the effect of pre-specifying different numbers of assemblages (the value of L within the model). The results showed that running the model with 25 assemblages was sufficient. In conclusion, the choices that users make when running the MCMC can impact their results, but the approach is robust, and good results can be obtained with just L=25 if the training data are of a sufficient size and a sufficient amount of burn-in is discarded.
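Two of the recommendations above, the >0.01 posterior probability cutoff for predominant OTUs and the replicated random 2/3 training and 1/3 test design, can be illustrated with a minimal Python sketch. This is not BioMiCo code: the function names, the dictionary layout for posterior probabilities, and the default of five replicates are all assumptions made purely for illustration.

```python
import random


def predominant_otus(posterior, threshold=0.01):
    """Return the OTU ids whose posterior probability exceeds the cutoff.

    `posterior` is a hypothetical mapping from OTU id to its posterior
    probability within one assemblage (layout assumed for illustration).
    """
    return sorted(otu for otu, p in posterior.items() if p > threshold)


def replicate_splits(sample_ids, n_replicates=5, train_fraction=2 / 3, seed=0):
    """Build several random train/test partitions of the sample ids.

    Each replicate shuffles the samples independently and assigns roughly
    `train_fraction` of them to training and the remainder to testing.
    """
    rng = random.Random(seed)  # fixed seed so the replicates are reproducible
    n_train = round(len(sample_ids) * train_fraction)
    splits = []
    for _ in range(n_replicates):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits


if __name__ == "__main__":
    # Made-up posterior probabilities for four hypothetical OTUs.
    example = {"otu_a": 0.5, "otu_b": 0.01, "otu_c": 0.02, "otu_d": 0.001}
    print(predominant_otus(example))

    # Nine hypothetical samples, three replicate 2/3 : 1/3 partitions.
    for train, test in replicate_splits(list(range(9)), n_replicates=3):
        print(len(train), len(test))
```

The cutoff is applied as a strict inequality, so an OTU with posterior probability of exactly 0.01 is excluded. Each replicate would then be used to train and evaluate the model separately, with predictive performance summarized across replicates.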