Advances in bacterial promoter prediction: Genome-wide and cross-species approaches for enhanced identification, and comparative studies
Date
2024-12-10
Abstract
Promoters are short DNA segments that play a key role in initiating gene expression, which is critical to the production of proteins. Experimental methods for identifying promoters are time-consuming and expensive, so computational methods are an attractive alternative. Automating promoter identification can be a valuable supplement to experimental validation, with the possibility of enhancing its efficiency. Yet promoter peculiarities such as short sequence length, variable composition, and a lack of reliable annotated samples for most organisms make promoters challenging to identify. Many methods have tried to address this problem, but these properties continue to hinder prediction, and inconsistencies in the definitions of promoters and promoter sets make predictions difficult to compare across methods.
The main objective of this thesis is to better understand and improve bacterial promoter prediction. We first introduce a new classification tool for bacterial promoters, Expositor, and a new feature encoder, pc3mer. We then extend a model originally trained using reference promoters to probable promoters in other bacterial species, naming it ExpositorOS. Next, we attempt to improve our solution with recently published deep-learning methods and sequence representations, extending ExpositorOS to create ExpositorOS-t-mer; however, on our standardized dataset we observe no improvement relative to our previously introduced method. We finish with a comparative analysis of recently published promoter-prediction tools that differ in their feature encodings, machine-learning approaches, and even the species used for model training. We observe a striking lack of concordance between the predictions of different approaches, especially with increasing levels of sequence divergence.
We also demonstrate that the absence of a standardized promoter definition makes it difficult to compare accuracy across approaches. Future advances will depend on several factors, including new DNA feature encodings and classification tools, and, especially, carefully defined promoter sets to ensure that correct promoter properties are learned and to elucidate the impact of new techniques.
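One simple way to quantify concordance between prediction tools is the pairwise Jaccard index over the sets of sequences each tool calls a promoter. The abstract does not say how concordance was measured in the thesis, so the sketch below is an illustrative assumption, and the tool names and sequence identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index of two sets; defined here as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_concordance(calls: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Jaccard index for every pair of tools, keyed by the sorted tool names."""
    return {
        (t1, t2): jaccard(calls[t1], calls[t2])
        for t1, t2 in combinations(sorted(calls), 2)
    }


# Hypothetical data: sequence IDs that each (made-up) tool predicts as promoters.
calls = {
    "tool_a": {"seq1", "seq2", "seq3"},
    "tool_b": {"seq2", "seq3", "seq4"},
    "tool_c": {"seq5"},
}

scores = pairwise_concordance(calls)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

Values near 1 indicate that two tools flag largely the same sequences, and values near 0 indicate little overlap. A chance-corrected statistic such as Cohen's kappa would be a reasonable alternative when negative calls should also count toward agreement.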
Keywords
Machine learning, Promoter, Bacteria, Prokaryotes, Prediction, Tokenization, Feature encoding