Advances in bacterial promoter prediction: Genome-wide and cross-species approaches for enhanced identification, and comparative studies

Rafante Bernardino, Miria

Files

MiriaRafanteBernardino2024.pdf (7 MB)

Date

2024-12-10

Authors

Rafante Bernardino, Miria

Abstract

Promoters are short DNA segments that play a key role in initiating gene expression, which is critical to the production of proteins. Experimental methods for identifying promoters are time-consuming and expensive, so computational methods are an attractive alternative. Automated promoter identification can be a valuable supplement to experimental validation and may make it more efficient. Yet promoter peculiarities such as short sequence length, variable composition, and the lack of reliable annotated samples for most organisms make promoters challenging to identify. Many solutions have tried to address this problem, but these properties make promoters difficult to predict, and inconsistencies in the definitions of promoters and promoter sets make predictions difficult to compare across methods. The main objective of this thesis is to better understand and improve bacterial promoter prediction. We do this first by introducing a new classification tool for bacterial promoters, Expositor, and a new feature encoder, pc3mer; we then extend a model originally trained on reference promoters to probable promoters in other bacterial species, naming it ExpositorOS. Next, we attempt to improve our solution with recently published deep-learning methods and sequence representations, extending ExpositorOS to create ExpositorOS-t-mer; however, on our standardized dataset we observe no improvement relative to our previously introduced method. We finish with a comparative analysis of recently published promoter-prediction tools that differ in feature encoding, machine-learning approach, and even the species used for model training. We observe a striking lack of concordance between the predictions of different approaches, especially with increasing levels of sequence divergence. We also demonstrate that the absence of a standardized promoter definition makes it difficult to compare accuracy across approaches.
Future advances will depend on several factors, including new DNA feature encodings and classification tools, and, especially, carefully defined promoter sets to ensure that correct promoter properties are learned and to elucidate the impact of new techniques.

Keywords

Machine learning, Promoter, Bacteria, Prokaryotes, Prediction, Tokenization, Feature encoding

URI

https://hdl.handle.net/10222/84765

Collections

Faculty of Graduate Studies Online Theses
