Novel Approaches to Marker Gene Representation Learning Using Trained Tokenizers and Jointly Trained Transformer Models

Authors

Manuele, Alexander

Abstract

Next-generation DNA sequencing technologies have made marker-gene DNA sequence data widely available. Analysis of microbiome data faces several challenges, including sparsity, high cardinality, and intra-study dependencies introduced during feature engineering. Language-modelling techniques may provide a means to overcome these challenges. The first step in sequence modelling is dividing the sequence into sensible tokens. We show that two trained tokenization strategies, byte-pair encoding and unigram language modelling, can replace traditional sliding-window segmentation of DNA marker genes in classification, clustering, and language-modelling tasks. We then propose a novel approach to feature representation of DNA marker genes: a training scheme that learns dense vector representations of DNA sequences using transformer language models optimized on pairwise DNA sequence alignment scores. We demonstrate that our representations match or exceed previously published approaches for the treatment of individual marker genes and of microbiome samples, while providing fixed-length, low-cardinality representations of each.

Keywords

Representation Learning, Machine Learning, Bioinformatics, Language Modelling