Novel Approaches to Marker Gene Representation Learning Using Trained Tokenizers and Jointly Trained Transformer Models

Manuele, Alexander

Files

AlexanderManuele2021.pdf (1.94 MB)

Date

2021-08-19T14:18:50Z

Authors

Manuele, Alexander

Abstract

Next-generation DNA sequencing technologies have made marker-gene DNA sequence data widely available. Analysis of microbiome data poses many challenges, including sparsity, high cardinality, and intra-study dependencies during feature engineering. Language-modelling techniques may provide the means to overcome these challenges. The first step in sequence modelling is dividing the sequence into sensible tokens. We show that two trained tokenization strategies, byte-pair encoding and unigram language modelling, can replace traditional sliding-window segmentation techniques for DNA marker genes in classification, clustering, and language-modelling tasks. We then propose a novel approach to feature representation of DNA marker genes: a training scheme that learns dense vector representations of DNA sequences using transformer language models optimized against pairwise DNA sequence alignment scores. We demonstrate that our representations match or exceed previously published approaches for the treatment of individual marker genes and of microbiome samples, while providing fixed-length, low-cardinality representations of each.
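As a rough illustration of the tokenization contrast described in the abstract, the following minimal sketch compares sliding-window k-mer segmentation with a toy byte-pair-encoding learner on DNA strings. It is not the thesis's implementation: the function names (kmer_tokens, learn_bpe, bpe_tokens, merge_pair), the example sequences, and the parameter values are all hypothetical and chosen only for illustration, and unigram language-model tokenization is not covered.

```python
from collections import Counter


def kmer_tokens(seq, k=3, stride=1):
    """Sliding-window segmentation: every length-k window, stepping by stride."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` with one concatenated token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(seqs, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    corpus = [list(s) for s in seqs]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break  # nothing left to merge
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        corpus = [merge_pair(toks, best) for toks in corpus]
    return merges


def bpe_tokens(seq, merges):
    """Tokenize a new sequence by replaying the learned merges in order."""
    toks = list(seq)
    for pair in merges:
        toks = merge_pair(toks, pair)
    return toks


# Hypothetical toy data, not drawn from the thesis.
merges = learn_bpe(["ACGACG", "ACGT"], num_merges=2)
window_tokens = kmer_tokens("ACGTAC", k=3)
trained_tokens = bpe_tokens("ACGACG", merges)
```

The sliding-window tokens overlap and their vocabulary is fixed by k, whereas the trained merges yield non-overlapping tokens whose vocabulary is learned from the data.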

Keywords

Representation Learning, Machine Learning, Bioinformatics, Language Modelling

URI

http://hdl.handle.net/10222/80695

Collections

Faculty of Graduate Studies Online Theses
