Novel Approaches to Marker Gene Representation Learning Using Trained Tokenizers and Jointly Trained Transformer Models

Manuele, Alexander

dc.contributor.author	Manuele, Alexander
dc.contributor.copyright-release	Not Applicable	en_US
dc.contributor.degree	Master of Computer Science	en_US
dc.contributor.department	Faculty of Computer Science	en_US
dc.contributor.ethics-approval	Not Applicable	en_US
dc.contributor.external-examiner	n/a	en_US
dc.contributor.graduate-coordinator	Mike McAlister	en_US
dc.contributor.manuscripts	Not Applicable	en_US
dc.contributor.thesis-reader	Vlado Keselj	en_US
dc.contributor.thesis-reader	Sageev Oore	en_US
dc.contributor.thesis-supervisor	Robert Beiko	en_US
dc.date.accessioned	2021-08-19T14:18:50Z
dc.date.available	2021-08-19T14:18:50Z
dc.date.defence	2021-08-16
dc.date.issued	2021-08-19T14:18:50Z
dc.description.abstract	Next-generation DNA sequencing technologies have made marker-gene DNA sequence data widely available. Analysis of microbiome data poses many challenges, including sparsity, high cardinality, and intra-study dependencies during feature engineering. Language-modelling techniques may provide the means to overcome these challenges. The first step in sequence modelling is dividing the sequence into sensible tokens. We show that trained tokenization strategies, byte-pair encoding and unigram language modelling, can replace traditional sliding-window-based segmentation techniques for DNA marker genes in classification, clustering, and language-modelling tasks. We then propose a novel approach to feature representation of DNA marker genes: a training scheme that learns dense vector representations of DNA sequences using transformer language models optimized on pairwise DNA sequence alignment scores. We demonstrate that our representations match or exceed previously published approaches for the treatment of individual marker genes and of microbiome samples while providing fixed-length, low-cardinality representations of each.	en_US
dc.identifier.uri	http://hdl.handle.net/10222/80695
dc.language.iso	en	en_US
dc.subject	Representation Learning	en_US
dc.subject	Machine Learning	en_US
dc.subject	Bioinformatics	en_US
dc.subject	Language Modelling	en_US
dc.title	Novel Approaches to Marker Gene Representation Learning Using Trained Tokenizers and Jointly Trained Transformer Models	en_US

Files

Original bundle

Name: AlexanderManuele2021.pdf
Size: 1.94 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission

Collections

Faculty of Graduate Studies Online Theses