Novel Approaches to Marker Gene Representation Learning Using Trained Tokenizers and Jointly Trained Transformer Models

dc.contributor.author: Manuele, Alexander
dc.contributor.copyright-release: Not Applicable [en_US]
dc.contributor.degree: Master of Computer Science [en_US]
dc.contributor.department: Faculty of Computer Science [en_US]
dc.contributor.ethics-approval: Not Applicable [en_US]
dc.contributor.external-examiner: n/a [en_US]
dc.contributor.graduate-coordinator: Mike McAlister [en_US]
dc.contributor.manuscripts: Not Applicable [en_US]
dc.contributor.thesis-reader: Vlado Keselj [en_US]
dc.contributor.thesis-reader: Sageev Oore [en_US]
dc.contributor.thesis-supervisor: Robert Beiko [en_US]
dc.date.accessioned: 2021-08-19T14:18:50Z
dc.date.available: 2021-08-19T14:18:50Z
dc.date.defence: 2021-08-16
dc.date.issued: 2021-08-19T14:18:50Z
dc.description.abstract: Next-generation DNA sequencing technologies have made marker-gene DNA sequence data widely available. Analysis of microbiome data poses many challenges, including sparsity, high cardinality, and intra-study dependencies during feature engineering. Language-modelling techniques may provide the means to overcome these challenges. The first step in sequence modelling is dividing the sequence into sensible tokens. We show that trained tokenization strategies, namely byte-pair encoding and unigram language modelling, can replace traditional sliding-window-based segmentation techniques for DNA marker genes in classification, clustering, and language-modelling tasks. We then propose a novel approach to feature representation of DNA marker genes: a training scheme that learns dense vector representations of DNA sequences using transformer language models optimized with pairwise DNA sequence alignment scores. We demonstrate that our representations match or exceed previously published approaches for treatment of individual marker genes and of microbiome samples while providing fixed-length, low-cardinality representations of each. [en_US]
dc.identifier.uri: http://hdl.handle.net/10222/80695
dc.language.iso: en [en_US]
dc.subject: Representation Learning [en_US]
dc.subject: Machine Learning [en_US]
dc.subject: Bioinformatics [en_US]
dc.subject: Language Modelling [en_US]
dc.title: Novel Approaches to Marker Gene Representation Learning Using Trained Tokenizers and Jointly Trained Transformer Models [en_US]

Files

Original bundle

Name: AlexanderManuele2021.pdf
Size: 1.94 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission