dc.contributor.author | Manuele, Alexander | |
dc.date.accessioned | 2021-08-19T14:18:50Z | |
dc.date.available | 2021-08-19T14:18:50Z | |
dc.date.issued | 2021-08-19T14:18:50Z | |
dc.identifier.uri | http://hdl.handle.net/10222/80695 | |
dc.description.abstract | Next-generation DNA sequencing technologies have made marker-gene DNA sequence data widely available. Analysis of microbiome data poses many challenges, including sparsity, high cardinality, and intra-study dependencies during feature engineering. Language-modelling techniques may provide the means to overcome these challenges. The first step in sequence modelling is dividing the sequence into sensible tokens. We show that trained tokenization strategies, namely byte-pair encoding and unigram language modelling, can replace traditional sliding-window-based segmentation techniques for DNA marker genes in classification, clustering, and language-modelling tasks. We then propose a novel approach to feature representation of DNA marker genes: a training scheme that learns dense vector representations of DNA sequences using transformer language models optimized with pairwise DNA sequence alignment scores. We demonstrate that our representations match or exceed previously published approaches for the treatment of individual marker genes and of microbiome samples, while providing fixed-length, low-cardinality representations of each. | en_US |
dc.language.iso | en | en_US |
dc.subject | Representation Learning | en_US |
dc.subject | Machine Learning | en_US |
dc.subject | Bioinformatics | en_US |
dc.subject | Language Modelling | en_US |
dc.title | Novel Approaches to Marker Gene Representation Learning Using Trained Tokenizers and Jointly Trained Transformer Models | en_US |
dc.date.defence | 2021-08-16 | |
dc.contributor.department | Faculty of Computer Science | en_US |
dc.contributor.degree | Master of Computer Science | en_US |
dc.contributor.external-examiner | n/a | en_US |
dc.contributor.graduate-coordinator | Mike McAlister | en_US |
dc.contributor.thesis-reader | Vlado Keselj | en_US |
dc.contributor.thesis-reader | Sageev Oore | en_US |
dc.contributor.thesis-supervisor | Robert Beiko | en_US |
dc.contributor.ethics-approval | Not Applicable | en_US |
dc.contributor.manuscripts | Not Applicable | en_US |
dc.contributor.copyright-release | Not Applicable | en_US |