PERSONALIZED TOPIC MODELLING OF DOMAIN-SPECIFIC DOCUMENT COLLECTIONS

Basquarane, Bhuvaneshwari

dc.contributor.author	Basquarane, Bhuvaneshwari
dc.date.accessioned	2023-04-12T13:26:39Z
dc.date.available	2023-04-12T13:26:39Z
dc.date.issued	2023-04-10
dc.identifier.uri	http://hdl.handle.net/10222/82374
dc.description.abstract	Topic modelling refers to the discovery of abstract topics in a document collection. The topics are usually described by a statistical model of the probabilistic relationship between topics, documents, and words, typically the distribution of words within each topic and the distribution of topics within each document. A common criticism is that a collection admits several plausible sets of topics, so in this study we propose a personalizable topic modelling algorithm in which a user guides the method by suggesting edits to the statistical model. To do this, we build upon Top2Vec, a recent topic-modelling algorithm that represents documents by their embeddings and then defines topics as soft clusters of documents. In our approach, users provide feedback about the documents, which is then used to define a contrastive loss function for fine-tuning the pre-trained BERT model used to derive the document embeddings. This work makes two contributions. First, we encapsulate the Top2Vec algorithm within a probabilistic framework, which we call Probabilistic Top2Vec, to represent the topics in terms of the joint probabilities of words, documents, and topics. Second, we introduce two personalization techniques for guiding topic discovery: weaker word-level supervision, in which the user describes each topic with a few central words, and stronger document-level supervision, in which the user explicitly places a document in the desired topic cluster. We evaluate the model quantitatively on labelled datasets, using an oracle to simulate the user; these evaluations measure how well the model adapts to user feedback and help determine appropriate hyperparameters for the algorithm.
Based on our quantitative evaluations, providing even weak feedback to the model can result in topic modelling that better aligns with the user's preferences. These results can be further improved with document-level feedback. In addition, presenting the results of Top2Vec as probabilities should enable the user to clearly understand the discovered topics and then provide appropriate feedback to personalize the topic modelling result.	en_US
dc.language.iso	en_US	en_US
dc.subject	Topic Model	en_US
dc.subject	Personalization	en_US
dc.subject	Weak Supervision	en_US
dc.subject	Deep Learning	en_US
dc.subject	NLP	en_US
dc.subject	Contrastive Learning	en_US
dc.title	PERSONALIZED TOPIC MODELLING OF DOMAIN-SPECIFIC DOCUMENT COLLECTIONS	en_US
dc.type	Thesis	en_US
dc.date.defence	2023-04-05
dc.contributor.department	Faculty of Computer Science	en_US
dc.contributor.degree	Master of Computer Science	en_US
dc.contributor.external-examiner	n/a	en_US
dc.contributor.graduate-coordinator	Dr. Mike McAllister	en_US
dc.contributor.thesis-reader	Dr. Vlado Keselj	en_US
dc.contributor.thesis-reader	Dr. Ana Maguitman	en_US
dc.contributor.thesis-supervisor	Dr. Evangelos E. Milios	en_US
dc.contributor.thesis-supervisor	Dr. Axel Soto	en_US
dc.contributor.ethics-approval	Not Applicable	en_US
dc.contributor.manuscripts	Not Applicable	en_US
dc.contributor.copyright-release	Not Applicable	en_US

Files in this item

Name: BhuvaneshwariBasquarane2023.pdf
Size: 1.216Mb
Format: PDF

This item appears in the following Collection(s)

Faculty of Graduate Studies Online Theses