EFFICIENT CLUSTERING OF SHORT TEXT STREAMS WITH AN APPLICATION TO FIND DUPLICATE QUESTIONS IN STACK OVERFLOW

Rakib, Md Rashadul Hasan

dc.contributor.author	Rakib, Md Rashadul Hasan
dc.date.accessioned	2023-12-19T16:00:29Z
dc.date.available	2023-12-19T16:00:29Z
dc.date.issued	2023-12-18
dc.identifier.uri	http://hdl.handle.net/10222/83315
dc.description.abstract	This thesis focuses on the efficient clustering of short texts, along with an application to finding duplicate questions on Stack Overflow. In the first part of the thesis, we discuss static and dynamic clustering methods for short text corpora. In the second part, we discuss how static and dynamic clustering of short texts can be applied to find duplicate questions on Stack Overflow. Short text clustering is an important but challenging task because short texts contain little context. In our first work, we overcome this problem by representing texts using word embeddings, which allow us to capture similarity between texts that share few or no common words. In addition, we investigate the impact of similarity matrix sparsification on the performance of short text clustering. In our second work, we improve the clustering result obtained in our first work by removing outliers from clusters and reassigning them to the proper clusters. This is repeated until the cluster partitions stabilize. In our first and second works, we cluster static collections of short texts where the number of clusters to be produced is known. In practice, however, short texts are continuously generated in large volumes from different sources. This motivates us to develop an efficient dynamic clustering method that creates or updates clusters as the text collection changes over time (e.g., new tweets arrive or new questions are posted on a question-answering site). In this method, we index clusters to reduce the number of similarity computations needed to assign a text to a cluster. Using our dynamic clustering method along with static clustering, we cluster Stack Overflow questions as they arrive over time. Using the resulting clusters of questions, we recommend potential duplicates of a newly posted question.
Experimental studies demonstrate that both our static and dynamic clustering methods for short texts outperform existing state-of-the-art methods in terms of clustering quality and running time on several short text datasets. We also demonstrate that, by using the clusters obtained by our clustering method, we find more duplicate questions than an existing duplicate question finding system.	en_US
dc.language.iso	en	en_US
dc.subject	SHORT TEXT CLUSTERING	en_US
dc.subject	STREAM CLUSTERING	en_US
dc.subject	DUPLICATE QUESTION FINDING	en_US
dc.subject	STACK OVERFLOW	en_US
dc.title	EFFICIENT CLUSTERING OF SHORT TEXT STREAMS WITH AN APPLICATION TO FIND DUPLICATE QUESTIONS IN STACK OVERFLOW	en_US
dc.type	Thesis	en_US
dc.date.defence	2023-11-22
dc.contributor.department	Faculty of Computer Science	en_US
dc.contributor.degree	Doctor of Philosophy	en_US
dc.contributor.external-examiner	Diana Inkpen	en_US
dc.contributor.thesis-reader	Fernando Paulovich	en_US
dc.contributor.thesis-reader	Vlado Keselj	en_US
dc.contributor.thesis-supervisor	Norbert Zeh	en_US
dc.contributor.thesis-supervisor	Evangelos Milios	en_US
dc.contributor.manuscripts	Yes	en_US
dc.contributor.copyright-release	Not Applicable	en_US

Files in this item

Name: MdRashadulHasanRakib2023.pdf
Size: 814.7Kb
Format: PDF

This item appears in the following Collection(s)

Faculty of Graduate Studies Online Theses
