EFFICIENT CLUSTERING OF SHORT TEXT STREAMS WITH AN APPLICATION TO FIND DUPLICATE QUESTIONS IN STACK OVERFLOW
Date
2023-12-18
Authors
Rakib, Md Rashadul Hasan
Abstract
This thesis focuses on the efficient clustering of short texts and on its application to finding duplicate questions in Stack Overflow. In the first part of the thesis, we
discuss static and dynamic clustering methods for short text corpora. In the second part, we discuss how static and dynamic clustering of short texts can be applied to
find duplicate questions in Stack Overflow.
Short text clustering is an important but challenging task because short texts carry little context. In our first work, we address this problem by representing
texts with word embeddings, which allows us to capture similarity between texts that share few or no common words. In addition, we investigate the impact of similarity matrix
sparsification on the performance of short text clustering. In our second work, we improve the clustering result obtained from our first work by removing outliers from
clusters and reassigning them to more appropriate clusters. This step is repeated until the cluster partitions stabilize.
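As a rough illustration of the two ideas above (embedding-based similarity between texts with no shared words, and similarity matrix sparsification), a minimal Python sketch follows. It is not the implementation used in the thesis: the two-dimensional word vectors in `TOY_VECTORS` are made-up values, and averaging word vectors and keeping the top `k` entries per row are assumed choices, since the abstract does not specify the embedding model or the sparsification rule.

```python
import math

# Toy 2-d word vectors: hypothetical values for illustration only,
# standing in for a real pretrained embedding model.
TOY_VECTORS = {
    "python": (1.0, 0.1),
    "java": (0.9, 0.2),
    "code": (0.8, 0.3),
    "soccer": (0.1, 1.0),
    "football": (0.2, 0.9),
    "goal": (0.3, 0.8),
}


def embed(text):
    """Represent a text as the average of its known word vectors (assumed scheme)."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(texts):
    """Dense pairwise cosine similarity between text embeddings."""
    vecs = [embed(t) for t in texts]
    return [[cosine(u, v) for v in vecs] for u in vecs]


def sparsify_top_k(sim, k):
    """Keep the k largest off-diagonal entries in each row and zero the rest.

    Top-k per row is an assumed sparsification rule, chosen for simplicity.
    """
    n = len(sim)
    sparse = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )
        for j in others[:k]:
            sparse[i][j] = sim[i][j]
    return sparse


# None of these texts share a word, yet the embeddings pair them by topic.
texts = ["python", "java code", "soccer", "football goal"]
sparse = sparsify_top_k(similarity_matrix(texts), k=1)
```

With `k=1`, each row of `sparse` keeps only the single most similar other text, so "python" stays linked to "java code" and "soccer" to "football goal" while all cross-topic similarities are dropped before clustering.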
In our first and second works, we cluster static collections of short texts for which the number of clusters to be produced is known. In practice, however, short texts are
continuously generated in large volumes from different sources. This motivates us to develop an efficient dynamic clustering method that creates or updates clusters
as the text collection changes over time (e.g., new tweets arrive or new questions are posted on a question-answering site). In this method, we index clusters to reduce
the number of similarity computations needed when assigning a text to a cluster. Using our dynamic clustering method together with static clustering, we cluster Stack Overflow
questions as they arrive over time. Using the resulting clusters of questions, we recommend potential duplicates of a newly posted question.
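The indexing idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method developed in the thesis: the class name `StreamClusterer`, the inverted word-to-cluster index, the Jaccard similarity, and the `threshold` value are all hypothetical choices, since the abstract does not say how clusters are indexed or scored.

```python
from collections import defaultdict


class StreamClusterer:
    """Toy streaming clusterer (hypothetical design, for illustration only).

    An inverted index from words to cluster ids limits similarity
    computations to clusters that share at least one word with the new text.
    """

    def __init__(self, threshold=0.3):
        self.threshold = threshold      # assumed minimum Jaccard similarity
        self.vocab = []                 # cluster id -> set of words seen
        self.members = []               # cluster id -> list of texts
        self.index = defaultdict(set)   # word -> ids of clusters containing it

    def add(self, text):
        """Assign a text to a cluster; return (cluster id, earlier members)."""
        words = set(text.lower().split())

        # Candidate clusters come from the index, not from a full scan.
        candidates = set()
        for w in words:
            candidates |= self.index.get(w, set())

        # Score only the candidates with Jaccard similarity on word sets.
        best, best_sim = None, 0.0
        for cid in candidates:
            vocab = self.vocab[cid]
            sim = len(words & vocab) / len(words | vocab)
            if sim > best_sim:
                best, best_sim = cid, sim

        # Open a new cluster when nothing is similar enough.
        if best is None or best_sim < self.threshold:
            best = len(self.vocab)
            self.vocab.append(set())
            self.members.append([])

        # Earlier members of the chosen cluster are the duplicate candidates.
        duplicates = list(self.members[best])

        # Update the cluster and the index with the new text.
        self.vocab[best] |= words
        self.members[best].append(text)
        for w in words:
            self.index[w].add(best)
        return best, duplicates


clusterer = StreamClusterer(threshold=0.3)
first = clusterer.add("how to reverse a list in python")
second = clusterer.add("best pizza dough recipe")
third = clusterer.add("reverse a list python")
```

In this run the third question shares words only with the first cluster, so it is compared against one cluster instead of two, and the earlier question in that cluster is returned as its potential duplicate.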
Experimental studies demonstrate that both our static and dynamic clustering methods for short texts outperform existing state-of-the-art methods
in terms of clustering quality and running time on several short text datasets. We also demonstrate that, by using the clusters obtained by our clustering method,
we find more duplicate questions than an existing duplicate question finding system.
Keywords
SHORT TEXT CLUSTERING, STREAM CLUSTERING, DUPLICATE QUESTION FINDING, STACK OVERFLOW