EFFICIENT CLUSTERING OF SHORT TEXT STREAMS WITH AN APPLICATION TO FIND DUPLICATE QUESTIONS IN STACK OVERFLOW

Date

2023-12-18

Authors

Rakib, Md Rashadul Hasan

Abstract

This thesis addresses the efficient clustering of short texts and applies it to finding duplicate questions in Stack Overflow. The first part covers static and dynamic clustering methods for short text corpora; the second part applies these methods to finding duplicate questions in Stack Overflow. Short text clustering is important but challenging because short texts carry little context. In our first work, we address this problem by representing texts with word embeddings, which capture similarity between texts that share few or no common words. We also investigate how sparsifying the similarity matrix affects the performance of short text clustering. In our second work, we improve the clustering obtained in the first work by removing outliers from clusters and reassigning them to more appropriate clusters, repeating this process until the cluster partitions stabilize. Both works cluster static collections of short texts for which the number of clusters is known in advance. In practice, however, short texts are generated continuously and in large volumes from many sources. This motivates an efficient dynamic clustering method that creates or updates clusters as the text collection changes over time (e.g., when new tweets arrive or new questions are posted on a question-answering site). In this method, we index clusters to reduce the number of similarity computations needed to assign a text to a cluster. Combining our dynamic and static clustering methods, we cluster Stack Overflow questions as they arrive and use the resulting clusters to recommend potential duplicates of a newly posted question.
Experimental studies on several short text datasets demonstrate that both our static and dynamic clustering methods outperform existing state-of-the-art methods in clustering quality and running time. We also demonstrate that, using the clusters produced by our clustering method, we find more duplicate questions than an existing duplicate question finding system.
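
The idea of indexing clusters so that an incoming text is compared with only a few of them can be illustrated with a minimal Python sketch. It is not the method developed in the thesis: the class name `IndexedStreamClusterer`, the inverted index from words to cluster ids, the Jaccard similarity over word sets, and the threshold value are all assumptions chosen for illustration.

```python
class IndexedStreamClusterer:
    """Toy streaming clusterer (illustrative assumption, not the thesis method)."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold   # assumed minimum similarity to join a cluster
        self.cluster_terms = []      # cluster id -> set of words seen in the cluster
        self.cluster_members = []    # cluster id -> list of texts in the cluster
        self.index = {}              # word -> set of cluster ids containing that word

    @staticmethod
    def _jaccard(a, b):
        # Word-set overlap; a stand-in for whatever similarity a real system uses.
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def add(self, text):
        """Assign one incoming text to a cluster and return the cluster id."""
        terms = set(text.lower().split())

        # The index restricts comparison to clusters sharing at least one word.
        candidates = set()
        for term in terms:
            candidates |= self.index.get(term, set())

        best_id, best_sim = None, 0.0
        for cid in candidates:
            sim = self._jaccard(terms, self.cluster_terms[cid])
            if sim > best_sim:
                best_id, best_sim = cid, sim

        if best_id is None or best_sim < self.threshold:
            # No sufficiently similar cluster: start a new one.
            best_id = len(self.cluster_terms)
            self.cluster_terms.append(set())
            self.cluster_members.append([])

        # Update the chosen cluster and the index.
        self.cluster_terms[best_id] |= terms
        self.cluster_members[best_id].append(text)
        for term in terms:
            self.index.setdefault(term, set()).add(best_id)
        return best_id

    def members(self, cid):
        """Texts in a cluster; earlier members are candidate duplicates of later ones."""
        return list(self.cluster_members[cid])
```

In this sketch, the other members of the cluster that a new question joins would serve as its candidate duplicates. A practical system would need a stronger similarity measure, such as one based on word embeddings, and some handling of very common words, which otherwise make almost every cluster a candidate.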

Keywords

SHORT TEXT CLUSTERING, STREAM CLUSTERING, DUPLICATE QUESTION FINDING, STACK OVERFLOW
