dc.contributor.author: Rakib, Md Rashadul Hasan
dc.date.accessioned: 2023-12-19T16:00:29Z
dc.date.available: 2023-12-19T16:00:29Z
dc.date.issued: 2023-12-18
dc.identifier.uri: http://hdl.handle.net/10222/83315
dc.description.abstract: This thesis focuses on the efficient clustering of short texts, along with an application to finding duplicate questions in Stack Overflow. In the first part of this thesis, we discuss static and dynamic clustering methods for short text corpora. In the second part, we discuss how static and dynamic clustering of short texts can be applied to find duplicate questions in Stack Overflow. Short text clustering is an important but challenging task due to the lack of context contained in short texts. In our first work, we overcome this problem by representing texts using word embeddings, which allow us to capture similarity between texts sharing few or no common words. In addition, we investigate the impact of similarity matrix sparsification on the performance of short text clustering. In our second work, we improve the clustering result obtained from our first work by removing outliers from clusters and reclassifying them to proper clusters. This is repeated several times until the cluster partitions stabilize. In our first and second works, we cluster static collections of short texts where the number of clusters to be produced is known. However, short texts are continuously generated in real time and in large volumes from different sources. This motivates us to develop an efficient dynamic clustering method that creates or updates the clusters when the text collection changes over time (e.g., new tweets arrive or new questions are posted on a question-answering site). In this method, we index clusters to reduce the number of similarity computations needed when assigning a text to a cluster. Using our dynamic clustering method along with static clustering, we cluster Stack Overflow questions as they arrive over time. Using the clusters of questions, we recommend potential duplicates of a newly posted question.
Experimental studies demonstrate that both our static and dynamic short text clustering methods outperform existing state-of-the-art methods in terms of clustering quality and running time on several short text datasets. We also demonstrate that, by using the clusters obtained by our clustering method, we find more duplicate questions than an existing duplicate question finding system. [en_US]
dc.language.iso: en [en_US]
dc.subject: SHORT TEXT CLUSTERING [en_US]
dc.subject: STREAM CLUSTERING [en_US]
dc.subject: DUPLICATE QUESTION FINDING [en_US]
dc.subject: STACK OVERFLOW [en_US]
dc.title: EFFICIENT CLUSTERING OF SHORT TEXT STREAMS WITH AN APPLICATION TO FIND DUPLICATE QUESTIONS IN STACK OVERFLOW [en_US]
dc.type: Thesis [en_US]
dc.date.defence: 2023-11-22
dc.contributor.department: Faculty of Computer Science [en_US]
dc.contributor.degree: Doctor of Philosophy [en_US]
dc.contributor.external-examiner: Diana Inkpen [en_US]
dc.contributor.thesis-reader: Fernando Paulovich [en_US]
dc.contributor.thesis-reader: Vlado Keselj [en_US]
dc.contributor.thesis-supervisor: Norbert Zeh [en_US]
dc.contributor.thesis-supervisor: Evangelos Milios [en_US]
dc.contributor.manuscripts: Yes [en_US]
dc.contributor.copyright-release: Not Applicable [en_US]