Document Clustering with Dual Supervision
MetadataShow full item record
Nowadays, academic researchers maintain a personal library of papers, which they would like to organize based on their needs, e.g., research, projects, or courseware. Clustering techniques are often employed to achieve this goal by grouping the document collection into different topics. Unsupervised clustering does not require any user effort but only produces one universal output with which users may not be satisfied. Therefore, document clustering needs user input for guidance to generate personalized clusters for different users. Semi-supervised clustering incorporates prior information and has the potential to produce customized clusters. Traditional semi-supervised clustering is based on user supervision in the form of labeled instances or pairwise instance constraints. However, alternative forms of user supervision exist such as labeling features. For document clustering, document supervision involves labeling documents while feature supervision involves labeling features. Their joint of use has been called dual supervision. In this thesis, we first explore and propose a framework to use feature supervision for interactive feature selection by indicating whether a feature is useful for clustering. Second, we enhance the semi-supervised clustering with feature supervision using feature reweighting. Third, we propose a unified framework to combine document supervision and feature supervision through seeding. The newly proposed algorithms are evaluated using oracles and demonstrated to be more helpful in producing better clusters matching a single user's point of view than document clustering without any supervision and with only document supervision. Finally, we conduct a user study to confirm that different users have different understandings of the same document collection and prefer personalized clusters. At the same time, we demonstrate that document clustering with dual supervision is able to produce good personalized clusters even with noisy user input. Dual supervision is also demonstrated to be more effective in personalized clustering than no supervision or any single supervision. We also analyze users' behaviors during the user study and present suggestions for the design of document management software.