INTERACTIVE TERM SUPERVISED TEXT DOCUMENT CLUSTERING
Abstract
Text document clustering has broad applications in practice. For instance, a conference
chair must place accepted papers into meaningful sessions. Students writing a thesis,
and professors writing a proposal or planning a reading course, need to organize their
reference papers. Organizing documents into folders on a personal computer and
grouping emails into multiple inboxes are other instances of document clustering.
Unsupervised document clustering algorithms require no user effort, but the obtained
partitionings may be far from what the user intended to generate. User-supervised
clustering algorithms involve the user in the clustering process and let her decide
on the number and topics of document clusters. Generating useful clusters with
minimum user effort is the main challenge in this mode. To address this challenge,
we propose a user-supervised clustering algorithm, designed in three stages. First,
we design a novel unsupervised clustering algorithm that can be easily extended into
a user-supervised algorithm, thanks to its double clustering approach. We evaluate
its performance against state-of-the-art clustering algorithms in unsupervised mode.
We also extend this algorithm into an ensemble algorithm to incorporate Wikipedia
concepts in document representation. We demonstrate that the integration can improve
the quality of document clusters, even though representing documents by Wikipedia
concepts alone may yield clusterings inferior to those of a bag-of-words representation.
Second, we propose three user-supervised versions of our clusterer, based on term
supervision (in the form of term labeling), document supervision, and dual supervision.
We then demonstrate that with a comparable amount of simulated user effort, our
proposed term labeling is more effective than a baseline term selection method. Third,
we propose a graphical interface to support our term-supervised clusterer in interaction
with human users. We then conduct a user study to evaluate the interface and its
underlying clusterer. Analyzing the participants’ opinions and comments reveals the
usefulness of the proposed term-supervised clustering algorithm.
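
To make the idea of term labeling concrete, the following is a minimal sketch, not the algorithm proposed in this work. It assumes the user has already assigned a few terms to each intended cluster, and it places each document in the cluster whose labeled terms it matches best under cosine similarity. The function name `term_supervised_clusters`, the whitespace tokenization, and the sample documents and term labels are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def term_supervised_clusters(docs, term_labels):
    """Assign each document to the cluster whose user-labeled terms it matches best.

    docs: list of strings (tokenized here by lowercasing and splitting on whitespace).
    term_labels: dict mapping cluster name -> list of lowercase terms labeled by the user.
    Returns one cluster name per document, or None when no labeled term occurs in it.
    """
    assignments = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        doc_norm = math.sqrt(sum(c * c for c in counts.values()))
        best, best_score = None, 0.0
        for cluster, terms in term_labels.items():
            unique_terms = set(terms)
            # Cosine similarity between the document's term counts and a
            # binary vector over the cluster's labeled terms.
            overlap = sum(counts[t] for t in unique_terms)
            denom = doc_norm * math.sqrt(len(unique_terms))
            score = overlap / denom if denom else 0.0
            if score > best_score:
                best, best_score = cluster, score
        assignments.append(best)
    return assignments


# Hypothetical input: three paper titles and the terms a user labeled per session.
docs = [
    "kmeans clustering of text documents",
    "neural networks for image classification",
    "user interface study with participants",
]
term_labels = {
    "clustering": ["clustering", "documents"],
    "vision": ["image", "classification"],
    "hci": ["user", "interface", "participants"],
}
labels = term_supervised_clusters(docs, term_labels)
print(labels)
```

In this sketch the user's effort is limited to labeling a handful of terms per cluster, which is the kind of supervision the abstract contrasts with document labeling. A document that contains none of the labeled terms is left unassigned (`None`) rather than forced into a cluster.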