Interactive text analytics for document clustering
Clustering is widely used to gain insight into text collections that contain more documents than a person can feasibly read. Although many document clustering algorithms exist, most do not take user preferences into account. Because document clustering is inherently personalized, even the best algorithms cannot produce clusters that accurately reflect each user's perspective. Clustering results must also be visualized so that a human can interpret them easily. In this thesis, we revisit the problem of document clustering in order to incorporate the user's perspective into the clustering process and to effectively visualize the data as it is being clustered, enhancing the user's sense-making of the data. First, we design clustering algorithms that are interactive and can adapt to the user's feedback. Second, we design a collection of coordinated visualization modules and a document projection to guide the user towards better insight into the document collection and the results of the clustering algorithm. It has been demonstrated that exploiting external knowledge sources such as Wikipedia can help a clustering algorithm account for the semantic similarity between documents. The process of linking the terms and phrases of a document to the related Wikipedia pages is called Wikification. To support Wikification, we introduce a model that extracts high-quality distributed vector representations for each Wikipedia page. Finally, we consider the temporal similarity between documents and introduce a couple of visualization modules that depict the temporal aspect of clusters, which enables us to study the dynamics of document clusters over time. A set of quantitative experiments, use cases, and a user study has been conducted on real-world datasets to show the advantages of interactive visual analytics clustering.