TEXT DOCUMENT SIMILARITIES BASED ON WIKIPEDIA CONCEPT RELATEDNESS

Date

2015

Authors

Wang, Xiangru

Abstract

Traditionally, text document similarity is computed from the lexical overlap between documents. Documents are represented with the bag-of-words (BOW) model, which ignores the relatedness among terms. One existing way to address this problem is to enrich the BOW representation with external resources: background knowledge derived from those resources is used to represent each document as a bag of concepts (BOC), and the BOC is then used alongside or instead of the BOW to form a new representation. However, this approach still treats concepts as independent of one another, which is known as the orthogonality assumption. This work focuses on developing new semantic similarity measures. Using Wikipedia as the knowledge resource to build a BOC model, we compute document similarities by following different concept mapping procedures combined with concept relatedness. We evaluate the proposed measures in text clustering. Experimental results show that our BOC-based similarity method can improve clustering performance.
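
To illustrate what relaxing the orthogonality assumption can mean in practice, the sketch below compares two bag-of-concepts vectors with and without a concept-relatedness matrix. It is not the measure proposed in this work, whose details the abstract does not give. It uses a generic "soft cosine" formulation as a stand-in, and the function name `soft_cosine`, the three example concepts, and the relatedness value 0.6 are all hypothetical.

```python
import math


def soft_cosine(a, b, relatedness):
    """Cosine similarity between two bag-of-concepts vectors, using the
    generalised inner product a^T S b, where S holds pairwise concept
    relatedness. S is assumed symmetric and positive semidefinite."""

    def inner(x, y):
        return sum(
            x[i] * relatedness[i][j] * y[j]
            for i in range(len(x))
            for j in range(len(y))
        )

    denom = math.sqrt(inner(a, a)) * math.sqrt(inner(b, b))
    return inner(a, b) / denom if denom else 0.0


# Hypothetical concepts: 0 = "Car", 1 = "Automotive industry", 2 = "Banana".
# The relatedness scores are made up for illustration only.
S = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# The identity matrix reproduces the orthogonality assumption:
# each concept is related only to itself.
IDENTITY = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

doc_a = [1.0, 0.0, 0.0]  # mentions only "Car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "Automotive industry"

plain = soft_cosine(doc_a, doc_b, IDENTITY)  # concepts treated as independent
related = soft_cosine(doc_a, doc_b, S)       # concept relatedness included
```

With independent concepts the two documents share nothing and `plain` is zero. Once relatedness is taken into account, `related` is positive even though the documents have no concept in common.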

Keywords

Wikipedia, Document Clustering, Semantic Similarity
