TEXT DOCUMENT SIMILARITIES BASED ON WIKIPEDIA CONCEPT RELATEDNESS
Traditionally, text document similarity is computed from the lexical overlap between documents. Documents are represented as a bag of words (BOW), which ignores the relatedness among terms. One existing way to address this problem is to enhance the BOW representation with external resources: background knowledge derived from those resources is used to represent each document as a bag of concepts (BOC), which is then used together with, or instead of, the BOW to form a new representation. However, this approach assumes that concepts are independent of one another, which is known as the orthogonality assumption. This work focuses on developing new semantic similarity measures. Using Wikipedia as the knowledge resource to build a BOC model, we compute document similarities by following different concept mapping procedures combined with concept relatedness. We evaluate the proposed measures on text clustering. Experimental results show that our BOC-based similarity method can improve clustering performance.
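
To make the orthogonality assumption concrete, the following minimal Python sketch compares a plain cosine similarity, which treats every concept as independent, with a relatedness-weighted ("soft") cosine over bag-of-concepts vectors. It is only an illustration and not necessarily one of the measures developed in this work: the concept names, the relatedness values, and the function names are all hypothetical, and a real system would derive concept weights and relatedness scores from Wikipedia. The sketch assumes non-negative concept weights and relatedness values in [0, 1].

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two sparse vectors stored as dicts.

    Concepts are treated as orthogonal: only identical keys contribute.
    """
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relatedness(c1, c2, table):
    """Look up a symmetric relatedness score; a concept is fully related to itself."""
    if c1 == c2:
        return 1.0
    return table.get(frozenset((c1, c2)), 0.0)


def bilinear(a, b, table):
    """Sum of a_i * R_ij * b_j over all concept pairs (i, j)."""
    return sum(
        wa * wb * relatedness(ca, cb, table)
        for ca, wa in a.items()
        for cb, wb in b.items()
    )


def soft_cosine(a, b, table):
    """Cosine similarity that also credits related, non-identical concepts."""
    denom = math.sqrt(bilinear(a, a, table) * bilinear(b, b, table))
    return bilinear(a, b, table) / denom if denom else 0.0


# Hypothetical bag-of-concepts vectors (concept -> weight) for two documents.
doc_a = {"Guitar": 1.0, "Concert": 1.0}
doc_b = {"Piano": 1.0, "Orchestra": 1.0}

# Hypothetical pairwise concept relatedness scores (assumed, not measured).
related = {
    frozenset(("Guitar", "Piano")): 0.6,
    frozenset(("Concert", "Orchestra")): 0.7,
}

print(cosine(doc_a, doc_b))                 # 0.0: no concept is shared
print(soft_cosine(doc_a, doc_b, related))   # about 0.65: related concepts count
```

Under the orthogonality assumption the two documents look completely dissimilar because they share no concept, while the relatedness-weighted variant assigns them a clearly non-zero similarity.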