TEXT DOCUMENT SIMILARITIES BASED ON WIKIPEDIA CONCEPT RELATEDNESS

Wang, Xiangru

Wang-Xiangru-MCSc-CSCI-July-2015.pdf (3.356Mb)

Date

2015

Abstract

Traditionally, text document similarity is measured by lexical overlap between documents. Documents are represented as a bag of words (BOW), which ignores the relatedness among terms. One existing way to address this problem is to enhance the BOW representation with external resources: background knowledge derived from those resources is used to represent documents as a bag of concepts (BOC), which is then used along with, or instead of, BOW to form a new representation. However, this approach treats concepts as independent, which is known as the orthogonality assumption. This work focuses on developing new semantic similarity measures. Using Wikipedia as the knowledge resource to create a BOC model, we compute document similarities by following different concept mapping procedures combined with concept relatedness. We evaluate the proposed measures in text clustering. Experimental results show that our BOC-based similarity method can improve clustering performance.
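
To illustrate the orthogonality assumption mentioned in the abstract, here is a minimal sketch in Python. It is not the measure developed in the thesis. It assumes a generic relatedness-weighted cosine, and the concept names, relatedness scores, and function names are hypothetical values chosen only for illustration.

```python
from math import sqrt

# Hypothetical concept vocabulary and pairwise relatedness scores.
# These numbers are made up for illustration; they are not from the thesis.
CONCEPTS = ["Car", "Automobile", "Banana"]
RELATEDNESS = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]


def bilinear(a, b, rel):
    """Return the sum over all i, j of a[i] * rel[i][j] * b[j]."""
    return sum(
        a[i] * rel[i][j] * b[j]
        for i in range(len(a))
        for j in range(len(b))
    )


def relatedness_similarity(a, b, rel):
    """Cosine-style similarity that credits related (not only identical) concepts."""
    denom = sqrt(bilinear(a, a, rel) * bilinear(b, b, rel))
    return bilinear(a, b, rel) / denom if denom else 0.0


def cosine(a, b):
    """Plain cosine: the same formula with an identity relatedness matrix,
    i.e. every concept is treated as orthogonal to every other concept."""
    n = len(a)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return relatedness_similarity(a, b, identity)


# Bag-of-concepts vectors: one weight per entry in CONCEPTS.
doc1 = [1.0, 0.0, 0.0]  # mentions only "Car"
doc2 = [0.0, 1.0, 0.0]  # mentions only "Automobile"
doc3 = [0.0, 0.0, 1.0]  # mentions only "Banana"

plain = cosine(doc1, doc2)
related = relatedness_similarity(doc1, doc2, RELATEDNESS)
unrelated = relatedness_similarity(doc1, doc3, RELATEDNESS)
```

Under these assumed scores, plain cosine gives doc1 and doc2 a similarity of zero because they share no concept, while the relatedness-weighted variant gives them a high score and still keeps doc1 and doc3 at zero.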

URI

http://hdl.handle.net/10222/59903

Collections

Faculty of Graduate Studies Online Theses
