Using Named Entities in Post-click News Recommendation
Abstract
With the growth of online news readership, many news websites use different signals to attract users' initial clicks. However, the problem of keeping users on the website through post-click news recommendation is relatively underexplored. To address this problem, we try to find the news articles related to the one a user is currently reading based on the content of the articles, assuming no reading history or user profile. The problem closely resembles a typical information retrieval problem, in which the system retrieves documents related to a given query, ranked by a similarity function that produces a relatedness score between each document and the query. However, our experiments show that "relatedness" is not equivalent to similarity as defined in information retrieval. As a relatedness function, we used the semantic similarity of named entities extracted from the body of news articles in combination with lexical similarity functions available in information retrieval systems. A new system called Tulip was used for named entity recognition and disambiguation, and the word skip-gram model was used to measure the similarity of named entities. Tulip provides precise recognition of named entities and very fast response times. Additionally, a stochastic keyword extraction algorithm based on the Chinese restaurant process and the word skip-gram model was proposed to capture the topical similarity of two articles. To solve the problem practically, we proposed using the cosine similarity of TF-IDF vectors of articles as a filter to narrow down the search space, given one article as a query. We then applied the relatedness function to the results returned by cosine similarity. In other words, we proposed a relatedness function to re-rank the results returned by a typical retrieval system.
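The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, named entity extraction (done by Tulip in the original work) is assumed to have happened upstream, and a small embedding dictionary stands in for the skip-gram entity vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_filter(query_idx, docs, k=3):
    """Stage 1: narrow the search space by cosine similarity of
    TF-IDF vectors, returning the indices of the top-k candidates."""
    tfidf = TfidfVectorizer().fit_transform(docs)
    sims = cosine_similarity(tfidf[query_idx], tfidf).ravel()
    sims[query_idx] = -1.0          # exclude the query article itself
    return list(np.argsort(sims)[::-1][:k])

def entity_relatedness(entities_a, entities_b, emb):
    """Stage 2 scoring: mean pairwise cosine similarity of entity
    embeddings (a stand-in for the skip-gram vectors in the text)."""
    pairs = [(emb[a], emb[b]) for a in entities_a for b in entities_b
             if a in emb and b in emb]
    if not pairs:
        return 0.0
    return float(np.mean([
        np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        for u, v in pairs]))

def rerank(query_idx, docs, doc_entities, emb, k=3):
    """Re-rank the TF-IDF candidates by named-entity relatedness."""
    candidates = tfidf_filter(query_idx, docs, k)
    scored = [(entity_relatedness(doc_entities[query_idx],
                                  doc_entities[i], emb), i)
              for i in candidates]
    return [i for _, i in sorted(scored, reverse=True)]
```

The key design choice mirrored here is that the cheap lexical filter only bounds the candidate set; the final ordering comes from the semantic relatedness of the entities the articles mention, which the thesis argues is a better proxy for "related" than lexical similarity alone.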
Due to the nature of the problem and the available datasets, we proposed a graph-based, unsupervised approach for labeling pairs of documents during both training and testing. We trained and tested our method on two datasets against the cosine similarity of TF-IDF vectors as the baseline, before having it evaluated by domain experts. The model trained on our proposed features is demonstrated to outperform the baseline. Finally, we conducted a series of experiments to rank the importance of the different features. Based on our observations, the semantic similarity of named entities, together with the Information-Based lexical similarity included in Lucene, is more effective than the other lexical features and yields a better ranking of related news.