Authorship Attribution using Written and Read Documents

Date

2019-08-07T17:50:35Z

Authors

Gujarati, Afsan

Abstract

In Authorship Attribution (AA), the task of identifying the author of an unseen document, it is often hard to obtain large amounts of training text written by a given author. In this research, we analyze how the size of the training data affects performance, and we propose a novel alternative: using documents read by the authors for the AA task. Although identifying the author of an unseen document becomes significantly more difficult with less written data, classification performance can be drastically improved by adding the documents read by the author. The Support Vector Machine outperformed all other classifiers in the presence of the read documents, with an average accuracy of 94.35%, a 23.57% increase after the addition of the read documents. Feature analysis showed that a semantic similarity exists between the written and the read documents, and that this similarity played an important role in the improved performance.

Keywords

Authorship attribution, machine learning, document classification, natural language processing, n-grams approach, data processing, data collection, limited training data, read documents
