Authorship Attribution using Written and Read Documents
Date
2019-08-07T17:50:35Z
Authors
Gujarati, Afsan
Abstract
In Authorship Attribution (AA), the task of identifying the author of an unseen document, it is often hard to obtain large amounts of training text written by an author. In our research, we analyze how the size of the training data influences performance, and we propose a novel alternative: using the documents read by the authors for the AA task.
Although it becomes significantly more difficult to identify the author of an unseen document when less written data is available, classification performance can be drastically improved by using the documents read by the author. The Support Vector Machine method outperformed all the other classifiers in the presence of the read documents, with an average accuracy of 94.35%, a 23.57% increase after the addition of the read documents. Feature analysis showed that there is a semantic similarity between the written and the read documents, which played an important role in the improved performance.
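To illustrate the idea of supplementing scarce written text with read documents, here is a minimal sketch. It is not the method evaluated in this work: in place of a Support Vector Machine it uses a much simpler stand-in, matching an unseen text to the author whose character n-gram profile is closest by cosine similarity. The function names (`char_ngrams`, `cosine`, `build_profiles`, `predict`), the n-gram length of 3, and the toy texts are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(written, read=None, n=3):
    """Build one n-gram profile per author.

    `written` maps author -> list of texts written by that author.
    `read` (optional) maps author -> list of texts read by that author;
    when given, those texts are added to the author's profile.
    """
    profiles = {}
    for author, docs in written.items():
        profile = Counter()
        for doc in docs + (read or {}).get(author, []):
            profile.update(char_ngrams(doc, n))
        profiles[author] = profile
    return profiles


def predict(profiles, text, n=3):
    """Return the author whose profile is most similar to `text`."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda author: cosine(profiles[author], query))


# Toy data (invented for this sketch): very little written text per author.
written = {
    "alice": ["The harvest was good this year."],
    "bob": ["The market closed lower today."],
}
# Documents each author is assumed to have read.
read = {
    "alice": ["Soil nitrogen and crop rotation improve wheat yields."],
    "bob": ["Bond yields and equity prices fell after the rate decision."],
}
unseen = "Crop rotation helps the wheat."

profiles_with_read = build_profiles(written, read)
print(predict(profiles_with_read, unseen))
```

In this toy setup the unseen text shares vocabulary with what one author has read rather than with what either author has written, which is the kind of overlap between written and read documents that the abstract describes. A faithful reproduction would instead train a classifier such as an SVM on the study's actual features and data.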
Keywords
Authorship attribution, machine learning, document classification, natural language processing, n-grams approach, data processing, data collection, limited training data, read documents