Author Style Analysis in Text Documents Based on Character and Word N-Grams
Date
2017-04-27T15:37:03Z
Authors
Jankowska, Magdalena
Abstract
We describe our research on text analytics methods for detecting differences and
similarities in the style of authors of text documents. Automatic methods for analyzing
the written style of authors have applications in the domains of forensics, plagiarism
detection, security, and literary research. We present our method for the problem of
authorship verification, that is, the problem of deciding whether a certain text was
written by a specific person, given samples of their writing. Our proximity-based
one-class classifier method is evaluated on a multilingual dataset of the Author
Identification competition of PAN 2013 shared tasks on digital text forensics. A version
of our method submitted to the task was the winner in the competition’s secondary
evaluation. We also propose a visual analytics tool, RNG-Sig, for investigating
differences and similarities between text documents at the level of features that have
been shown to be powerful for identifying authorship, that is, at the level of
character n-grams. The tool provides a visual interface for performing classification for
authorship attribution (the task of deciding who among candidate authors wrote a
considered text, based on samples of writing of the candidates) using the CNG classifier
proposed by Keselj et al. RNG-Sig allows the user to visually interpret the inner
workings of the classifier and to influence the classification process.
Further, we systematically study authorship attribution in the situation when the writing
samples of different candidates differ in their topical similarity to the text
being attributed. We investigate how such a condition influences the behaviour of
two supervised classifiers on two sets of features commonly used for the task, and we
show that supervised models are biased towards attributing a questioned document
to the candidate whose writing samples are topically more similar to the document. We
propose a method of character n-gram selection that alleviates this bias of classifiers.
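
As an illustration of the character n-gram approach mentioned in the abstract, the following is a minimal sketch of a CNG-style classifier in the spirit of Keselj et al.: each text is reduced to a profile of its most frequent character n-grams with relative frequencies, and a questioned text is attributed to the candidate whose profile is least dissimilar. The function names, the default n-gram length `n=3`, and the profile size `size=500` are assumptions made for this example; this is not the thesis implementation, and it does not cover RNG-Sig or the n-gram selection method.

```python
from collections import Counter


def profile(text: str, n: int = 3, size: int = 500) -> dict[str, float]:
    """Map the `size` most frequent character n-grams to relative frequencies."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    if total == 0:  # text shorter than n characters
        return {}
    return {g: c / total for g, c in grams.most_common(size)}


def cng_dissimilarity(p1: dict[str, float], p2: dict[str, float]) -> float:
    """Sum of squared relative frequency differences over the union of profiles."""
    total = 0.0
    for g in p1.keys() | p2.keys():
        f1 = p1.get(g, 0.0)
        f2 = p2.get(g, 0.0)
        # f1 + f2 > 0 because g occurs in at least one profile.
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def attribute(questioned: str, samples: dict[str, str],
              n: int = 3, size: int = 500) -> str:
    """Return the candidate whose sample profile is closest to the questioned text."""
    q = profile(questioned, n, size)
    return min(
        samples,
        key=lambda author: cng_dissimilarity(q, profile(samples[author], n, size)),
    )
```

In this sketch, an n-gram that appears in only one of the two profiles contributes the maximum value of 4 to the dissimilarity, so texts with little shared vocabulary or topic end up far apart. That property is one way to see why topical similarity between a candidate's samples and the questioned text can influence attribution.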
Keywords
text mining, visual texture recognition, authorship analysis, machine learning, data mining