Author Style Analysis in Text Documents Based on Character and Word N-Grams
We describe our research on text analytics methods for detecting differences and similarities in the writing style of authors of text documents. Automatic methods for analyzing authors' writing style have applications in forensics, plagiarism detection, security, and literary research. We present our method for authorship verification, that is, the problem of deciding whether a given text was written by a specific person, given samples of that person's writing. Our proximity-based one-class classifier is evaluated on the multilingual dataset of the Author Identification competition of the PAN 2013 shared tasks on digital text forensics. A version of our method submitted to the task was the winner in the competition’s secondary evaluation. We also propose a visual analytics tool, RNG-Sig, for investigating differences and similarities between text documents at the level of features that have been shown to be powerful for authorship identification, namely character n-grams. The tool provides a visual interface for performing authorship attribution (the task of deciding who among candidate authors wrote a considered text, based on samples of the candidates' writing) using the CNG classifier proposed by Keselj et al. RNG-Sig allows a user to visually interpret the inner workings of the classifier and to influence the classification process. Further, we systematically study authorship attribution in the situation where the writing samples of different candidates have different levels of topical similarity to the text being attributed. We investigate how this condition influences the behaviour of two supervised classifiers on two sets of features commonly used for the task, and we show that supervised models are biased towards attributing a questioned document to a candidate whose writing samples are topically more similar to the document.
We propose a method of character n-gram selection that alleviates this bias of classifiers.
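
For readers unfamiliar with character n-gram profiles, the following is a minimal sketch of attribution in the style of the CNG classifier of Keselj et al. Each text is reduced to a profile of its most frequent character n-grams with relative frequencies, and a questioned text is assigned to the candidate whose profile is least dissimilar. The function names, the n-gram length of 3, and the profile size of 500 are illustrative assumptions, not settings taken from this work, and the sketch does not reproduce RNG-Sig or the proposed n-gram selection method.

```python
from collections import Counter


def ngram_profile(text, n=3, profile_size=500):
    """Map the `profile_size` most frequent character n-grams of `text`
    to their relative frequencies (illustrative defaults)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    most_common = Counter(grams).most_common(profile_size)
    return {gram: count / total for gram, count in most_common}


def cng_dissimilarity(p1, p2):
    """Sum, over the union of n-grams in both profiles, of the squared
    frequency difference normalised by the mean of the two frequencies."""
    total = 0.0
    for gram in set(p1) | set(p2):
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        # f1 + f2 > 0 because the n-gram occurs in at least one profile.
        total += ((f1 - f2) / ((f1 + f2) / 2.0)) ** 2
    return total


def attribute(questioned, candidates, n=3, profile_size=500):
    """Return the least dissimilar candidate and all dissimilarity scores.

    `candidates` maps a (hypothetical) author name to a writing sample.
    """
    q_profile = ngram_profile(questioned, n, profile_size)
    scores = {
        name: cng_dissimilarity(q_profile, ngram_profile(sample, n, profile_size))
        for name, sample in candidates.items()
    }
    return min(scores, key=scores.get), scores
```

An n-gram present in only one profile contributes the maximum value of 4 to the sum, so profiles with little overlap are pushed apart quickly. This also suggests why topical overlap between a questioned text and one candidate's samples can sway the decision.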