Author Style Analysis in Text Documents Based on Character and Word N-Grams

Jankowska, Magdalena

dc.contributor.author	Jankowska, Magdalena
dc.date.accessioned	2017-04-27T15:37:03Z
dc.date.available	2017-04-27T15:37:03Z
dc.date.issued	2017-04-27T15:37:03Z
dc.identifier.uri	http://hdl.handle.net/10222/72872
dc.description.abstract	We describe our research on text analytics methods for detecting differences and similarities in the style of authors of text documents. Automatic methods for analyzing the written style of authors have applications in forensics, plagiarism detection, security, and literary research. We present our method for the problem of authorship verification, that is, the problem of deciding whether a certain text was written by a specific person, given samples of their writing. Our proximity-based one-class classifier method is evaluated on the multilingual dataset of the Author Identification competition of the PAN 2013 shared tasks on digital text forensics. A version of our method submitted to the task was the winner in the competition’s secondary evaluation. We also propose a visual analytics tool, RNG-Sig, for investigating differences and similarities between text documents at the level of features that have been shown to be powerful for identification of authorship, that is, at the level of character n-grams. The tool provides a visual interface for performing classification for authorship attribution (the task of deciding who among candidate authors wrote a considered text, based on samples of writing of the candidates) using the CNG classifier proposed by Keselj et al. RNG-Sig allows for visual interpretation of the inner workings of the classifier and lets a user influence the classification process. Further, we systematically study authorship attribution in the situation when samples of writing of different candidates have different levels of topical similarity to the text being attributed. We investigate how such a condition influences the behaviour of two supervised classifiers on two sets of features commonly used for the task, and we show that supervised models are biased towards attributing a questioned document to a candidate whose writing samples are topically more similar to the document.
We propose a method of character n-gram selection that alleviates this bias of classifiers.	en_US
dc.language.iso	en	en_US
dc.subject	text mining	en_US
dc.subject	Visual texture recognition	en_US
dc.subject	authorship analysis	en_US
dc.subject	machine learning	en_US
dc.subject	Data mining
dc.title	Author Style Analysis in Text Documents Based on Character and Word N-Grams	en_US
dc.date.defence	2017-04-19
dc.contributor.department	Faculty of Computer Science	en_US
dc.contributor.degree	Doctor of Philosophy	en_US
dc.contributor.external-examiner	Dr. Diana Inkpen	en_US
dc.contributor.graduate-coordinator	Dr. Malcolm Heywood	en_US
dc.contributor.thesis-reader	Dr. Stan Matwin	en_US
dc.contributor.thesis-reader	Dr. Stephen Brooks	en_US
dc.contributor.thesis-supervisor	Dr. Evangelos Milios	en_US
dc.contributor.thesis-supervisor	Dr. Vlado Keselj	en_US
dc.contributor.ethics-approval	Not Applicable	en_US
dc.contributor.manuscripts	Yes	en_US
dc.contributor.copyright-release	Yes	en_US

Files in this item

Name: Jankowska-Magdalena-PhD-CSCI-A ...
Size: 2.135Mb
Format: PDF
Description: PhD thesis

This item appears in the following Collection(s)

Faculty of Graduate Studies Online Theses
