Show simple item record

dc.contributor.author: Jankowska, Magdalena
dc.date.accessioned: 2017-04-27T15:37:03Z
dc.date.available: 2017-04-27T15:37:03Z
dc.date.issued: 2017-04-27T15:37:03Z
dc.identifier.uri: http://hdl.handle.net/10222/72872
dc.description.abstract: We describe our research on text analytics methods for detecting differences and similarities in the style of authors of text documents. Automatic methods for analyzing the written style of authors have applications in the domains of forensics, plagiarism detection, security, and literary research. We present our method for the problem of authorship verification, that is, the problem of deciding whether a certain text was written by a specific person, given samples of their writing. Our proximity-based one-class classifier method is evaluated on the multilingual dataset of the Author Identification competition of the PAN 2013 shared tasks on digital text forensics. A version of our method submitted to the task was the winner in the competition’s secondary evaluation. We also propose a visual analytics tool, RNG-Sig, for investigating differences and similarities between text documents at the level of features that have been shown to be powerful for identification of authorship, that is, at the level of character n-grams. The tool provides a visual interface for performing classification for authorship attribution (the task of deciding who among candidate authors wrote a considered text, based on samples of writing of the candidates) using the CNG classifier proposed by Keselj et al. RNG-Sig allows a user to visually interpret the inner workings of the classifier and to influence the classification process. Further, we systematically study authorship attribution in the situation where the writing samples of different candidates have different levels of topical similarity to the text being attributed. We investigate how such a condition influences the behaviour of two supervised classifiers on two sets of features commonly used for the task, and we show that supervised models are biased towards attributing a questioned document to a candidate whose writing samples are topically more similar to the document. We propose a method of character n-gram selection that alleviates this bias of classifiers. (en_US)
dc.language.iso: en (en_US)
dc.subject: text mining (en_US)
dc.subject: Visual texture recognition (en_US)
dc.subject: authorship analysis (en_US)
dc.subject: machine learning (en_US)
dc.subject: Data mining
dc.title: Author Style Analysis in Text Documents Based on Character and Word N-Grams (en_US)
dc.date.defence: 2017-04-19
dc.contributor.department: Faculty of Computer Science (en_US)
dc.contributor.degree: Doctor of Philosophy (en_US)
dc.contributor.external-examiner: Dr. Diana Inkpen (en_US)
dc.contributor.graduate-coordinator: Dr. Malcolm Heywood (en_US)
dc.contributor.thesis-reader: Dr. Stan Matwin (en_US)
dc.contributor.thesis-reader: Dr. Stephen Brooks (en_US)
dc.contributor.thesis-supervisor: Dr. Evangelos Milios (en_US)
dc.contributor.thesis-supervisor: Dr. Vlado Keselj (en_US)
dc.contributor.ethics-approval: Not Applicable (en_US)
dc.contributor.manuscripts: Yes (en_US)
dc.contributor.copyright-release: Yes (en_US)