Author and Language Profiling of Short Texts

Date

2020-04-07T12:40:24Z

Authors

Kosmajac, Dijana

Abstract

Over the past couple of decades, the advancement and growth of digital information and communication technologies have resulted in an information explosion, and these technologies are profoundly changing all aspects of modern society. The popularization of the Internet and mobile technologies fueled the rise of social media, which provide technological platforms for information spreading, content generation, and interactive communication and have contributed to global data growth. Additionally, social media have become one of the main outlets for obtaining information about the latest news, people, businesses, services, etc. Research on social media has gained traction, given the growing interest in its applications and the related technical and social science challenges and opportunities. One of the big challenges of widespread online textual data is its structure and size. Structurally, the text is often not in proper grammatical form and contains slang, emoticons, and incomplete sentences, reflecting the way we communicate daily. Size-wise, the text is usually very short. However, this is not only the case with online data; medical notes, open-ended survey questions, and various old-school maintenance reports are just some other examples. We particularly focus on the problem of author profiling on short texts in three different domains. Automatic author profiling is a set of methods to determine an author's (or group of authors') gender, age, native language, personality type, and similar traits, which can be useful in different application contexts such as forensics, security, marketing, product personalisation, socio-demographic analyses, and so on. In the first task, we explore fine-grained language dialect/variety identification and propose a new feature weighting scheme. In the second task, we work on bot detection on social media and propose a simple but efficient method based on statistical diversity measures. In the third task, we present some interesting findings on topic modelling in relation to the author on open-ended survey questions from the Canadian Longitudinal Study on Aging (CLSA).

Keywords

Language Identification (LID), Social Media Bot Detection, Topic Modelling on Short Texts, Social Media Mining

Citation