Compromised Tweet Detection Using weighted sub-word embeddings

Joshi, Mihir

Files

Joshi-Mihir-MCS-CSCI-July-2019.pdf (1000.37 KB)

Date

2019-08-07T18:18:46Z

Authors

Joshi, Mihir

Abstract

Extracting features and writing style from short text messages is a persistent challenge for compromised tweet detection. Short messages, such as tweets, do not have enough data for statistical authorship attribution. In addition, the vocabulary used in these texts is often improvised or misspelled. Therefore, in this thesis, I propose combining four feature extraction techniques, namely character n-grams, word n-grams, Flexible Patterns, and a new sub-word embedding based on the skip-gram model. The proposed system feeds these features into a Multi-Layer Perceptron to analyze short text messages. It achieves 85% accuracy, which is a considerable improvement over previous systems. Furthermore, Siamese networks are employed to model the representation of user tweets in order to identify authors from a limited amount of ground truth data. The results show that the proposed system achieves promising accuracy as the number of authors increases.
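
As a rough illustration of two of the feature types named in the abstract, the sketch below extracts character n-grams and word n-grams from a single tweet and merges their counts into one feature map. It is not taken from the thesis: the function names, the n-gram sizes (3 and 2), the lowercasing, the whitespace tokenization, and the "c:"/"w:" prefixes are all assumptions made for the example. Flexible Patterns, the sub-word embeddings, and the Multi-Layer Perceptron are not shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams, splitting on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_features(text, char_n=3, word_n=2):
    """Count character and word n-grams in one map (hypothetical helper).

    The prefixes keep the two feature families from colliding.
    """
    text = text.lower()  # assumption: case is ignored
    features = Counter()
    features.update("c:" + g for g in char_ngrams(text, char_n))
    features.update("w:" + g for g in word_ngrams(text, word_n))
    return features


feats = combined_features("gr8 day at the beach")
```

Character n-grams such as "gr8" still capture improvised or misspelled spellings that a word-level vocabulary might miss, which is one plausible reason to combine both granularities.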

Keywords

Natural Language Processing, Machine Learning, Security Management

URI

http://hdl.handle.net/10222/76216

Collections

Faculty of Graduate Studies Online Theses
