ADDRESSING CHALLENGES OF TWITTER FEATURE ENGINEERING FOR MACHINE LEARNING IN DIFFERENT DOMAINS

Date

2023-08-08

Authors

Balfagih, Ahmed

Abstract

Feature engineering is one of the essential steps in machine learning: it helps build a better data model and leads to reasonable learning results. In natural language processing, feature engineering becomes even more critical when many features can be generated and fed to the model. Twitter data has become an important source of written natural language for training and testing machine learning models, and preparing features for these models is a key step. In this study, we examine different feature engineering approaches for three tasks on Twitter. In all three studies, we compare the efficiency of word embedding features against n-gram features generated from the tweets by applying machine learning algorithms and measuring the resulting accuracy. First, we generate tweet features for financial forecasting; second, we generate features for spam recognition in tweets; and finally, we generate features to learn different dialects of a language. In the financial forecasting task, we explore the relationship between the Twitter feed on Bitcoin, its sentiment analysis, and Bitcoin price prediction. Two approaches are used to generate features from tweets: tweet embeddings and n-gram modeling. Using a time-series approach, we find a partial correlation between fluctuations in the Bitcoin price and fluctuations in sentiment classification accuracy across different machine learning algorithms. In the spam recognition task, we apply machine learning techniques to identify spam in Arabic-language tweets, using two feature generation techniques: n-grams and Word2Vec embeddings. The experimental results show that Word2Vec embeddings improve over n-grams on the more balanced datasets compared with the more unbalanced ones. Finally, in the dialect recognition task, we use tf-idf n-grams, AraVec embeddings, and features from fine-tuned BERT models on Arabic tweets to locate where a tweet came from within the Arab region based on dialect learning. The results show that BERT-based pretrained language models, specifically AraBERT and MARBERT, are more powerful for this task than the other methods.
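
As a rough illustration of the comparison the abstract describes, the following minimal Python sketch (not the thesis code) contrasts tf-idf n-gram features with averaged Word2Vec embeddings on a tiny, made-up tweet dataset for a binary spam-style classification task. The toy corpus, the classifier choice, and all parameter values are illustrative placeholders, not the thesis setup.

    # Minimal sketch: tf-idf n-gram features vs. averaged Word2Vec embeddings
    # for a toy binary tweet-classification task. Data and parameters are
    # hypothetical placeholders for illustration only.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from gensim.models import Word2Vec

    # Hypothetical toy corpus: (tweet text, label) with 1 = spam, 0 = not spam.
    tweets = [
        ("win a free prize click this link now", 1),
        ("just watched the bitcoin price jump again", 0),
        ("limited offer follow and retweet to win cash", 1),
        ("great discussion about regional dialects today", 0),
    ]
    texts = [t for t, _ in tweets]
    labels = [y for _, y in tweets]

    X_train_txt, X_test_txt, y_train, y_test = train_test_split(
        texts, labels, test_size=0.5, random_state=0, stratify=labels)

    # Approach 1: tf-idf over word n-grams (unigrams + bigrams).
    tfidf = TfidfVectorizer(ngram_range=(1, 2))
    X_train_ngram = tfidf.fit_transform(X_train_txt)
    X_test_ngram = tfidf.transform(X_test_txt)
    clf_ngram = LogisticRegression().fit(X_train_ngram, y_train)
    print("n-gram accuracy:",
          accuracy_score(y_test, clf_ngram.predict(X_test_ngram)))

    # Approach 2: Word2Vec embeddings, averaged per tweet.
    tokenized = [t.split() for t in X_train_txt]
    w2v = Word2Vec(tokenized, vector_size=50, window=3, min_count=1, seed=0)

    def embed(text, model, dim=50):
        # Average the vectors of known words; use zeros if none are known.
        vecs = [model.wv[w] for w in text.split() if w in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    X_train_emb = np.vstack([embed(t, w2v) for t in X_train_txt])
    X_test_emb = np.vstack([embed(t, w2v) for t in X_test_txt])
    clf_emb = LogisticRegression().fit(X_train_emb, y_train)
    print("embedding accuracy:",
          accuracy_score(y_test, clf_emb.predict(X_test_emb)))

In a realistic setting, the same pattern would be run on a full tweet corpus with cross-validation, and the accuracy of the two feature types compared across balanced and unbalanced splits.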

Keywords

Natural Language Processing, Feature Engineering, Social Media Data
