Show simple item record

dc.contributor.author                 Balfagih, Ahmed
dc.date.accessioned                   2023-08-10T12:33:45Z
dc.date.available                     2023-08-10T12:33:45Z
dc.date.issued                        2023-08-08
dc.identifier.uri                     http://hdl.handle.net/10222/82771
dc.description.abstract               Feature engineering is one of the essential steps in machine learning: it helps to build a better data model and yields reasonable learning results. In natural language processing, feature engineering becomes even more critical when many candidate features can be generated and fed to the model. Twitter data has become an important source of written natural language for training and testing machine learning models, and preparing features for these models is an important step. In this study, we examine different feature engineering approaches to three tasks on Twitter. In all three tasks, we compare the effectiveness of word embedding features against n-gram features generated from the tweets, applying machine learning algorithms and measuring the resulting accuracy. First, we generate features of tweets for financial forecasting; second, we generate features for spam recognition in tweets; and finally, we generate features for learning different dialects of a language. In the financial forecasting task, we explore the relationship between the Twitter feed on Bitcoin, its sentiment analysis, and its price prediction, using two approaches to generate features from tweets: tweet embedding and n-gram modeling. Using a time-series approach, we found a partial correlation between Bitcoin price fluctuations and fluctuations in sentiment-class accuracy across different machine learning algorithms. In the spam recognition task, we apply machine learning techniques to identify spam in Arabic-language tweets, using two feature generation techniques: n-grams and Word2Vec embeddings. The experimental results show an improvement from Word2Vec embeddings over n-grams on the more balanced datasets compared with the more unbalanced ones. Finally, in the dialect recognition task, we use tf-idf n-grams, AraVec embeddings, and features generated by fine-tuned BERT models on Arabic tweets to identify, through dialect learning, where in the Arab region a tweet came from. The results show that BERT-based pretrained language models, specifically AraBERT and MARBERT, are more powerful for this task than the other methods.  en_US
dc.language.iso                       en  en_US
dc.subject                            Natural Language Processing  en_US
dc.subject                            Feature Engineering  en_US
dc.subject                            Social Media Data  en_US
dc.title                              ADDRESSING CHALLENGES OF TWITTER FEATURE ENGINEERING FOR MACHINE LEARNING IN DIFFERENT DOMAINS  en_US
dc.date.defence                       2023-07-14
dc.contributor.department             Faculty of Computer Science  en_US
dc.contributor.degree                 Doctor of Philosophy  en_US
dc.contributor.external-examiner      Malek Mouhoub  en_US
dc.contributor.graduate-coordinator   Michael McAllister  en_US
dc.contributor.thesis-reader          Srinivas Sampalli  en_US
dc.contributor.thesis-reader          Qigang Gao  en_US
dc.contributor.thesis-reader          Malcolm Heywood  en_US
dc.contributor.thesis-supervisor      Vlado Keselj  en_US
dc.contributor.ethics-approval        Not Applicable  en_US
dc.contributor.manuscripts            Not Applicable  en_US
dc.contributor.copyright-release      Not Applicable  en_US
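
The central comparison in the abstract above, running through all three tasks, is between n-gram features and word embedding features fed to the same machine learning algorithms. The following is a minimal sketch of that comparison, not the thesis code: the toy tweets, labels, n-gram range, vector size, and the choice of LogisticRegression as the learner are all assumptions made here for illustration, using only standard scikit-learn and gensim APIs.

    # Illustrative sketch (assumed, not the thesis implementation):
    # tf-idf n-gram features vs. averaged Word2Vec embeddings, same learner.
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Hypothetical toy tweets and binary labels (e.g., spam = 1, ham = 0).
    tweets = [
        "win a free prize now click here",
        "bitcoin price is rising again today",
        "free free free click to claim your prize",
        "market sentiment on bitcoin looks bullish",
    ]
    labels = [1, 0, 1, 0]

    # Approach 1: word n-grams (unigrams and bigrams) weighted by tf-idf.
    tfidf = TfidfVectorizer(ngram_range=(1, 2))
    X_ngrams = tfidf.fit_transform(tweets)

    # Approach 2: train Word2Vec on the tokenized tweets, then average the
    # word vectors of each tweet into one fixed-length feature vector.
    tokenized = [t.split() for t in tweets]
    w2v = Word2Vec(sentences=tokenized, vector_size=50, min_count=1, seed=1)
    X_embed = np.array([np.mean([w2v.wv[w] for w in toks], axis=0)
                        for toks in tokenized])

    # Fit the same classifier on both feature sets; a real experiment would
    # report accuracy on held-out data, not on the training set.
    for name, X in (("tf-idf n-grams", X_ngrams), ("word2vec", X_embed)):
        clf = LogisticRegression(max_iter=1000).fit(X, labels)
        print(name, "training accuracy:", clf.score(X, labels))

Averaging word vectors is only one simple way to pool per-word embeddings into a tweet-level feature; the thesis may aggregate differently, for instance with AraVec or fine-tuned BERT representations in the Arabic-language tasks.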