CONTEXT-AWARE SEMANTIC TEXT MINING AND REPRESENTATION LEARNING FOR TEXT DISAMBIGUATION AND ONLINE HARASSMENT CLASSIFICATION
Date
2023-12-15
Authors
Saeidi, Mozhgan
Abstract
This dissertation presents a new method for text representation learning and applies it to two Natural Language Processing (NLP) problems: word sense disambiguation and text classification. Word Sense Disambiguation (WSD) is the problem of identifying which meaning a word carries in a text when several meanings are possible. The candidate meanings are extracted from a knowledge base, and the correct one is identified from the surrounding words and prior knowledge. When Wikipedia serves as the knowledge base, the problem is referred to as Wikification. We provide two algorithms for the Wikification problem that segment the text and assign weights to the different meanings of a word based on the relevance of their context. For the WSD problem, we study how representation learning affects the final output of the WSD algorithm and incorporate our novel representation learning approach. Using our representations with the 1-nearest-neighbor algorithm, we demonstrate that they outperform state-of-the-art models on the WSD task. We evaluate our representation method on general English and biomedical texts. The results demonstrate that incorporating context from various sources into the representations improves performance on the WSD task.
Text classification is the second NLP problem that we study. We consider a collection of tweets and classify them into two groups, harassment and non-harassment. This binary classification task is addressed with standard supervised methods. Next, we categorize harassment tweets into specified harassment types, for which we combine our novel text representation with a graph convolutional network. In experiments, we demonstrate the effectiveness of our approach by comparing it with other language models and classical representation models.
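As an illustration of the 1-nearest-neighbor step mentioned in the abstract, the following is a minimal sketch, not the dissertation's implementation. It assumes that each candidate sense and the word's context have already been mapped to vectors, and that cosine similarity is the nearness measure; the function names, the sense labels, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(context_vec, sense_vecs):
    """1-nearest-neighbor WSD: return the sense label whose vector is
    most similar to the context vector."""
    return max(sense_vecs, key=lambda label: cosine(context_vec, sense_vecs[label]))


# Hypothetical, hand-made sense vectors for the ambiguous word "bank".
# In practice these would come from a learned representation model.
SENSES = {
    "bank (financial institution)": [0.9, 0.1, 0.0],
    "bank (river side)": [0.1, 0.9, 0.2],
}

# Hypothetical context vector for a sentence about money.
context = [0.8, 0.2, 0.1]

best = disambiguate(context, SENSES)
print(best)
```

In this sketch the quality of the prediction depends entirely on the vectors, which is consistent with the abstract's focus on how representation learning affects the WSD output.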
Description
PhD thesis
Keywords
Natural Language Processing, Machine Learning, Large Language Models, Word Sense Disambiguation, Wikification, Text Classification, Representation Learning, Deep Learning