Show simple item record

dc.contributor.author: Haji Soleimani, Behrouz
dc.date.accessioned: 2019-06-25T12:55:17Z
dc.date.available: 2019-06-25T12:55:17Z
dc.date.issued: 2019-06-25T12:55:17Z
dc.identifier.uri: http://hdl.handle.net/10222/75870
dc.description.abstract: In the big data era, most of the data generated every day, such as text and image data, are high dimensional. Learning compact representations from the input data can help in dealing with this high dimensionality and in visualizing the data. These representations can be learned by mapping the input space into a latent space in which, for example, complex relationships present in the data become more explicit, while the structure and geometrical properties of the data are preserved. In the case of discrete input features such as text, embedding algorithms try to learn a vector space representation of the inputs while preserving certain aspects of similarity or relatedness. Training these embeddings is time-consuming and requires a large corpus. This makes it difficult to train task-specific embeddings when the downstream task has insufficient data, which has given rise to pre-trained models and embeddings. In this thesis, we first study dimensionality reduction methods and propose Unit Ball Embedding (UBE), a spherical representation learning method that learns the structure of manifolds and maximizes their separability, making it suitable for both clustering and visualization. We then generalize the algorithm and apply it to learn word embeddings, taking into account the contexts of individual words. Our word embedding solution learns a better distribution of words in the latent space by pushing unrelated words away from each other. We investigate and address the naive negative sampling procedure in Word2Vec and exploit the full capacity of negative examples. We also analyze frequency-based and spectral word embedding methods and show how the idea of negative context can be used in these types of algorithms. We propose EigenWord, a spectral word embedding with an intuitive formulation that makes use of negative examples, and we theoretically show that it has an optimal closed-form solution. Finally, we tackle the Word Sense Disambiguation (WSD) problem and propose a multi-sense embedding algorithm based on EigenWord. We evaluate the proposed algorithms using both intrinsic and extrinsic evaluation of embeddings. Extensive experiments on word similarity datasets, emotion recognition from tweets, and sentiment analysis show the effectiveness of our proposed methods. (en_US)
dc.language.iso: en (en_US)
dc.subject: Embedding (en_US)
dc.subject: Deep Learning (en_US)
dc.subject: Natural Language Processing (NLP) (en_US)
dc.subject: Word Embedding (en_US)
dc.subject: Matrix Factorization (en_US)
dc.subject: Singular Value Decomposition (en_US)
dc.subject: Computer Vision (en_US)
dc.subject: Dimensionality Reduction (en_US)
dc.subject: Representation Learning (en_US)
dc.title: Learning Embeddings for Text and Images from Structure of the Data (en_US)
dc.date.defence: 2019-06-13
dc.contributor.department: Faculty of Computer Science (en_US)
dc.contributor.degree: Doctor of Philosophy (en_US)
dc.contributor.external-examiner: Dr. Jackie Chi Kit Cheung (en_US)
dc.contributor.graduate-coordinator: Dr. Michael McAllister (en_US)
dc.contributor.thesis-reader: Dr. Daniel Silver (en_US)
dc.contributor.thesis-reader: Dr. Sageev Oore (en_US)
dc.contributor.thesis-reader: Dr. Vlado Keselj (en_US)
dc.contributor.thesis-supervisor: Dr. Stan Matwin (en_US)
dc.contributor.ethics-approval: Not Applicable (en_US)
dc.contributor.manuscripts: Not Applicable (en_US)
dc.contributor.copyright-release: Yes (en_US)