Learning Embeddings for Text and Images from Structure of the Data
In the big data era, most of the data generated every day, such as text and images, are high-dimensional. Learning compact representations of the input data helps both to cope with this high dimensionality and to visualize the data. Such representations map the input space into a latent space in which, for example, complex relationships present in the data become more explicit, while the structure and geometric properties of the data are preserved. For discrete input features such as text, embedding algorithms try to learn a vector space representation of the inputs while preserving certain aspects of similarity or relatedness. Training these embeddings is time-consuming and requires a large corpus. This makes it difficult to train task-specific embeddings, because the data available in downstream tasks are insufficient, and it has given rise to pre-trained models and embeddings. In this thesis, we first study dimensionality reduction methods and propose Unit Ball Embedding (UBE), a spherical representation learning method that learns the structure of manifolds and maximizes their separability, which makes it suitable for both clustering and visualization. We then generalize the algorithm and apply it to learning word embeddings, taking into account the contexts of individual words. Our word embedding solution learns a better distribution of words in the latent space by pushing unrelated words away from each other. We investigate and address the naive negative sampling procedure in Word2Vec and exploit the full capacity of negative examples. We also analyze frequency-based and spectral word embedding methods and show how the idea of negative context can be used in these types of algorithms. We propose EigenWord, a spectral word embedding method with an intuitive formulation that makes use of negative examples, and show theoretically that it has an optimal closed-form solution.
Finally, we tackle the Word Sense Disambiguation (WSD) problem and propose a multi-sense embedding algorithm based on EigenWord. We evaluate the proposed algorithms using both intrinsic and extrinsic evaluations of the embeddings. Extensive experiments on word similarity datasets, emotion recognition from tweets, and sentiment analysis show the effectiveness of our proposed methods.