Show simple item record

dc.contributor.author: Haji Soleimani, Behrouz
dc.date.accessioned: 2019-06-25T12:55:17Z
dc.date.available: 2019-06-25T12:55:17Z
dc.date.issued: 2019-06-25T12:55:17Z
dc.identifier.uri: http://hdl.handle.net/10222/75870
dc.description.abstract: In the big data era, most of the data generated every day, such as text and image data, are high dimensional. Learning compact representations from the input data can help in dealing with this high dimensionality and in visualizing the data. These representations can be learned by mapping the input space into a latent space in which, for example, complex relationships present in the data become more explicit, while the structure and geometrical properties of the data are preserved. In the case of discrete input features such as text, embedding algorithms try to learn a vector space representation of the inputs while preserving certain aspects of similarity or relatedness. Training these embeddings is time-consuming and requires a large corpus. This makes it difficult to train task-specific embeddings when the downstream task has insufficient data, which has given rise to pre-trained models and embeddings. In this thesis, we first study dimensionality reduction methods and propose Unit Ball Embedding (UBE), a spherical representation learning method that learns the structure of manifolds and maximizes their separability, making it suitable for both clustering and visualization. We then generalize the algorithm and apply it to learn word embeddings, taking into account the contexts of individual words. Our word embedding solution learns a better distribution of words in the latent space by pushing unrelated words away from each other. We investigate and address the naive negative sampling procedure in Word2Vec and exploit the full capacity of negative examples. We also analyze frequency-based and spectral word embedding methods and show how the idea of negative context can be used in these types of algorithms. We propose EigenWord, a spectral word embedding with an intuitive formulation that makes use of negative examples, and we theoretically show that it has an optimal closed-form solution. Finally, we tackle the Word Sense Disambiguation (WSD) problem and propose a multi-sense embedding algorithm based on EigenWord. We evaluate the proposed algorithms using both intrinsic and extrinsic evaluation of embeddings. Extensive experiments on word similarity datasets, emotion recognition from tweets, and sentiment analysis show the effectiveness of our proposed methods. (en_US)
dc.language.iso: en (en_US)
dc.subject: Embedding (en_US)
dc.subject: Deep Learning (en_US)
dc.subject: Natural Language Processing (NLP) (en_US)
dc.subject: Word Embedding (en_US)
dc.subject: Matrix Factorization (en_US)
dc.subject: Singular Value Decomposition (en_US)
dc.subject: Computer Vision (en_US)
dc.subject: Dimensionality Reduction (en_US)
dc.subject: Representation Learning (en_US)
dc.title: Learning Embeddings for Text and Images from Structure of the Data (en_US)
dc.date.defence: 2019-06-13
dc.contributor.department: Faculty of Computer Science (en_US)
dc.contributor.degree: Doctor of Philosophy (en_US)
dc.contributor.external-examiner: Dr. Jackie Chi Kit Cheung (en_US)
dc.contributor.graduate-coordinator: Dr. Michael McAllister (en_US)
dc.contributor.thesis-reader: Dr. Daniel Silver (en_US)
dc.contributor.thesis-reader: Dr. Sageev Oore (en_US)
dc.contributor.thesis-reader: Dr. Vlado Keselj (en_US)
dc.contributor.thesis-supervisor: Dr. Stan Matwin (en_US)
dc.contributor.ethics-approval: Not Applicable (en_US)
dc.contributor.manuscripts: Not Applicable (en_US)
dc.contributor.copyright-release: Yes (en_US)