Concept Embedding for Deep Neural Functional Analysis of Genes and Deep Neural Word Sense Disambiguation of Biomedical Text
Most existing Gene Ontology (GO)-based measures of gene functional similarity combine information content-based semantic similarity scores of individual GO-term pairs, whereas a few models instead use Jaccard similarity to compare GO terms in groups. However, almost all of these measures depend on the ever-changing structure of GO, are slow and task-dependent, and ignore the valuable natural language definitions of GO terms. The first part of this thesis introduces the simDEF model, which avoids these drawbacks by taking advantage of distributed representations of GO terms built from their text definitions. Several challenges remain that Deep Learning can address: manual feature engineering, the high dimensionality of distributed GO-term vectors, the use of traditional metrics to aggregate GO-term similarity scores before computing gene functional similarity, and the separate evaluation of each GO sub-ontology (biological process, cellular component, or molecular function) in a biological task. Therefore, we introduce deepSimDEF, which avoids most of these issues. The deepSimDEF network(s) learn low-dimensional vectors (embeddings) of GO terms and gene products, and then learn how to calculate the functional similarity of protein pairs from these vectors. By considering all GO sub-ontologies, deepSimDEF increases yeast protein-protein interaction (PPI) predictability by ~4%, improves Pearson's correlation by >6% with yeast gene expression and by >4% with human gene expression, and improves correlation with yeast sequence homology by up to 11%. This method for building distributed representations of GO terms can also be used in other domains of Machine Learning for low-dimensional embedding of concepts.
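The contrast drawn above, between comparing GO terms in groups with Jaccard similarity and comparing genes through term vectors, can be illustrated with a minimal Python sketch. It is not the simDEF or deepSimDEF method: the term labels, the toy vectors, and the averaging step are hypothetical choices made only for illustration, whereas deepSimDEF learns both the embeddings and the similarity function.

```python
import math


def jaccard(terms_a, terms_b):
    """Jaccard similarity between two sets of GO-term annotations."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # convention chosen here for two unannotated genes
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical term labels and 2-dimensional embeddings (not real GO data).
embeddings = {
    "term_1": [1.0, 0.0],
    "term_2": [0.8, 0.2],
    "term_3": [0.0, 1.0],
    "term_4": [0.1, 0.9],
}
gene_a = ["term_1", "term_2", "term_3"]
gene_b = ["term_2", "term_3", "term_4"]

set_based = jaccard(gene_a, gene_b)
vector_based = cosine(
    mean_vector([embeddings[t] for t in gene_a]),
    mean_vector([embeddings[t] for t in gene_b]),
)
print(f"Jaccard: {set_based:.3f}, embedding cosine: {vector_based:.3f}")
```

The set-based score only counts exact term overlap, while the vector-based score can also reward annotations that are different terms but lie close together in the embedding space.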
In the second part of this thesis, this concept embedding method is evaluated on the task of Word Sense Disambiguation (WSD) of natural text. To this end, we devise deepBioWSD, a one-size-fits-all model consisting of a single Bidirectional Long Short-Term Memory network classifier. We use the MSH-WSD dataset to compare WSD algorithms, with macro and micro accuracy as evaluation metrics. We show that deepBioWSD outperforms the existing supervised models in (biomedical) text WSD, achieving state-of-the-art macro accuracy of 96.82%.
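The two evaluation metrics named above can be sketched in Python as follows. This assumes the common convention that macro accuracy averages the per-term accuracies over all ambiguous terms, while micro accuracy pools every test instance; the thesis body may define them slightly differently. The term names and counts are hypothetical.

```python
def macro_accuracy(per_term):
    """Mean of per-term accuracies; every ambiguous term weighs the same."""
    accuracies = [correct / total for correct, total in per_term.values()]
    return sum(accuracies) / len(accuracies)


def micro_accuracy(per_term):
    """Pooled accuracy; every test instance weighs the same."""
    correct = sum(c for c, _ in per_term.values())
    total = sum(t for _, t in per_term.values())
    return correct / total


# Hypothetical (correct, total) counts per ambiguous term.
results = {
    "term_x": (9, 10),
    "term_y": (1, 2),
}

print(f"macro: {macro_accuracy(results):.4f}")
print(f"micro: {micro_accuracy(results):.4f}")
```

The two numbers diverge when ambiguous terms have different numbers of test instances, which is why both are usually reported.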