Investigating Word Embedding Techniques for Extracting Disease, Gene, and Chemical Relationships from Biomedical Texts
Abstract
This thesis investigates word embedding models, including PubMedBERT, BioBERT, SkipGram, CBOW, and GloVe, in the context of Literature-Based Discovery (LBD) within biomedical research, with a specific focus on cancer-related entities. Firstly, I study the effectiveness of word embedding models in identifying current functional relationships (e.g., interaction) between genes, diseases, and chemicals, as recorded in the medical literature. As a reference, I use curated functional relationships from the Comparative Toxicogenomics Database (CTD). The goal is to evaluate each word embedding model, highlighting their strengths and weaknesses in identifying functional relationships in particular, and in biomedical text mining in general.
Next, I study the ability of word embedding models to discover previously unknown functional relationships from the medical literature. I create word embeddings from the medical literature up to 2022 and check whether they can identify functional relationships that were not yet in CTD at that time (i.e., functional relationships present in the 2024 version of CTD but absent from the 2022 version; a time-slicing approach). If this succeeds, it means that word embedding models can conduct LBD: they can identify previously unknown functional relationships from the medical literature.
I created word embeddings with CBOW, SkipGram, GloVe, BioBERT, and PubMedBERT, trained on PubMed abstracts up to 2022. After generating the embeddings, I measured functional relatedness as the cosine similarity between the embedding vectors of curated entity pairs from the CTD dataset. To evaluate the models, I computed precision and recall by comparing the pairs predicted from these similarity scores against the curated CTD pairs, using cosine similarity thresholds of 0.6, 0.7, and 0.8. Heatmaps of these values were then plotted to compare model performance and identify which model produced the best results.
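The scoring step described here can be sketched as follows; this is a minimal illustration rather than the thesis's actual code, and the toy 2-D vectors, pair labels, and helper names are all placeholders I introduce for demonstration:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def precision_recall(scores, labels, threshold):
    """Treat similarity >= threshold as a predicted relationship;
    labels mark whether the pair is curated in CTD (1) or not (0)."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy 2-D vectors standing in for real embedding vectors of entity pairs
pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.1]), 1),  # curated, similar
    (np.array([1.0, 1.0]), np.array([1.0, 0.9]), 1),  # curated, similar
    (np.array([1.0, 0.0]), np.array([0.5, 0.9]), 1),  # curated, dissimilar
    (np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0),  # not curated
]
scores = [cosine_similarity(u, v) for u, v, _ in pairs]
labels = [label for _, _, label in pairs]
prec, rec = precision_recall(scores, labels, threshold=0.7)
```

In the actual evaluation, the same computation is repeated at thresholds 0.6, 0.7, and 0.8 for each embedding model, and the resulting precision/recall values populate the heatmaps.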
The findings reveal that PubMedBERT and BioBERT significantly outperform traditional models such as CBOW, SkipGram, and GloVe in both precision and recall, especially at a cosine similarity threshold of 0.7, which emerged as the best trade-off between precision and recall.
The results also show that word embeddings built from PubMed abstracts up to 2022 can capture functional relationships in newly curated pairs from the CTD dataset. Specifically, the new pairs comprised 157 disease-chemical pairs, 138 disease-gene pairs, and 191 chemical-gene pairs; the embeddings captured relatedness in 42 of the disease-chemical pairs, 58 of the disease-gene pairs, and 83 of the chemical-gene pairs.