
Investigating Word Embedding Techniques for Extracting Disease, Gene, and Chemical Relationships from Biomedical Texts

Date

2024-09-25

Authors

S Pradeep, Sushumna

Abstract

This thesis investigates word embedding models, including PubMedBERT, BioBERT, SkipGram, CBOW, and GloVe, in the context of Literature-Based Discovery (LBD) within biomedical research, with a specific focus on cancer-related entities. First, I study the effectiveness of word embedding models in identifying known functional relationships (e.g., interactions) between genes, diseases, and chemicals, as recorded in the medical literature. As a reference, I use curated functional relationships from the Comparative Toxicogenomics Database (CTD). The goal is to evaluate each word embedding model, highlighting its strengths and weaknesses in identifying functional relationships in particular, and in biomedical text mining in general. Next, I study the ability of word embedding models to discover previously unknown functional relationships from the medical literature. I create word embeddings from the medical literature up until 2022 and check whether they can identify functional relationships that were not in CTD at that time (i.e., functional relationships found in CTD version 2024 but not in CTD version 2022; time-slicing). If this is successful, it means that word embedding models can conduct LBD: they can identify previously unknown functional relationships from the medical literature. I created word embeddings using CBOW, SkipGram, GloVe, BioBERT, and PubMedBERT from PubMed abstracts up to 2022. After generating the embeddings, I measured functional relatedness using cosine similarity for curated pairs from the CTD dataset. To evaluate the performance of these models, I calculated precision and recall by comparing the pairs predicted from the embedding vectors against the curated CTD pairs, using cosine similarity thresholds of 0.6, 0.7, and 0.8. Heatmaps were then plotted to compare model performance and identify which model produced the best results. The findings reveal that PubMedBERT and BioBERT significantly outperform traditional models such as CBOW, SkipGram, and GloVe in both precision and recall, especially at a cosine similarity threshold of 0.7, which was identified as the best balance between accuracy and comprehensive retrieval. The results also show that word embeddings created from PubMed abstracts up to 2022 are able to capture functional relationships in newly curated pairs from the CTD dataset. Specifically, this time-sliced set included 157 disease-chemical pairs, 138 disease-gene pairs, and 191 chemical-gene pairs; using the generated word embeddings, the models successfully captured relatedness in 42 disease-chemical pairs, 58 disease-gene pairs, and 83 chemical-gene pairs.
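For reference, a worked statement of the metrics described above (the notation is mine, not quoted from the thesis): given embedding vectors u and v for two entities, a pair is predicted as functionally related when its cosine similarity meets a threshold t ∈ {0.6, 0.7, 0.8}, and precision and recall are then computed against the curated CTD pairs.

```latex
% Cosine similarity between embedding vectors u and v:
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}

% A pair is predicted "related" when its cosine similarity is at least t;
% scoring the predictions against the curated CTD pairs:
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
```

On the time-sliced pairs reported above, this corresponds to hit rates of roughly 42/157 ≈ 27% for disease-chemical, 58/138 ≈ 42% for disease-gene, and 83/191 ≈ 43% for chemical-gene pairs.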

Description

This thesis investigates various word embedding models, including PubMedBERT, BioBERT, SkipGram, CBOW, and GloVe, in the context of Literature-Based Discovery (LBD) within biomedical research, specifically focusing on cancer-related entities. The study evaluates the effectiveness of these models in identifying known functional relationships among genes, diseases, and chemicals, using curated data from the Comparative Toxicogenomics Database (CTD) as a reference. Initially, the research assesses how well these models capture existing interactions within the medical literature. Subsequently, it explores the models' ability to discover previously unknown functional relationships, specifically targeting relationships that appear in CTD version 2024 but were absent from version 2022. Word embeddings were generated from PubMed abstracts up to 2022, and functional relatedness was measured using cosine similarity for curated pairs from the CTD dataset. Performance was evaluated through precision and recall at cosine similarity thresholds of 0.6, 0.7, and 0.8, and heatmaps were used to compare model performance. The findings indicate that PubMedBERT and BioBERT significantly outperformed traditional models such as CBOW, SkipGram, and GloVe, particularly at a threshold of 0.7, which balances accuracy and data retrieval. Notably, the embeddings successfully captured functional relationships in newly curated pairs from the CTD dataset, including 42 disease-chemical pairs, 58 disease-gene pairs, and 83 chemical-gene pairs, demonstrating the models' potential for conducting LBD on biomedical literature.
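A minimal sketch of this evaluation step, assuming embeddings are already available as a dictionary mapping entity identifiers to vectors and the curated/candidate pairs as sets of tuples; all names here are illustrative, not the thesis's actual code:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(embeddings, curated_pairs, candidate_pairs, threshold):
    """Predict a pair as functionally related when its cosine similarity
    meets the threshold, then score the predictions against CTD curation."""
    predicted = {
        (a, b) for (a, b) in candidate_pairs
        if a in embeddings and b in embeddings
        and cosine_similarity(embeddings[a], embeddings[b]) >= threshold
    }
    tp = len(predicted & curated_pairs)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(curated_pairs) if curated_pairs else 0.0
    return precision, recall

# Example: score one model at the three thresholds used in the study.
# for t in (0.6, 0.7, 0.8):
#     p, r = evaluate(model_embeddings, ctd_pairs, candidate_pairs, t)
```

Run per model, such a loop yields the grid of precision/recall values that the heatmaps visualize.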

Keywords

NLP, Word Embeddings, LBD
