An analysis of Crossref and OpenAlex as external knowledge sources for Retrieval-Augmented Generation
Abstract
This dissertation explored the types of noise and errors present in the titles and abstracts of bibliographic metadata, their prevalence in Crossref and OpenAlex, and their effects on a retrieval-augmented generation (RAG) architecture. Results from a Crossref subset showed that 42% of titles and 63% of abstracts contained some form of error or noise. In a corpus shared with OpenAlex, I observed a 10% decrease in missing information, but errors and noise remained. Two types were selected for testing: encoding noise (occurring in 5.7% of titles and 9.7% of abstracts) and multilingual text errors (found in 3.3% of titles and 6.7% of abstracts). Multilingual text benefited RAG retrieval, yielding higher document similarity scores when a multilingual embedding model was used. The generator was negatively affected when multilingual abstracts were used with English-only models, as evidenced by lower faithfulness scores. Encoding noise, however, had no significant effect on retrieval or generation.
Description
Notebooks and datasets can be found at https://github.com/poppy-nicolette/Dissertation
Keywords
Crossref, OpenAlex, metadata quality, retrieval-augmented generation, noise
