An analysis of Crossref and OpenAlex as external knowledge sources for Retrieval-Augmented Generation

Abstract

This dissertation examines the types of noise and errors found in the titles and abstracts of bibliographic metadata, their prevalence in Crossref and OpenAlex, and their effects on a retrieval-augmented generation (RAG) architecture. In a Crossref subset, 42% of titles and 63% of abstracts contained some form of error or noise. In a corpus shared with OpenAlex, I observed a 10% decrease in missing information, but errors and noise remained. Two types were selected for testing: encoding noise (occurring in 5.7% of titles and 9.7% of abstracts) and multilingual text errors (found in 3.3% of titles and 6.7% of abstracts). Multilingual text benefited RAG retrieval, yielding higher document similarity scores when a multilingual embedding model was used. The generator was negatively affected when multilingual abstracts were paired with English-only models, as evidenced by lower faithfulness scores. Encoding noise, however, had no significant effect on retrieval or generation.
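
To illustrate what "encoding noise" in a title or abstract can look like, the following is a minimal Python sketch that flags a few common symptoms: mojibake from a mis-decoded UTF-8 string, unescaped HTML entities, and leftover markup tags. The patterns and the function names `encoding_noise_flags` and `noise_rate` are assumptions made for this illustration; they are not the dissertation's own definitions or detection code, which are in the linked notebooks.

```python
import re

# Hypothetical heuristics for illustration only; the dissertation's
# definition of encoding noise may be broader or narrower than this.
# Mojibake: UTF-8 bytes decoded as Latin-1/cp1252 (e.g. "Ã©" for "é"),
# or the Unicode replacement character.
MOJIBAKE = re.compile("\u00c3[\u0080-\u00bf]|\u00e2\u20ac|\ufffd")
# HTML/XML entities left unescaped in the text, e.g. "&amp;" or "&#233;".
ENTITY = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
# Leftover markup tags, e.g. "<jats:p>" or "<i>".
MARKUP = re.compile(r"</?[A-Za-z][^>]*>")


def encoding_noise_flags(text: str) -> dict:
    """Return one boolean per assumed symptom of encoding noise."""
    return {
        "mojibake": bool(MOJIBAKE.search(text)),
        "html_entity": bool(ENTITY.search(text)),
        "markup": bool(MARKUP.search(text)),
    }


def noise_rate(texts: list) -> float:
    """Fraction of texts that show at least one assumed symptom."""
    if not texts:
        return 0.0
    noisy = sum(1 for t in texts if any(encoding_noise_flags(t).values()))
    return noisy / len(texts)


if __name__ == "__main__":
    # Made-up sample strings, not records from Crossref or OpenAlex.
    samples = [
        "Caf\u00c3\u00a9 culture in Qu\u00c3\u00a9bec",
        "Health &amp; safety",
        "<jats:p>Background</jats:p>",
        "A clean title",
    ]
    for s in samples:
        print(s, encoding_noise_flags(s))
    print("share flagged:", noise_rate(samples))
```

A rule-based check like this is only a rough first pass: it will miss noise types it has no pattern for, and it can flag legitimate text that happens to contain an ampersand-semicolon sequence.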

Description

Notebooks and datasets can be found at https://github.com/poppy-nicolette/Dissertation

Keywords

Crossref, OpenAlex, metadata quality, retrieval-augmented generation, noise
