Riddle, Poppy Nicolette
2026-04-16
2026-04-16
2026-04-14
https://hdl.handle.net/10222/86002
Notebooks and datasets can be found at https://github.com/poppy-nicolette/Dissertation
This dissertation explored the types of noise and errors present in the titles and abstracts of bibliographic metadata, their prevalence in Crossref and OpenAlex, and their effects on a retrieval-augmented generation (RAG) architecture. Results from a Crossref subset showed that 42% of titles and 63% of abstracts contain some form of error or noise. In a shared corpus with OpenAlex, I observed a 10% decrease in missing information, but errors and noise remained. Two types were selected for testing: encoding noise (occurring in 5.7% of titles and 9.7% of abstracts) and multilingual text errors (found in 3.3% of titles and 6.7% of abstracts). RAG retrieval benefited from multilingual text, with higher document similarity scores when a multilingual embedding model was used. The generator was negatively affected when multilingual abstracts were used with English-only models, as evidenced by lower faithfulness scores. However, encoding noise had no significant effect on retrieval or generation.
en
Crossref
OpenAlex
metadata quality
retrieval-augmented generation
noise
An analysis of Crossref and OpenAlex as external knowledge sources for Retrieval-Augmented Generation