An analysis of Crossref and OpenAlex as external knowledge sources for Retrieval-Augmented Generation
| dc.contributor.author | Riddle, Poppy Nicolette | |
| dc.contributor.copyright-release | Not Applicable | |
| dc.contributor.degree | Doctor of Philosophy | |
| dc.contributor.department | School of Information Management | |
| dc.contributor.ethics-approval | Not Applicable | |
| dc.contributor.external-examiner | Dr. Julia Bullard | |
| dc.contributor.manuscripts | Not Applicable | |
| dc.contributor.thesis-reader | Dr. Colin Conrad | |
| dc.contributor.thesis-reader | Dr. Mike Smit | |
| dc.contributor.thesis-supervisor | Dr. Philippe Mongeon | |
| dc.date.accessioned | 2026-04-16T12:13:21Z | |
| dc.date.available | 2026-04-16T12:13:21Z | |
| dc.date.defence | 2026-04-09 | |
| dc.date.issued | 2026-04-14 | |
| dc.description | Notebooks and datasets can be found at https://github.com/poppy-nicolette/Dissertation | |
| dc.description.abstract | This dissertation explored the types of noise and errors present in the titles and abstracts of bibliographic metadata, their prevalence in Crossref and OpenAlex, and their effects on a retrieval-augmented generation (RAG) architecture. Results from a Crossref subset showed that 42% of titles and 63% of abstracts contain some form of error or noise. In a shared corpus with OpenAlex, I observed a 10% decrease in missing information, but errors and noise remained. Two types were selected for testing: encoding noise (occurring in 5.7% of titles and 9.7% of abstracts) and multilingual text errors (found in 3.3% of titles and 6.7% of abstracts). Retrieval benefited from multilingual text, yielding higher document similarity scores, when a multilingual embedding model was used. Generation was negatively affected when multilingual abstracts were used with English-only models, as evidenced by lower faithfulness scores. Encoding noise, by contrast, had no significant effect on either retrieval or generation. | |
| dc.identifier.uri | https://hdl.handle.net/10222/86002 | |
| dc.language.iso | en | |
| dc.subject | Crossref | |
| dc.subject | OpenAlex | |
| dc.subject | metadata quality | |
| dc.subject | retrieval-augmented generation | |
| dc.subject | noise | |
| dc.title | An analysis of Crossref and OpenAlex as external knowledge sources for Retrieval-Augmented Generation |
