An analysis of Crossref and OpenAlex as external knowledge sources for Retrieval-Augmented Generation

dc.contributor.author: Riddle, Poppy Nicolette
dc.contributor.copyright-release: Not Applicable
dc.contributor.degree: Doctor of Philosophy
dc.contributor.department: School of Information Management
dc.contributor.ethics-approval: Not Applicable
dc.contributor.external-examiner: Dr. Julia Bullard
dc.contributor.manuscripts: Not Applicable
dc.contributor.thesis-reader: Dr. Colin Conrad
dc.contributor.thesis-reader: Dr. Mike Smit
dc.contributor.thesis-supervisor: Dr. Philippe Mongeon
dc.date.accessioned: 2026-04-16T12:13:21Z
dc.date.available: 2026-04-16T12:13:21Z
dc.date.defence: 2026-04-09
dc.date.issued: 2026-04-14
dc.description: Notebooks and datasets can be found at https://github.com/poppy-nicolette/Dissertation
dc.description.abstract: This dissertation explored types of noise and errors present in titles and abstracts of bibliographic metadata, prevalence in Crossref and OpenAlex, and effects on a RAG architecture. Results from a Crossref subset showed that 42% of titles and 63% of abstracts have some form of errors or noise. In a shared corpus with OpenAlex, I observed a 10% decrease in missing information, but errors and noise remained. Two types were selected to test: encoding noise (occurring in 5.7% of titles and 9.7% of abstracts) and multilingual text errors (found in 3.3% of titles and 6.7% of abstracts). RAG retrieval was shown to be beneficially affected by multilingual text with higher document similarity scores when a multilingual embedding model was used. The generator was negatively affected when multilingual abstracts were used with English-only models as evidenced by lower faithfulness scores. However, encoding noise had no significant effect on retrieval or generation.
dc.identifier.uri: https://hdl.handle.net/10222/86002
dc.language.iso: en
dc.subject: Crossref
dc.subject: OpenAlex
dc.subject: metadata quality
dc.subject: retrieval-augmented generation
dc.subject: noise
dc.title: An analysis of Crossref and OpenAlex as external knowledge sources for Retrieval-Augmented Generation

Files

Original bundle

Name: PoppyRiddle2026.pdf
Size: 4.57 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.12 KB
Format: Item-specific license agreed upon to submission