
Beyond Monocular Vision: Assessing LLaVA's Performance on an Augmented CLEVR-like Dataset with Binocular Images

dc.contributor.author: Devesh, Sagar
dc.contributor.copyright-release: Not Applicable
dc.contributor.degree: Master of Computer Science
dc.contributor.department: Faculty of Computer Science
dc.contributor.ethics-approval: Not Applicable
dc.contributor.external-examiner: n/a
dc.contributor.manuscripts: Not Applicable
dc.contributor.thesis-reader: Vlado Keselj
dc.contributor.thesis-reader: Hassan Sajjad
dc.contributor.thesis-supervisor: Frank Rudzicz
dc.date.accessioned: 2025-07-11T14:13:23Z
dc.date.available: 2025-07-11T14:13:23Z
dc.date.defence: 2025-05-29
dc.date.issued: 2025-07-07
dc.description.abstract: This thesis investigates how binocular vision impacts the spatial reasoning capabilities of Large Language and Vision Assistant (LLaVA) models in visual question answering tasks. By developing BiCLEVR, an augmented CLEVR-like dataset featuring stereoscopic image pairs and expanded visual attributes, we systematically evaluate the effect of different visual inputs across varying model sizes. Our experiments compare two LLaVA variants (7B and 13B parameters) across three dataset configurations: standard CLEVR, monocular BiCLEVR, and binocular BiCLEVR. Results reveal a nuanced relationship between model capacity and the ability to leverage stereoscopic information. The larger model demonstrated significant performance improvements with binocular input, while the smaller model showed degraded performance, suggesting insufficient capacity to process the additional visual information effectively. Particularly notable were improvements in numerical comparison and counting tasks for the larger model, indicating that stereoscopic cues enhance object individuation abilities. These findings contribute to our understanding of how vision-language models process spatial information and provide a pathway toward more robust visual reasoning systems capable of understanding 3D relationships in complex environments.
dc.identifier.uri: https://hdl.handle.net/10222/85207
dc.language.iso: en
dc.subject: Visual Question Answering
dc.subject: Multimodal model
dc.subject: Stereo Vision
dc.title: Beyond Monocular Vision: Assessing LLaVA's Performance on an Augmented CLEVR-like Dataset with Binocular Images

Files

Original bundle

Name: SagarDevesh2025.pdf
Size: 3.08 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.12 KB
Format: Item-specific license agreed upon to submission