Beyond Monocular Vision: Assessing LLaVA's Performance on an Augmented CLEVR-like Dataset with Binocular Images
Date
2025-07-07
Abstract
This thesis investigates how binocular vision affects the spatial reasoning capabilities of Large Language and Vision Assistant (LLaVA) models in visual question answering tasks. We develop BiCLEVR, an augmented CLEVR-like dataset featuring stereoscopic image pairs and expanded visual attributes, and use it to systematically evaluate the effect of different visual inputs across model sizes. Our experiments compare two LLaVA variants (7B and 13B parameters) on three dataset configurations: standard CLEVR, monocular BiCLEVR, and binocular BiCLEVR. The results reveal a nuanced relationship between model capacity and the ability to exploit stereoscopic information: the larger model showed significant performance gains with binocular input, while the smaller model's performance degraded, suggesting insufficient capacity to process the additional visual information effectively. The larger model's improvements were particularly pronounced on numerical comparison and counting tasks, indicating that stereoscopic cues enhance object individuation. These findings advance our understanding of how vision-language models process spatial information and point toward more robust visual reasoning systems capable of understanding 3D relationships in complex environments.
Keywords
Visual Question Answering, Multimodal Models, Stereo Vision