Beyond Monocular Vision: Assessing LLaVA's Performance on an Augmented CLEVR-like Dataset with Binocular Images

Date

2025-07-07

Abstract

This thesis investigates how binocular vision affects the spatial reasoning capabilities of Large Language and Vision Assistant (LLaVA) models in visual question answering tasks. We develop BiCLEVR, an augmented CLEVR-like dataset featuring stereoscopic image pairs and expanded visual attributes, and use it to systematically evaluate the effect of different visual inputs across model sizes. Our experiments compare two LLaVA variants (7B and 13B parameters) across three dataset configurations: standard CLEVR, monocular BiCLEVR, and binocular BiCLEVR. The results show that the ability to leverage stereoscopic information depends on model capacity. The 13B model demonstrated significant performance improvements with binocular input, while the 7B model showed degraded performance, suggesting that it lacks the capacity to process the additional visual information effectively. The 13B model's improvements were particularly notable in numerical comparison and counting tasks, indicating that stereoscopic cues enhance object individuation. These findings contribute to our understanding of how vision-language models process spatial information and point toward more robust visual reasoning systems capable of understanding 3D relationships in complex environments.

Keywords

Visual Question Answering, Multimodal Model, Stereo Vision
