Despite significant advances in the capabilities of Large Vision-Language Models (LVLMs), existing pixel-grounding models operate in single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, along with PRIMA, an LVLM that integrates pixel-level grounding with robust multi-image reasoning to produce contextually rich, pixel-grounded explanations. Central to PRIMA is SQuARE, a vision module that injects cross-image relational context into compact query-based visual tokens before fusing them with the language backbone. To support training and evaluation, we curate M4Seg, a new multi-image reasoning segmentation benchmark consisting of ~744K question-answer pairs that require fine-grained visual understanding across multiple images. PRIMA outperforms state-of-the-art baselines with 7.83% and 11.25% improvements in Recall and S-IoU, respectively. Ablation studies further demonstrate the effectiveness of the proposed SQuARE module in capturing cross-image relationships.
Overview of the proposed PRIMA architecture. Leveraging a LoRA-finetuned language model, a novel SQuARE vision encoder, and a SAM-based decoder, PRIMA dynamically generates segmentation masks corresponding to objects referenced in natural language queries, supporting pixel-grounded reasoning in complex multi-image tasks.
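As a rough illustration of this data flow, the sketch below wires together a per-image encoder, a SQuARE-style relational module, a LoRA-finetuned language backbone, and a SAM-based mask decoder. The component interfaces, the `visual_tokens` argument, and the segmentation-token readout (`seg_token_id`, `out.hidden_states`, `out.token_ids`) are assumptions made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class PRIMAPipelineSketch(nn.Module):
    """Minimal sketch of the described data flow (not the released code):
    per-image features -> SQuARE cross-image enrichment -> LoRA-finetuned LLM ->
    SAM-based decoder prompted by segmentation-token embeddings. Component
    interfaces and the segmentation-token readout are assumptions."""

    def __init__(self, vision_encoder, square, lora_llm, sam_decoder, seg_token_id):
        super().__init__()
        self.vision_encoder = vision_encoder  # frozen per-image feature extractor
        self.square = square                  # injects cross-image relational context
        self.llm = lora_llm                   # LoRA-finetuned language backbone
        self.sam_decoder = sam_decoder        # promptable, SAM-based mask decoder
        self.seg_token_id = seg_token_id      # hypothetical id of a segmentation token

    def forward(self, images, input_ids):
        # images: (B, N, C, H, W), N images per sample; input_ids: tokenized query.
        B, N = images.shape[:2]
        feats = torch.stack(
            [self.vision_encoder(images[:, i]) for i in range(N)], dim=1
        )  # (B, N, P, D) patch features per image

        # Cross-image relational enrichment of compact visual tokens (SQuARE).
        visual_tokens = self.square(feats)  # (B, N, Q, D)

        # Language backbone consumes the visual tokens alongside the text query.
        out = self.llm(visual_tokens=visual_tokens, input_ids=input_ids)

        # Each segmentation token's hidden state prompts the SAM-based decoder
        # to produce a mask grounded in the referenced image(s).
        masks = []
        for b in range(B):
            seg_positions = (out.token_ids[b] == self.seg_token_id).nonzero(as_tuple=True)[0]
            masks.append([
                self.sam_decoder(feats[b], prompt_embedding=out.hidden_states[b, pos])
                for pos in seg_positions
            ])
        return out.text, masks
```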
Our proposed SQuARE module. Learnable relational queries attend over the concatenated multi-image features to form a shared relational representation. This representation is injected into the query pathway for global feature extraction, producing enriched visual representations that capture cross-image interactions.
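For readers who prefer code, the PyTorch-style sketch below shows one way the mechanism in this caption could be realized: learnable relational queries cross-attend over the concatenated multi-image features, and the resulting shared relational representation is injected into a per-image query pathway that extracts global features. The module and parameter names, the Q-Former-style per-image query tokens, the cross-attention-plus-residual fusion, and all dimensions are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

class SQuARESketch(nn.Module):
    """Illustrative sketch of a SQuARE-style module (assumed design, not the release):
    learnable relational queries summarize cross-image context from concatenated
    multi-image features, and that context is injected into a query pathway that
    extracts compact visual tokens for each image."""

    def __init__(self, dim=1024, num_query_tokens=32, num_relational_queries=32, num_heads=8):
        super().__init__()
        # Learnable relational queries, shared across all images in a sample.
        self.relational_queries = nn.Parameter(torch.randn(num_relational_queries, dim))
        # Learnable per-image query tokens (Q-Former-style; an assumption here).
        self.query_tokens = nn.Parameter(torch.randn(num_query_tokens, dim))
        # Relational queries attend over the concatenated multi-image features.
        self.relational_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Query tokens read the shared relational representation (context injection).
        self.inject_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Query tokens then attend over their own image's features (global extraction).
        self.extract_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feats):
        """image_feats: (B, N, P, D) patch features for N images; returns (B, N, Q, D)."""
        B, N, P, D = image_feats.shape
        Q = self.query_tokens.shape[0]

        # 1) Concatenate all images' features and let the relational queries
        #    form a shared cross-image relational representation.
        joint = image_feats.reshape(B, N * P, D)
        rel_q = self.relational_queries.unsqueeze(0).expand(B, -1, -1)
        relational_repr, _ = self.relational_attn(rel_q, joint, joint)  # (B, R, D)

        # 2) Inject that relational context into the query pathway via
        #    cross-attention with a residual connection (fusion choice is assumed).
        queries = self.query_tokens.unsqueeze(0).expand(B * N, -1, -1)  # (B*N, Q, D)
        rel_per_image = relational_repr.repeat_interleave(N, dim=0)     # (B*N, R, D)
        injected, _ = self.inject_attn(queries, rel_per_image, rel_per_image)
        queries = self.norm(queries + injected)

        # 3) Context-enriched queries extract global features from their own image.
        per_image_feats = image_feats.reshape(B * N, P, D)
        enriched, _ = self.extract_attn(queries, per_image_feats, per_image_feats)
        return enriched.reshape(B, N, Q, D)
```

The fusion between the relational representation and the query pathway could equally be additive or gated; the cross-attention variant above is just one plausible reading of the caption.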
Experimental Results on M4Seg. PRIMA significantly outperforms both general-purpose and pixel-grounding LVLM baselines. The general-purpose LVLMs struggle with this task, a limitation we attribute to their lack of task-specific training and to cascading errors from prompting SAM with their text-based grounding information. While Gemini-2.5 Pro is the top performer in this category, its performance remains limited, at 30.21% mIoU and 68.51% SS. The reasoning segmentation baselines perform better, as they leverage pixel-grounding capabilities and are finetuned on M4Seg. GLaMM, for instance, outperforms the general-purpose LVLMs with 38.12% mIoU and 74.05% SS. Compared to all baselines, PRIMA sets a new state of the art, surpassing the next-best baseline by 8.11% and 12.33% in mIoU and I-SS, respectively. PRIMA also achieves gains of up to 6.45% and 11.25% on the text metrics SS and S-IoU.
@article{wahed2024prima,
  title={PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation},
  author={Wahed, Muntasir and Nguyen, Kiet A and Juvekar, Adheesh Sunil and Li, Xinzhuo and Zhou, Xiaona and Shah, Vedant and Yu, Tianjiao and Yanardag, Pinar and Lourentzou, Ismini},
  journal={arXiv preprint arXiv:2412.15209},
  year={2024}
}