PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

1University of Illinois Urbana-Champaign, 2Virginia Tech
* Equal Contribution
TL;DR: We introduce the task of multi-image pixel-grounded reasoning segmentation, along with M4Seg, a benchmark of ~744K question-answer pairs that require fine-grained visual understanding across multiple images, and PRIMA, an LVLM that combines pixel-level grounding with multi-image reasoning capabilities.
We introduce the new task of multi-image pixel-grounded reasoning segmentation. To support this task, we curate M4Seg, a benchmark providing question-answer (QA) pairs alongside image sets with pixel-level annotations. Additionally, we propose PRIMA, a model designed to efficiently identify and compare objects' contextual relationships across scenes. We focus on four key categories essential for multi-image understanding: functional, spatial, numerical, and open-ended reasoning.
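To make the task concrete, the snippet below sketches what a single multi-image QA sample with pixel-level grounding could look like. It is purely illustrative: the field names, file names, and RLE placeholder are our own shorthand and do not reflect the released M4Seg schema.

# Hypothetical example of a multi-image reasoning-segmentation sample.
# Field names and values are illustrative only, not the actual M4Seg format.
sample = {
    "images": ["scene_a.jpg", "scene_b.jpg", "scene_c.jpg"],
    "category": "functional",  # one of: functional, spatial, numerical, open-ended
    "question": "Which object across these images could be used to tighten a loose bolt?",
    "answer": "The wrench on the workbench in the third image can tighten a loose bolt, "
              "unlike the utensils shown in the first two images.",
    # Pixel-level grounding: each referenced object or part carries a segmentation
    # mask tied to a specific image in the set.
    "annotations": [
        {"image_index": 2, "phrase": "the wrench on the workbench", "mask_rle": "..."},
    ],
}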

📝 Abstract

Despite significant advancements in the capabilities of Large Vision-Language Models (LVLMs), existing pixel-grounding models operate in single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation alongside PRIMA, an LVLM that integrates pixel-level grounding with robust multi-image reasoning to produce contextually rich, pixel-grounded explanations. Central to PRIMA is SQuARE, a vision module that injects cross-image relational context into compact query-based visual tokens before fusing them with the language backbone. To support training and evaluation, we curate M4Seg, a new multi-image reasoning segmentation benchmark consisting of ~744K question-answer pairs that require fine-grained visual understanding across multiple images. PRIMA outperforms state-of-the-art baselines with 7.83% and 11.25% improvements in Recall and S-IoU, respectively. Ablation studies further demonstrate the effectiveness of the proposed SQuARE module in capturing cross-image relationships.

💡 Contributions

  • Novel Task. We propose the novel task of multi-image pixel-grounded reasoning segmentation, which requires fine-grained comparison and contextual understanding across multiple images at the pixel level, together with natural language reasoning.

  • New Benchmark. We curate M4Seg, a new challenging benchmark with ~744K multi-image QA pairs, annotated with multiple object and part segmentation masks to train and evaluate multi-image pixel-grounding models.

  • New Multi-Image Pixel-Grounded LVLM. We propose PRIMA, an LVLM designed to perform instruction-guided cross-image alignment of relevant visual features via a novel SQuARE module, enabling reasoning with contextually grounded segmentation masks across multiple images. Experiments demonstrate that PRIMA outperforms strong baselines on both segmentation metrics (+8.11% mIoU and +7.83% Recall) and text-based metrics (+6.45% Semantic Similarity and +11.25% S-IoU).

PRIMA Architecture


Overview of the proposed PRIMA architecture. Leveraging a LoRA-finetuned language model, a novel SQuARE vision encoder, and a SAM-based decoder, PRIMA dynamically generates segmentation masks corresponding to objects referenced in natural language queries, supporting pixel-grounded reasoning in complex multi-image tasks.
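As a rough illustration of this data flow, the sketch below wires up tiny stand-in modules in the same order: projected visual tokens and instruction embeddings enter the language model, and the hidden state of a segmentation-style token prompts a mask decoder. The module names, dimensions, and the LISA/GLaMM-style segmentation-token mechanism are assumptions made for illustration, not PRIMA's actual implementation.

import torch
import torch.nn as nn

class PRIMAFlowSketch(nn.Module):
    """Schematic stand-in for the PRIMA flow: visual tokens -> LoRA-tuned LM -> SAM-style decoder."""

    def __init__(self, vis_dim=768, lm_dim=4096, seg_dim=256, mask_hw=64):
        super().__init__()
        self.projector = nn.Linear(vis_dim, lm_dim)                 # visual tokens -> LM embedding space
        self.language_model = nn.Identity()                         # stand-in for the LoRA-finetuned LLM
        self.seg_head = nn.Linear(lm_dim, seg_dim)                  # segmentation-token state -> decoder prompt
        self.mask_decoder = nn.Linear(seg_dim, mask_hw * mask_hw)   # stand-in for the SAM-based mask decoder
        self.mask_hw = mask_hw

    def forward(self, visual_tokens, text_embeds):
        # visual_tokens: (B, Q, vis_dim) compact tokens from the SQuARE vision encoder.
        # text_embeds:   (B, T, lm_dim) embedded natural-language query tokens.
        lm_input = torch.cat([self.projector(visual_tokens), text_embeds], dim=1)
        hidden = self.language_model(lm_input)                      # (B, Q+T, lm_dim)
        # Assume the final position acts as a segmentation token whose hidden
        # state conditions the mask decoder (LISA/GLaMM-style grounding).
        seg_prompt = self.seg_head(hidden[:, -1])
        mask_logits = self.mask_decoder(seg_prompt).view(-1, self.mask_hw, self.mask_hw)
        return torch.sigmoid(mask_logits)                           # (B, mask_hw, mask_hw)

masks = PRIMAFlowSketch()(torch.randn(1, 32, 768), torch.randn(1, 16, 4096))
print(masks.shape)  # torch.Size([1, 64, 64])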

SQuARE Module


Our proposed SQuARE module. Learnable relational queries attend over the concatenated multi-image features to form a shared relational representation. This representation is injected into the query pathway for global feature extraction, producing enriched visual representations that capture cross-image interactions.
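A minimal sketch of this idea, assuming standard multi-head cross-attention and an additive injection of the relational representation into the query pathway (the real module may implement both steps differently):

import torch
import torch.nn as nn

class SQuARESketch(nn.Module):
    """Minimal sketch of a SQuARE-style relational query module (not the official code)."""

    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable relational queries shared across the image set.
        self.relational_queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Cross-attention: queries attend over all patch tokens from all images.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_images, num_patches, dim) from a frozen vision encoder.
        b, n, p, d = image_feats.shape
        # Concatenate patch tokens from all images into one sequence per sample.
        kv = image_feats.reshape(b, n * p, d)
        q = self.relational_queries.unsqueeze(0).expand(b, -1, -1)
        # Shared relational representation capturing cross-image interactions.
        relational, _ = self.cross_attn(q, kv, kv)
        # Inject the relational context into the query pathway (additively here).
        return self.norm(q + relational)  # (batch, num_queries, dim) enriched visual tokens

tokens = SQuARESketch()(torch.randn(2, 4, 196, 768))  # 2 samples, 4 images, 14x14 patches each
print(tokens.shape)  # torch.Size([2, 32, 768])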

📊 Quantitative Results


Experimental Results on M4Seg. PRIMA significantly outperforms both general-purpose and pixel-grounding LVLM baselines. The general-purpose LVLMs struggle with this task, a limitation we attribute to their lack of task-specific training and to cascading errors from applying SAM to their text-based grounding outputs. While Gemini-2.5 Pro is the top performer in this category, its performance remains limited, at 30.21% mIoU and 68.51% SS. The reasoning segmentation baselines perform better, as they leverage pixel-grounding capabilities and are finetuned on M4Seg. GLaMM, for instance, outperforms the general-purpose LVLMs with 38.12% mIoU and 74.05% SS. Compared to all baselines, PRIMA sets a new benchmark, surpassing the next best baseline by 8.11% and 12.33% in terms of mIoU and I-SS, respectively. PRIMA also achieves gains of up to 6.45% and 11.25% on the text metrics SS and S-IoU.
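For reference, the snippet below sketches the segmentation-side metrics in their textbook form, assuming binary masks, a one-to-one pairing of predictions with ground truth, and a 0.5 IoU threshold for Recall; the exact matching protocol and the text-side metrics (SS, S-IoU) used for M4Seg may differ.

import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def mean_iou(preds, gts):
    """mIoU over paired (prediction, ground-truth) masks."""
    return sum(mask_iou(p, g) for p, g in zip(preds, gts)) / max(len(gts), 1)

def recall_at(preds, gts, thresh=0.5):
    """Fraction of ground-truth masks matched by a prediction with IoU >= thresh."""
    return sum(mask_iou(p, g) >= thresh for p, g in zip(preds, gts)) / max(len(gts), 1)

gt = np.zeros((64, 64), dtype=bool); gt[16:48, 16:48] = True
pred = np.zeros((64, 64), dtype=bool); pred[20:48, 16:48] = True
print(round(mean_iou([pred], [gt]), 3), recall_at([pred], [gt]))  # 0.875 1.0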

🔎 Qualitative Examples

Qualitative Results. PRIMA exhibits strong qualitative performance in both segmentation and reasoning. Each example shows the posed question, the textual responses from each model, and the corresponding segmentation masks alongside the ground-truth reference. For clarity, we use consistent colors between the segmentation masks and the highlighted text spans referring to each object or part. The results indicate that PRIMA produces high-quality textual responses together with crisp, well-localized segmentation masks, whereas GLaMM often yields noisy masks, attends to incorrect objects (e.g., treating the cabinet as a surface for placing small items), and hallucinates categories (e.g., misidentifying the horse as a truck or a dog). We also include a failure case in which PRIMA predicts the correct number of objects but fails to precisely distinguish individual instances, instead producing overlapping masks.

🔎 PRIMA in the Wild

PRIMA's performance on unseen images found on the web. Notably, PRIMA generates superior segmentation masks compared to the baselines, as exemplified by the fine-grained details of the hammers in Example 3 and the dining chairs in Example 4. Moreover, the baselines are more prone to hallucinations, e.g., "pizza is shown with multiple slices" in the first example (GLaMM) and "watch" in the second example (both GLaMM and LISA). Beyond visual recognition, these examples underscore PRIMA's ability to retain compositional reasoning when transferring to diverse and noisy real-world images.

BibTeX

@article{wahed2024prima,
  title={PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation},
  author={Wahed, Muntasir and Nguyen, Kiet A and Juvekar, Adheesh Sunil and Li, Xinzhuo and Zhou, Xiaona and Shah, Vedant and Yu, Tianjiao and Yanardag, Pinar and Lourentzou, Ismini},
  journal={arXiv preprint arXiv:2412.15209},
  year={2024}
}