PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

1University of Illinois Urbana-Champaign, 2Virginia Tech
* Equal Contribution
TL;DR: We introduce the task of multi-image pixel-grounded reasoning segmentation, along with M4Seg, a benchmark of ~744K question-answer pairs that require fine-grained visual understanding across multiple images, and PRIMA, an LVLM that combines pixel-level grounding with multi-image reasoning capabilities.
We introduce the new task of multi-image pixel-grounded reasoning segmentation. To support this task, we curate M4Seg, a benchmark providing question-answer (QA) pairs alongside image sets with pixel-level annotations. Additionally, we propose PRIMA, a model designed to efficiently identify and compare objects' contextual relationships across scenes. We focus on four key categories essential for multi-image understanding: functional, spatial, numerical, and open-ended reasoning.
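To make the task concrete, the snippet below sketches what a single multi-image QA sample with pixel-level grounding could look like. It is purely illustrative: the field names, file names, and RLE placeholder are our own shorthand and do not reflect the released M4Seg schema.

# Hypothetical example of a multi-image reasoning-segmentation sample.
# Field names and values are illustrative only, not the actual M4Seg format.
sample = {
    "images": ["scene_a.jpg", "scene_b.jpg", "scene_c.jpg"],
    "category": "functional",  # one of: functional, spatial, numerical, open-ended
    "question": "Which object across these images could be used to tighten a loose bolt?",
    "answer": "The wrench on the workbench in the third image can tighten a loose bolt, "
              "unlike the utensils shown in the first two images.",
    # Pixel-level grounding: each referenced object or part carries a segmentation
    # mask tied to a specific image in the set.
    "annotations": [
        {"image_index": 2, "phrase": "the wrench on the workbench", "mask_rle": "..."},
    ],
}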

📝 Abstract

Despite significant advancements in the capabilities of Large Vision-Language Models (LVLMs), existing pixel-grounding models operate in single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation alongside PRIMA, an LVLM that integrates pixel-level grounding with robust multi-image reasoning to produce contextually rich, pixel-grounded explanations. Central to PRIMA is SQuARE, a vision module that injects cross-image relational context into compact query-based visual tokens before fusing them with the language backbone. To support training and evaluation, we curate M4Seg, a new multi-image reasoning segmentation benchmark consisting of ~744K question-answer pairs that require fine-grained visual understanding across multiple images. PRIMA outperforms state-of-the-art baselines with 7.83% and 11.25% improvements in Recall and S-IoU, respectively. Ablation studies further demonstrate the effectiveness of the proposed SQuARE module in capturing cross-image relationships.

💡 Contributions

  • Novel Task. We propose the novel task of multi-image pixel-grounded reasoning segmentation, which requires fine-grained comparison and contextual understanding across multiple images at the pixel level, together with natural language reasoning.

  • New Benchmark. We curate M4Seg, a new challenging benchmark with ~744K multi-image QA pairs, annotated with multiple object and part segmentation masks to train and evaluate multi-image pixel-grounding models.

  • New Multi-Image Pixel-Grounded LVLM. We propose PRIMA, an LVLM designed to perform instruction-guided cross-image alignment of relevant visual features via a novel SQuARE module, enabling reasoning with contextually grounded segmentation masks across multiple images. Experiments demonstrate that PRIMA outperforms strong baselines on both segmentation metrics (+8.11% mIoU and +7.83% Recall) and text-based metrics (+6.45% Semantic Similarity and +11.25% S-IoU).

PRIMA Architecture


Overview of the proposed PRIMA architecture. Leveraging a LoRA-finetuned language model, a novel SQuARE vision encoder, and a SAM-based decoder, PRIMA dynamically generates segmentation masks corresponding to objects referenced in natural language queries, supporting pixel-grounded reasoning in complex multi-image tasks.
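As a rough illustration of this data flow, the sketch below wires up tiny stand-in modules in the same order: projected visual tokens and instruction embeddings enter the language model, and the hidden state of a segmentation-style token prompts a mask decoder. The module names, dimensions, and the LISA/GLaMM-style segmentation-token mechanism are assumptions made for illustration, not PRIMA's actual implementation.

import torch
import torch.nn as nn

class PRIMAFlowSketch(nn.Module):
    """Schematic stand-in for the PRIMA flow: visual tokens -> LoRA-tuned LM -> SAM-style decoder."""

    def __init__(self, vis_dim=768, lm_dim=4096, seg_dim=256, mask_hw=64):
        super().__init__()
        self.projector = nn.Linear(vis_dim, lm_dim)                 # visual tokens -> LM embedding space
        self.language_model = nn.Identity()                         # stand-in for the LoRA-finetuned LLM
        self.seg_head = nn.Linear(lm_dim, seg_dim)                  # segmentation-token state -> decoder prompt
        self.mask_decoder = nn.Linear(seg_dim, mask_hw * mask_hw)   # stand-in for the SAM-based mask decoder
        self.mask_hw = mask_hw

    def forward(self, visual_tokens, text_embeds):
        # visual_tokens: (B, Q, vis_dim) compact tokens from the SQuARE vision encoder.
        # text_embeds:   (B, T, lm_dim) embedded natural-language query tokens.
        lm_input = torch.cat([self.projector(visual_tokens), text_embeds], dim=1)
        hidden = self.language_model(lm_input)                      # (B, Q+T, lm_dim)
        # Assume the final position acts as a segmentation token whose hidden
        # state conditions the mask decoder (LISA/GLaMM-style grounding).
        seg_prompt = self.seg_head(hidden[:, -1])
        mask_logits = self.mask_decoder(seg_prompt).view(-1, self.mask_hw, self.mask_hw)
        return torch.sigmoid(mask_logits)                           # (B, mask_hw, mask_hw)

masks = PRIMAFlowSketch()(torch.randn(1, 32, 768), torch.randn(1, 16, 4096))
print(masks.shape)  # torch.Size([1, 64, 64])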

SQuARE Module


Our proposed SQuARE module. Learnable relational queries attend over the concatenated multi-image features to form a shared relational representation. This representation is injected into the query pathway for global feature extraction, producing enriched visual representations that capture cross-image interactions.
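A minimal sketch of this idea, assuming standard multi-head cross-attention and an additive injection of the relational representation into the query pathway (the real module may implement both steps differently):

import torch
import torch.nn as nn

class SQuARESketch(nn.Module):
    """Minimal sketch of a SQuARE-style relational query module (not the official code)."""

    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable relational queries shared across the image set.
        self.relational_queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Cross-attention: queries attend over all patch tokens from all images.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_images, num_patches, dim) from a frozen vision encoder.
        b, n, p, d = image_feats.shape
        # Concatenate patch tokens from all images into one sequence per sample.
        kv = image_feats.reshape(b, n * p, d)
        q = self.relational_queries.unsqueeze(0).expand(b, -1, -1)
        # Shared relational representation capturing cross-image interactions.
        relational, _ = self.cross_attn(q, kv, kv)
        # Inject the relational context into the query pathway (additively here).
        return self.norm(q + relational)  # (batch, num_queries, dim) enriched visual tokens

tokens = SQuARESketch()(torch.randn(2, 4, 196, 768))  # 2 samples, 4 images, 14x14 patches each
print(tokens.shape)  # torch.Size([2, 32, 768])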

📊 Quantitative Results


Experimental Results on M4Seg. PRIMA significantly outperforms both general-purpose and pixel-grounding LVLM baselines. The general-purpose LVLMs struggle with this task, a limitation we attribute to their lack of task-specific training and to cascading errors from applying SAM to their text-based grounding outputs. While Gemini-2.5 Pro is the top performer in this category, its performance remains limited, at 30.21% mIoU and 68.51% SS. The reasoning segmentation baselines perform better, as they leverage pixel-grounding capabilities and are finetuned on M4Seg. GLaMM, for instance, outperforms the general-purpose LVLMs with 38.12% mIoU and 74.05% SS. Compared to all baselines, PRIMA sets a new benchmark, surpassing the next best baseline by 8.11% and 12.33% in terms of mIoU and I-SS, respectively. PRIMA also achieves gains of up to 6.45% and 11.25% on the text metrics SS and S-IoU.
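For reference, the snippet below sketches the segmentation-side metrics in their textbook form, assuming binary masks, a one-to-one pairing of predictions with ground truth, and a 0.5 IoU threshold for Recall; the exact matching protocol and the text-side metrics (SS, S-IoU) used for M4Seg may differ.

import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def mean_iou(preds, gts):
    """mIoU over paired (prediction, ground-truth) masks."""
    return sum(mask_iou(p, g) for p, g in zip(preds, gts)) / max(len(gts), 1)

def recall_at(preds, gts, thresh=0.5):
    """Fraction of ground-truth masks matched by a prediction with IoU >= thresh."""
    return sum(mask_iou(p, g) >= thresh for p, g in zip(preds, gts)) / max(len(gts), 1)

gt = np.zeros((64, 64), dtype=bool); gt[16:48, 16:48] = True
pred = np.zeros((64, 64), dtype=bool); pred[20:48, 16:48] = True
print(round(mean_iou([pred], [gt]), 3), recall_at([pred], [gt]))  # 0.875 1.0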

🔎 Qualitative Examples

Qualitative Results. PRIMA exhibits strong qualitative performance in both segmentation and reasoning. Each example shows the posed question, the textual responses from each model, and the corresponding segmentation masks alongside the ground-truth reference. For clarity, we use consistent colors between the segmentation masks and the highlighted text spans referring to each object or part. The results indicate that PRIMA produces high-quality textual responses together with crisp, well-localized segmentation masks, whereas GLaMM often yields noisy masks, attends to incorrect objects (e.g., treating the cabinet as a surface for placing small items), and hallucinates categories (e.g., misidentifying the horse as a truck or a dog). We also include a failure case in which PRIMA predicts the correct number of objects but fails to precisely distinguish individual instances, instead producing overlapping masks.

🔎 PRIMA in the Wild

PRIMA's performance on unseen images found on the web. Notably, PRIMA generates superior segmentation masks compared to the baselines, as exemplified by the fine-grained details of the hammers in Example 3 and the dining chairs in Example 4. Moreover, the baselines are more prone to hallucinations, e.g., "pizza is shown with multiple slices" in the first example (GLaMM) and "watch" in the second example (both GLaMM and LISA). Beyond visual recognition, these examples underscore PRIMA's ability to retain compositional reasoning when transferring to diverse and noisy real-world images.

BibTeX

@article{wahed2024prima,
  title={PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation},
  author={Wahed, Muntasir and Nguyen, Kiet A and Juvekar, Adheesh Sunil and Li, Xinzhuo and Zhou, Xiaona and Shah, Vedant and Yu, Tianjiao and Yanardag, Pinar and Lourentzou, Ismini},
  journal={arXiv preprint arXiv:2412.15209},
  year={2024}
}