PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

1University of Illinois Urbana-Champaign, 2Virginia Tech
* Equal Contribution
TL;DR: We introduce the task of multi-image pixel-grounded reasoning segmentation, release the M4Seg benchmark consisting of ∼224K question-answer pairs that require fine-grained visual understanding across multiple images, and present PRIMA, an LVLM that combines pixel-level grounding with multi-image reasoning capabilities.
Our proposed multi-image pixel-grounded reasoning segmentation task operates at both object and part levels: the goal is to achieve fine-grained comparison and contextual understanding across multiple images at the pixel level, producing responses grounded in specific objects and parts.

๐Ÿ“ Abstract

Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate in single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by 25.3%. To support training and evaluation, we curate M4Seg, a new reasoning segmentation benchmark consisting of ∼224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate PRIMA outperforms state-of-the-art baselines.

💡 Contributions

  • Novel Task. We propose the novel task of multi-image pixel-grounded reasoning segmentation, which necessitates fine-grained comparison and contextual understanding across multiple images at the pixel level, requiring models to produce responses grounded in specific objects and parts.

  • New Benchmark. We introduce M4Seg, a new challenging benchmark with over 224K multi-image QA pairs, annotated with multiple object and part segmentation masks to enable and evaluate pixel-grounded multi-image visual understanding.

  • New Multi-Image Pixel-Grounded LVLM. We propose PRIMA, a vision-language model specifically designed for this new task. Unlike existing models, PRIMA excels at generating natural language responses accompanied by contextually grounded segmentations across multiple images. PRIMA is optimized for computational efficiency by incorporating a cross-modal attention mechanism, which enables instruction-guided alignment of relevant visual features across images, reducing overhead while maintaining high accuracy in pixel-level reasoning. Extensive experiments demonstrate PRIMA's performance and efficiency advantages over strong baselines.

PRIMA Architecture

PRIMA Model Architecture.

PRIMA integrates a multi-image vision encoder that combines DINOv2 for dense semantic feature extraction and Q-Former's selective query-based cross-attention to fuse relevant representations across images. The encoder outputs are mapped to a shared semantic space to facilitate precise pixel-level multi-image grounding. Leveraging a LoRA-finetuned language model and a SAM-based decoder, PRIMA dynamically generates segmentation masks corresponding to objects and parts referenced in natural language queries, supporting pixel-grounded reasoning in complex multi-image tasks.
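To make the fusion step concrete, here is a minimal PyTorch sketch of a Q-Former-style module in which a small set of learnable queries cross-attends to dense patch features concatenated across images, pooling them into a compact representation. This is an illustrative sketch, not the authors' implementation: the class name, feature dimension, number of queries, and the use of `nn.MultiheadAttention` are all assumptions for exposition.

```python
import torch
import torch.nn as nn

class MultiImageQueryFusion(nn.Module):
    """Hypothetical sketch: learnable queries cross-attend to dense
    per-patch features (e.g. from DINOv2) concatenated across several
    images, producing a fixed-size fused representation."""

    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        # A fixed set of learnable query vectors, shared across inputs
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Projection into a shared semantic space for downstream grounding
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_images * num_patches, dim) --
        # dense patch features from all images, concatenated along
        # the sequence dimension so queries can attend across images
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(fused)  # (batch, num_queries, dim)

# Toy usage: 2 images, 196 patches each, feature dim 256
feats = torch.randn(1, 2 * 196, 256)
module = MultiImageQueryFusion()
out = module(feats)
print(out.shape)  # torch.Size([1, 32, 256])
```

Because the output size depends only on the number of queries, not on the number of input images, a design like this keeps the token budget passed to the language model constant as more images are added, which is consistent with the compute savings the paper reports.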

📊 Quantitative Results

PRIMA results.

Experimental Results on M4Seg. We report performance metrics for segmentation (mIoU and Recall) and reasoning (Semantic Similarity and S-IoU) to evaluate each model's ability in multi-image pixel-grounded reasoning segmentation. Computational efficiency metrics (TFLOPs and #samples/sec.) showcase PRIMA's optimized processing for multi-image tasks.

🔎 PRIMA in the Wild

PRIMA in the wild results.
PRIMA's performance on unseen images found on the web. For conciseness, we only visualize relevant segmentation masks.

BibTeX

@article{wahed2024prima,
  title={PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation},
  author={Wahed, Muntasir and Nguyen, Kiet A and Juvekar, Adheesh Sunil and Li, Xinzhuo and Zhou, Xiaona and Shah, Vedant and Yu, Tianjiao and Yanardag, Pinar and Lourentzou, Ismini},
  journal={arXiv preprint arXiv:2412.15209},
  year={2024}
}