Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate in single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by 25.3%. To support training and evaluation, we curate M4Seg, a new reasoning segmentation benchmark consisting of ∼224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate PRIMA outperforms state-of-the-art baselines.
PRIMA integrates a multi-image vision encoder that combines DINOv2 for dense semantic feature extraction and Q-Former's selective query-based cross-attention to fuse relevant representations across images. The encoder outputs are mapped to a shared semantic space to facilitate precise pixel-level multi-image grounding. Leveraging a LoRA-finetuned language model and a SAM-based decoder, PRIMA dynamically generates segmentation masks corresponding to objects and parts referenced in natural language queries, supporting pixel-grounded reasoning in complex multi-image tasks.
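The query-based fusion described above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the module name, dimensions, and query count are assumptions; it shows the general pattern of a small set of learnable queries cross-attending to dense per-image features (e.g., DINOv2 patch tokens from multiple images) and projecting the result into a shared semantic space for the language model.

```python
import torch
import torch.nn as nn

class MultiImageQueryFusion(nn.Module):
    """Hypothetical sketch of Q-Former-style fusion: learnable queries
    cross-attend to dense patch features pooled from multiple images,
    then project the fused tokens into a shared semantic space."""

    def __init__(self, feat_dim=768, num_queries=32, lm_dim=1024, num_heads=8):
        super().__init__()
        # Learnable query tokens shared across all inputs
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        # Queries attend over the concatenated multi-image features
        self.cross_attn = nn.MultiheadAttention(
            feat_dim, num_heads, batch_first=True
        )
        # Map fused tokens into the language model's embedding space
        self.proj = nn.Linear(feat_dim, lm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_images * num_patches, feat_dim),
        # i.e., dense features from all images concatenated along the
        # token dimension
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(fused)  # (batch, num_queries, lm_dim)

# Usage: 2 images with 196 patch tokens each (14x14 grid)
feats = torch.randn(1, 2 * 196, 768)
out = MultiImageQueryFusion()(feats)
print(out.shape)  # torch.Size([1, 32, 1024])
```

Because the number of query tokens is fixed regardless of how many images are provided, this design keeps the token budget passed to the language model constant, which is consistent with the efficiency gains the abstract reports.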
Experimental Results on M4Seg. We report performance metrics for segmentation (mIoU and Recall) and reasoning (Semantic Similarity and S-IoU) to evaluate each model's ability in multi-image pixel-grounded reasoning segmentation. Computational efficiency metrics (TFLOPs and #samples/sec.) showcase PRIMA's optimized processing for multi-image tasks.
@article{wahed2024prima,
title={PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation},
author={Wahed, Muntasir and Nguyen, Kiet A and Juvekar, Adheesh Sunil and Li, Xinzhuo and Zhou, Xiaona and Shah, Vedant and Yu, Tianjiao and Yanardag, Pinar and Lourentzou, Ismini},
journal={arXiv preprint arXiv:2412.15209},
year={2024}
}