Recent advances in Large Vision-Language Models (LVLMs) have enabled general-purpose vision tasks through visual instruction tuning. While existing LVLMs can generate segmentation masks from text prompts for single images, they struggle with segmentation-grounded reasoning across images, especially at finer granularities such as object parts. In this paper, we introduce the new task of part-focused semantic co-segmentation, which involves identifying and segmenting common objects, as well as common and unique object parts across images.
To address this task, we present Calico, the first LVLM designed for multi-image part-level reasoning segmentation. Calico features two proposed components: a novel Correspondence Extraction Module that identifies semantic part-level correspondences, and Correspondence Adaptation Modules that embed this information into the LVLM to facilitate multi-image understanding in a parameter-efficient manner. To support training and evaluation, we curate MixedParts, a large-scale multi-image segmentation dataset containing ~2.4M samples across ~44K images spanning diverse object and part categories. With just 0.3% of its parameters finetuned, Calico achieves strong performance on this challenging task.
Calico uses a Q-Former cross-attention module to query efficient image embeddings from a pretrained image encoder, which are passed into a Vicuna-based LLM as image features. We extract [SEG] tokens from the output text, which are used to prompt a SAM decoder to output the corresponding segmentation masks.
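The following minimal PyTorch-style sketch illustrates this flow under stated assumptions: the submodules (image encoder, Q-Former, Vicuna-based LLM, SAM decoder) are passed in as stand-ins, and all names, shapes, and interfaces are illustrative placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

class CalicoSketch(nn.Module):
    """Illustrative sketch of the CALICO pipeline: image encoder -> Q-Former ->
    Vicuna-based LLM -> [SEG] token embeddings -> SAM mask decoder.
    All submodules here are assumed stand-ins, not the released code."""

    def __init__(self, image_encoder, q_former, llm, sam_decoder, seg_token_id):
        super().__init__()
        self.image_encoder = image_encoder   # frozen pretrained vision backbone
        self.q_former = q_former             # cross-attention module producing compact image queries
        self.llm = llm                       # Vicuna-based LLM accepting image features + text
        self.sam_decoder = sam_decoder       # promptable SAM mask decoder
        self.seg_token_id = seg_token_id     # vocabulary id of the special [SEG] token

    def forward(self, images, input_ids):
        # 1. Encode images and compress them into a small set of query embeddings.
        vis_feats = self.image_encoder(images)        # (B, num_patches, D)
        img_queries = self.q_former(vis_feats)        # (B, num_queries, D)

        # 2. Run the LLM on interleaved image features and text tokens.
        out_ids, hidden = self.llm(image_feats=img_queries, input_ids=input_ids)

        # 3. Gather the hidden states of generated [SEG] tokens.
        seg_positions = out_ids == self.seg_token_id  # (B, T) boolean
        seg_embeds = hidden[seg_positions]            # (num_seg, D)

        # 4. Prompt the SAM decoder with each [SEG] embedding to get a mask.
        masks = self.sam_decoder(vis_feats, prompt_embeds=seg_embeds)
        return out_ids, masks
```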
We propose two modules, the Correspondence Extraction Module (CEM) and the Correspondence Adaptation Module (CAM), to enable the learning of semantic-rich features for multi-image correspondence. In Calico, k CAMs are strategically placed every N/k layers within the N-layer LLM. The CEM focuses on extracting fine-grained semantic information at the part level, capturing correspondences across similar yet distinct object categories by leveraging self-supervised DINO features. The CAMs then reintegrate this part-level correspondence information back into the next layer of the model.
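As a rough sketch of these two ideas, the code below matches DINO patch features across two images to form a soft part-correspondence map (a stand-in for the CEM) and injects it into LLM hidden states through a small bottleneck adapter (a stand-in for a CAM). Dimensions, projections, and the exact placement rule are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrespondenceExtractionModule(nn.Module):
    """Sketch of a CEM: match part-level DINO features across two images via
    normalized patch-to-patch similarity (an assumption, not the released design)."""

    def __init__(self, dino_dim, llm_dim):
        super().__init__()
        self.proj = nn.Linear(dino_dim, llm_dim)

    def forward(self, dino_feats_a, dino_feats_b):
        # dino_feats_*: (B, num_patches, dino_dim) self-supervised DINO features
        a = F.normalize(dino_feats_a, dim=-1)
        b = F.normalize(dino_feats_b, dim=-1)
        # Dense patch-to-patch similarity acts as a soft part correspondence map.
        sim = torch.einsum("bnd,bmd->bnm", a, b)      # (B, Na, Nb)
        attn = sim.softmax(dim=-1)
        # For each patch in image A, aggregate its best-matching content in image B.
        corr = attn @ dino_feats_b                    # (B, Na, dino_dim)
        return self.proj(corr)                        # project into the LLM hidden size


class CorrespondenceAdaptationModule(nn.Module):
    """Sketch of a CAM: a lightweight residual adapter that injects correspondence
    features back into the LLM hidden states at a chosen layer."""

    def __init__(self, llm_dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(llm_dim, bottleneck)
        self.up = nn.Linear(bottleneck, llm_dim)

    def forward(self, hidden_states, corr_feats):
        # Pool the correspondence map and add it as a residual adapter signal.
        pooled = corr_feats.mean(dim=1, keepdim=True)  # (B, 1, llm_dim)
        return hidden_states + self.up(F.relu(self.down(hidden_states + pooled)))


# CAM placement: with an N-layer LLM and k CAMs, one CAM every N // k layers
# (the exact layer indexing is an assumption about how "every N/k layers" is realized).
N, k = 32, 4
cam_layer_indices = [i for i in range(N) if (i + 1) % (N // k) == 0]  # e.g. [7, 15, 23, 31]
```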
To support training and evaluation for part-focused co-segmentation, we introduce a novel dataset named MixedParts, curated from publicly available part segmentation datasets: ADE20K-Part-234, PACO, and PartImageNet.
Example image pairs in MixedParts with objects, common parts, and unique parts segmented and labeled. Each column represents a different image pair, drawn from diverse datasets with varying levels of detail (PACO, PartImageNet, and ADE20K-Part-234), covering both rigid and non-rigid objects and parts. Each image pair is displayed across 3 rows to illustrate (i) the (possibly common) object, (ii) the common object parts, and (iii) the unique object parts in each pair.
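As a rough illustration of how such pairs can be organized, the sketch below takes two annotated images and splits their labels into common objects, common parts, and unique parts, mirroring the three rows shown for each MixedParts pair. The data structures, the name-matching heuristic, and the function itself are hypothetical, not the actual curation pipeline.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedImage:
    """Hypothetical per-image annotation: object categories and part labels,
    each backed by a segmentation mask (masks omitted here for brevity)."""
    image_id: str
    objects: set[str]   # e.g. {"dog"}
    parts: set[str]     # e.g. {"dog head", "dog body", "dog leg"}

def build_pair(img_a: AnnotatedImage, img_b: AnnotatedImage) -> dict:
    """Split a candidate image pair into common objects, common parts, and unique parts."""
    # Compare parts by their part name so "dog head" and "snake head" both match on "head".
    part_names_a = {p.split()[-1] for p in img_a.parts}
    part_names_b = {p.split()[-1] for p in img_b.parts}
    return {
        "pair": (img_a.image_id, img_b.image_id),
        "common_objects": img_a.objects & img_b.objects,
        "common_parts": part_names_a & part_names_b,
        "unique_parts": part_names_a ^ part_names_b,  # parts present in only one image
    }
```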
Experimental Results on MixedParts. The first three metrics are segmentation-based, while the last two are text-based. Calico outperforms baselines across all metrics.
Calico demonstrates pixel-grounded understanding of various non-rigid (e.g., humans, animals) and rigid objects (e.g., car, bed), including less common objects and their parts (e.g., the bed and pocket of a pool table):

"The common object is the person."
"The images include a car."
"The images show a snake and a dog. The detected common parts are a body and a head."
"The images show a bed and a pool table. The unique parts present are a headboard, a bed, and a pocket."
Calico outputs are highly context-driven when distinguishing objects across images, despite variations in angle, size, saliency, etc. Different image pairings prompt the model to segment different objects accordingly, rather than defaulting to the most salient object in each image:

"The images show a dog."
"The images include a car."
@inproceedings{nguyen2025calico,
  title={CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models},
  author={Nguyen, Kiet A. and Juvekar, Adheesh and Yu, Tianjiao and Wahed, Muntasir and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}