Recent advances in Large Vision-Language Models (LVLMs) have sparked significant progress in general-purpose vision tasks through visual instruction tuning. While some works have demonstrated the capability of LVLMs to generate segmentation masks that correspond to natural language phrases in a single image, they struggle with segmentation-grounded comparisons across multiple images, particularly at finer granularities such as object parts. In this paper, we introduce part-focused semantic co-segmentation, a new task that seeks to identify and segment common and unique objects and parts across multiple images.
To address this task, we present Calico, the first LVLM capable of segmenting and reasoning over multiple masks across images, enabling object comparison based on their constituent parts. Calico features two novel components: a Correspondence Extraction Module (CEM), which captures semantically rich information to identify part-level correspondences between objects, and a Correspondence Adaptation Module (CAM), which embeds this information into the LLM and facilitates multi-image understanding in a parameter-efficient manner. To support training and evaluation, we curate MixedParts, a comprehensive multi-image segmentation dataset containing ~2.4M samples across ~44K images with diverse object and part categories. Experimental results show that Calico, with only 0.3% of its parameters finetuned, achieves robust performance on part-focused semantic co-segmentation.
Calico uses a Q-Former cross-attention module to query compact image embeddings from a pretrained image encoder; these embeddings are passed into a Vicuna-based LLM as image features. [SEG] tokens extracted from the output text are used to prompt a SAM decoder, which outputs the corresponding segmentation masks.
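The flow above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation: the query count, embedding dimension, and module names (`QFormerLite`) are assumptions, and the LLM/SAM stages are only indicated in comments.

```python
import torch
import torch.nn as nn

class QFormerLite(nn.Module):
    """Learned queries cross-attend to frozen image-encoder patch features,
    producing a small, fixed-size set of image tokens for the LLM."""
    def __init__(self, num_queries: int = 32, dim: int = 256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:  # (B, P, dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, patch_feats, patch_feats)
        return out                                                 # (B, num_queries, dim)

B, P, D = 2, 196, 256
patch_feats = torch.randn(B, P, D)            # stand-in for frozen encoder output
img_tokens = QFormerLite(dim=D)(patch_feats)  # compact image features for the LLM

# Downstream (not shown): img_tokens are fed to the LLM as image features; the
# hidden state at each generated [SEG] token is projected and used to prompt
# the SAM decoder, which emits the corresponding segmentation mask.
print(img_tokens.shape)  # torch.Size([2, 32, 256])
```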
We propose two modules, the Correspondence Extraction Module (CEM) and the Correspondence Adaptation Module (CAM), to enable the learning of semantically rich features for multi-image correspondence. In Calico, k CAMs are strategically placed every N/k layers within the N-layer LLM. CEM focuses on extracting fine-grained semantic information at the part level, capturing correspondences across similar yet distinct object categories by leveraging self-supervised DINO features. CAMs then reintegrate this part-level correspondence information back into the next layer of the model.
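Two ingredients of this design can be illustrated concretely: where the k CAMs sit inside an N-layer LLM, and how part-level correspondences might be scored from DINO-style features. The layer-indexing convention and the cosine-similarity matching below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cam_layer_indices(n_layers: int, k: int) -> list[int]:
    """Place a CAM after every n_layers // k transformer layers (0-indexed)."""
    step = n_layers // k
    return [step * (i + 1) - 1 for i in range(k)]

def part_correspondences(feats_a: np.ndarray, feats_b: np.ndarray) -> np.ndarray:
    """Match each part in image A to its most similar part in image B via
    cosine similarity of (stand-in) self-supervised part features."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return (a @ b.T).argmax(axis=1)   # index of best-matching part in B

print(cam_layer_indices(n_layers=32, k=4))  # [7, 15, 23, 31]
rng = np.random.default_rng(0)
matches = part_correspondences(rng.normal(size=(5, 64)), rng.normal(size=(6, 64)))
print(matches.shape)  # (5,)
```

For a 32-layer Vicuna backbone with k=4, this places adaptation modules after layers 7, 15, 23, and 31, so correspondence features are injected at regular depths rather than only at the input.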
Although multi-image datasets of various scales are available, each suffers from a combination of limitations that makes it unsuitable for part-focused semantic co-segmentation: the absence of fine-grained masks for segmentation, localized labels confined to datasets too small or domain-specific to support generalizable LVLM training, or the lack of part-level information altogether.
To address these challenges and enable effective training and evaluation of our part-focused semantic co-segmentation model, we introduce a novel dataset named MixedParts, curated from publicly available part segmentation datasets: ADE20K-Part-234, PACO, and PartImageNet.
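The core labeling step for an image pair can be sketched as a set partition over each image's part annotations. The per-image annotation format below (a set of part-label strings) is an assumption for illustration; the actual curation pipeline also handles masks and object categories.

```python
def split_parts(parts_a: set[str], parts_b: set[str]) -> dict[str, set[str]]:
    """Partition two images' part labels into common and per-image unique sets,
    as needed for common-part and unique-part co-segmentation samples."""
    return {
        "common": parts_a & parts_b,     # parts present in both images
        "unique_a": parts_a - parts_b,   # parts only in image A
        "unique_b": parts_b - parts_a,   # parts only in image B
    }

# Toy example: a snake-like vs. dog-like part inventory.
pair = split_parts({"body", "head", "tail"}, {"body", "head", "leg"})
print(sorted(pair["common"]))  # ['body', 'head']
```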
Example image pairs in MixedParts with objects, common parts, and unique parts segmented and labeled. Each column shows a different image pair, drawn from diverse source datasets with varying levels of detail (PACO, PartImageNet, and ADE20K-Part-234) and covering both rigid and non-rigid objects and parts. Each image pair is displayed across three rows illustrating (i) the (possibly common) object, (ii) the common object parts, and (iii) the unique object parts in each pair.
Experimental Results on MixedParts. The first three metrics are segmentation-based, while the last two are text-based. Calico outperforms baselines across all metrics.
Example Calico responses:
- "The common object is the person."
- "The object present in the images is the car."
- "The images show a snake and a dog. The common parts in both objects are the body and the head."
- "The images contain a bed and a table. The unique parts between the objects are a headboard, a top, and a leg."
@inproceedings{nguyen2025calico,
  title={CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models},
  author={Nguyen, Kiet A. and Juvekar, Adheesh and Yu, Tianjiao and Wahed, Muntasir and Lourentzou, Ismini},
  booktitle={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}