Recent advances in Large Vision-Language Models (LVLMs) have sparked significant progress in general-purpose vision tasks through visual instruction tuning. While some works have demonstrated the capability of LVLMs to generate segmentation masks that correspond to natural language phrases in a single image, they struggle with segmentation-grounded comparisons across multiple images, particularly at finer granularities such as object parts. In this paper, we introduce part-focused semantic co-segmentation, a new task that seeks to identify and segment common and unique objects and parts across multiple images.
To address this task, we present Calico, the first LVLM capable of segmenting and reasoning over multiple masks across images, enabling object comparison based on their constituent parts. Calico features two novel components: a Correspondence Extraction Module (CEM), which captures semantically rich information to identify part-level correspondences between objects, and a Correspondence Adaptation Module (CAM), which embeds this information into the LLM and facilitates multi-image understanding in a parameter-efficient manner. To support training and evaluation, we curate MixedParts, a comprehensive multi-image segmentation dataset containing ~2.4M samples across ~44K images with diverse object and part categories. Experimental results show that Calico, with only 0.3% of its parameters finetuned, achieves robust performance on part-focused semantic co-segmentation.
Calico uses a Q-Former cross-attention module to query compact image embeddings from a pretrained image encoder; these embeddings are passed into a Vicuna-based LLM as image features. [SEG] tokens extracted from the output text are used to prompt a SAM decoder, which outputs the corresponding segmentation masks.
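The flow above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation: the query count, embedding dimension, and module names (`QFormerLite`) are assumptions, and the LLM/SAM stages are only indicated in comments.

```python
import torch
import torch.nn as nn

class QFormerLite(nn.Module):
    """Learned queries cross-attend to frozen image-encoder patch features,
    producing a small, fixed-size set of image tokens for the LLM."""
    def __init__(self, num_queries: int = 32, dim: int = 256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:  # (B, P, dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, patch_feats, patch_feats)
        return out                                                 # (B, num_queries, dim)

B, P, D = 2, 196, 256
patch_feats = torch.randn(B, P, D)            # stand-in for frozen encoder output
img_tokens = QFormerLite(dim=D)(patch_feats)  # compact image features for the LLM

# Downstream (not shown): img_tokens are fed to the LLM as image features; the
# hidden state at each generated [SEG] token is projected and used to prompt
# the SAM decoder, which emits the corresponding segmentation mask.
print(img_tokens.shape)  # torch.Size([2, 32, 256])
```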
We propose two modules, the Correspondence Extraction Module (CEM) and the Correspondence Adaptation Module (CAM), to enable the learning of semantically rich features for multi-image correspondence. In Calico, k CAMs are strategically placed every N/k layers within the N-layer LLM. CEM focuses on extracting fine-grained semantic information at the part level, capturing correspondences across similar yet distinct object categories by leveraging self-supervised DINO features. CAMs then reintegrate this part-level correspondence information back into the next layer of the model.
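Two ingredients of this design can be illustrated concretely: where the k CAMs sit inside an N-layer LLM, and how part-level correspondences might be scored from DINO-style features. The layer-indexing convention and the cosine-similarity matching below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cam_layer_indices(n_layers: int, k: int) -> list[int]:
    """Place a CAM after every n_layers // k transformer layers (0-indexed)."""
    step = n_layers // k
    return [step * (i + 1) - 1 for i in range(k)]

def part_correspondences(feats_a: np.ndarray, feats_b: np.ndarray) -> np.ndarray:
    """Match each part in image A to its most similar part in image B via
    cosine similarity of (stand-in) self-supervised part features."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return (a @ b.T).argmax(axis=1)   # index of best-matching part in B

print(cam_layer_indices(n_layers=32, k=4))  # [7, 15, 23, 31]
rng = np.random.default_rng(0)
matches = part_correspondences(rng.normal(size=(5, 64)), rng.normal(size=(6, 64)))
print(matches.shape)  # (5,)
```

For a 32-layer Vicuna backbone with k=4, this places adaptation modules after layers 7, 15, 23, and 31, so correspondence features are injected at regular depths rather than only at the input.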
Although multi-image datasets of various scales are available, each suffers from a combination of limitations that makes it unsuitable for part-focused semantic co-segmentation: the absence of fine-grained masks for segmentation, localized labels confined to datasets too small or domain-specific to support generalizable LVLM training, or the lack of part-level information altogether.
To address these challenges and enable effective training and evaluation of our part-focused semantic co-segmentation model, we introduce a novel dataset named MixedParts, curated from publicly available part segmentation datasets: ADE20K-Part-234, PACO, and PartImageNet.
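The core labeling step for an image pair can be sketched as a set partition over each image's part annotations. The per-image annotation format below (a set of part-label strings) is an assumption for illustration; the actual curation pipeline also handles masks and object categories.

```python
def split_parts(parts_a: set[str], parts_b: set[str]) -> dict[str, set[str]]:
    """Partition two images' part labels into common and per-image unique sets,
    as needed for common-part and unique-part co-segmentation samples."""
    return {
        "common": parts_a & parts_b,     # parts present in both images
        "unique_a": parts_a - parts_b,   # parts only in image A
        "unique_b": parts_b - parts_a,   # parts only in image B
    }

# Toy example: a snake-like vs. dog-like part inventory.
pair = split_parts({"body", "head", "tail"}, {"body", "head", "leg"})
print(sorted(pair["common"]))  # ['body', 'head']
```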
Example image pairs in MixedParts with objects, common parts, and unique parts segmented and labeled. Each column shows a different image pair, drawn from diverse source datasets with varying levels of detail (PACO, PartImageNet, and ADE20K-Part-234) and covering both rigid and non-rigid objects and parts. Each image pair is displayed across three rows illustrating (i) the (possibly common) object, (ii) the common object parts, and (iii) the unique object parts in each pair.
Experimental Results on MixedParts. The first three metrics are segmentation-based, while the last two are text-based. Calico outperforms baselines across all metrics.
Example Calico responses:
- "The common object is the person."
- "The object present in the images is the car."
- "The images show a snake and a dog. The common parts in both objects are the body and the head."
- "The images contain a bed and a table. The unique parts between the objects are a headboard, a top, and a leg."
@inproceedings{nguyen2025calico,
  title={CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models},
  author={Nguyen, Kiet A. and Juvekar, Adheesh and Yu, Tianjiao and Wahed, Muntasir and Lourentzou, Ismini},
  booktitle={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}