Calico: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models

Accepted to CVPR 2025

PLAN Lab, University of Illinois Urbana-Champaign
TL;DR: We present Calico, the first LVLM designed for part-focused semantic co-segmentation, a new task that identifies and segments common and unique object parts across multiple images. Trained on MixedParts, our new dataset with ~2.4M samples across ~44K images, Calico achieves strong performance in this domain with just 0.3% of its architecture finetuned.
Our proposed part-focused semantic co-segmentation task, where the goal is to identify, segment, and label common objects, as well as common and unique object parts across multiple images.

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have sparked significant progress in general-purpose vision tasks through visual instruction tuning. While some works have demonstrated the capability of LVLMs to generate segmentation masks that align phrases with natural language descriptions in a single image, they struggle with segmentation-grounded comparisons across multiple images, particularly at finer granularities such as object parts. In this paper, we introduce the new task of part-focused semantic co-segmentation, which seeks to identify and segment common and unique objects and parts across multiple images.

To address this task, we present Calico, the first LVLM that can segment and reason over multiple masks across images, enabling object comparison based on their constituent parts. Calico features two proposed components: a novel Correspondence Extraction Module, which captures semantic-rich information to identify part-level correspondences between objects, and a Correspondence Adaptation Module, which embeds this information into the LLM and facilitates multi-image understanding in a parameter-efficient manner. To support training and evaluation, we curate MixedParts, a comprehensive multi-image segmentation dataset containing ~2.4M samples across ~44K images with diverse object and part categories. Experimental results show that Calico, with only 0.3% of its architecture finetuned, achieves robust performance in part-focused semantic co-segmentation.

✅ Contributions

  • Novel Task. We introduce the novel task of part-focused semantic co-segmentation, which aims to co-segment and label common and unique parts between objects across images for granular object comparison. To the best of our knowledge, this is the first work to formalize this multi-image object/part co-segmentation task.

  • New Multi-Image Pixel-Grounded LVLM. We propose Calico (Component-Focused Adaptive Learning for Multi-Image Co-Localization of Objects), an LVLM designed for part-focused semantic co-segmentation. Calico incorporates a novel correspondence extraction module to learn cross-image semantic correspondences and an adaptation module to enable localized co-segmentation across multiple images in a parameter-efficient manner.

  • New Dataset. We introduce the MixedParts dataset for part-focused semantic co-segmentation, compiled from diverse part segmentation datasets and featuring images of logically comparable objects and parts.

Calico Architecture

CALICO Model Architecture.

Calico uses a Q-Former cross-attention module to query efficient image embeddings from a pretrained image encoder, which are passed into a Vicuna-based LLM as image features. We extract [SEG] tokens from the output text, which are used to prompt a SAM decoder to output corresponding segmentation masks.
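The [SEG] step can be illustrated with a minimal sketch. It assumes, as in other [SEG]-token designs, that the hidden state at each [SEG] position is what gets handed to the mask decoder as a prompt; the text above does not spell this out. The function names, the toy token list, and the two-dimensional hidden states are all hypothetical and are not taken from the Calico codebase.

```python
from typing import List, Sequence

SEG_TOKEN = "[SEG]"  # token name taken from the description above


def seg_token_positions(tokens: Sequence[str]) -> List[int]:
    """Return the index of every [SEG] token in a generated token sequence."""
    return [i for i, tok in enumerate(tokens) if tok == SEG_TOKEN]


def gather_seg_embeddings(
    tokens: Sequence[str], hidden_states: Sequence[Sequence[float]]
) -> List[Sequence[float]]:
    """Select the hidden-state vector at each [SEG] position.

    Assumption: each selected vector would act as one prompt for the
    mask decoder, yielding one segmentation mask per [SEG] token.
    """
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must have the same length")
    return [hidden_states[i] for i in seg_token_positions(tokens)]


# Toy output sequence (hypothetical) with one [SEG] token.
tokens = ["The", "common", "object", "is", "the", "person", "[SEG]", "."]
# Toy 2-d hidden states, one per token; real models use far larger vectors.
hidden_states = [[float(i), float(i) + 0.5] for i in range(len(tokens))]

prompts = gather_seg_embeddings(tokens, hidden_states)
print(seg_token_positions(tokens))  # [6]
print(prompts)                      # [[6.0, 6.5]]
```

In this sketch, a response containing several [SEG] tokens would produce several prompt vectors, which is how one answer could be tied to multiple masks across images.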

We propose two modules, the Correspondence Extraction Module (CEM) and the Correspondence Adaptation Module (CAM), to enable the learning of semantic-rich features for multi-image correspondence. In Calico, k CAMs are placed at intervals of N/k layers within the N-layer LLM. The CEM extracts fine-grained semantic information at the part level, capturing correspondences across similar yet distinct object categories by leveraging self-supervised DINO features. The CAMs then inject this part-level correspondence information into the next layer of the model.
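The spacing rule for the CAMs can be written out as a short sketch. The starting offset is an assumption: the text gives only the N/k spacing, so the sketch lets the first CAM feed layer 0 and assumes k divides N evenly. The function name and the values N = 32, k = 4 are illustrative only.

```python
from typing import List


def cam_layer_indices(num_layers: int, num_cams: int) -> List[int]:
    """0-based indices of the LLM layers that receive a CAM's output.

    Assumptions (not stated in the source): the first CAM feeds layer 0,
    and num_cams divides num_layers evenly.
    """
    if num_layers <= 0 or num_cams <= 0:
        raise ValueError("num_layers and num_cams must be positive")
    if num_layers % num_cams != 0:
        raise ValueError("this sketch assumes num_cams divides num_layers")
    stride = num_layers // num_cams  # the N/k spacing from the text
    return [i * stride for i in range(num_cams)]


# Illustrative values only.
print(cam_layer_indices(32, 4))  # [0, 8, 16, 24]
```

Spreading the modules evenly keeps the number of added parameters proportional to k rather than N.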

CALICO novel modules CEM and CAM.
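As a rough illustration of what feature-based correspondence between two images means, the sketch below matches each patch in one image to its most similar patch in the other using cosine similarity. This is a generic nearest-neighbor matching scheme, not the CEM itself. The tiny 2-d vectors stand in for self-supervised patch features, and all names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_patch_matches(
    feats_a: Sequence[Sequence[float]], feats_b: Sequence[Sequence[float]]
) -> List[int]:
    """For each patch feature in image A, return the index of the most
    similar patch feature in image B."""
    return [
        max(range(len(feats_b)), key=lambda j: cosine(fa, feats_b[j]))
        for fa in feats_a
    ]


# Toy stand-ins for per-patch features of two images.
feats_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
feats_b = [[0.1, 0.9], [0.9, 0.1]]

print(nearest_patch_matches(feats_a, feats_b))  # [1, 0, 1]
```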

MixedParts Dataset

Although multi-image datasets of various scales are available, each has limitations that make it unsuitable for the part-focused semantic co-segmentation task: some lack fine-grained segmentation masks, some contain localized labels but are too small or domain-specific to support generalizable LVLM training, and others lack part-level information altogether.

To address these challenges and enable effective training and evaluation of our part-focused semantic co-segmentation model, we introduce a novel dataset named MixedParts, curated from publicly available part segmentation datasets: ADE20K-Part-234, PACO, and PartImageNet.

MixedParts dataset examples.

Example image pairs in MixedParts with objects, common parts, and unique parts segmented and labeled. Each column represents a different image pair, drawn from a diverse set of datasets with varying levels of detail (PACO, PartImageNet, and ADE20K-Part-234) and covering both rigid and non-rigid objects and parts. Each image pair is displayed across 3 rows to illustrate (i) the (possibly common) object, (ii) the common object parts, and (iii) the unique object parts in each pair.
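The distinction between common and unique parts in a pair can be sketched as a comparison of part-label sets. This is only an illustration of the idea: how MixedParts actually derives its common and unique part annotations is not described here. The `PartComparison` class, `compare_parts` function, and the part lists for the two objects are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class PartComparison:
    """Part labels shared by two objects and those found in only one."""
    common: FrozenSet[str]
    unique_to_first: FrozenSet[str]
    unique_to_second: FrozenSet[str]


def compare_parts(parts_a: Iterable[str], parts_b: Iterable[str]) -> PartComparison:
    """Split two part-label collections into common and unique parts."""
    a, b = frozenset(parts_a), frozenset(parts_b)
    return PartComparison(
        common=a & b,            # labels present in both objects
        unique_to_first=a - b,   # labels only in the first object
        unique_to_second=b - a,  # labels only in the second object
    )


# Hypothetical part lists for a snake and a dog.
result = compare_parts({"head", "body"}, {"head", "body", "leg", "tail"})
print(sorted(result.common))            # ['body', 'head']
print(sorted(result.unique_to_first))   # []
print(sorted(result.unique_to_second))  # ['leg', 'tail']
```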

Quantitative Results

Calico results.

Experimental Results on MixedParts. The first three metrics are segmentation-based, while the last two are text-based. Calico outperforms baselines across all metrics.

Qualitative Results

Example responses for image pairs (Image 1, Image 2):

Calico: The common object is the person.

Calico: The object present in the images is the car.

Calico: The images show a snake and a dog. The common parts in both objects are the body and the head.

Calico: The images contain a bed and a table. The unique parts between the objects are a headboard, a top, and a leg.

BibTeX

@article{nguyen2025calico,
  title={CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models},
  author={Nguyen, Kiet A. and Juvekar, Adheesh and Yu, Tianjiao and Wahed, Muntasir and Lourentzou, Ismini},
  journal={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}