HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation

PLAN Lab, University of Illinois Urbana-Champaign
* Equal Contribution

Abstract

Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination focus primarily on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1,340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentations, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.


✅ Contributions

  • New Benchmark. We present HalluSegBench, the first benchmark for evaluating segmentation hallucinations using counterfactual image-text pairs, covering 1,340 pairs across 281 object classes.

  • Novel Metrics. We introduce four new metrics that quantify hallucination severity under visual/textual counterfactuals, reveal over-reliance on semantic priors, and assess spatial plausibility of hallucinated masks.

  • Empirical Insights. Experiments on state-of-the-art vision-language segmentation models show they hallucinate more under visual edits than textual ones, highlighting the need for counterfactual-based diagnostics.

Quantitative Results


Comparison of reasoning segmentation models on HalluSegBench metrics, including the textual and visual IoU drops (ΔIoUtextual, ΔIoUvisual), the factual and counterfactual Confusion Mask Score (CMS), and the contrastive hallucination metric CCMS.
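As a rough illustration of how an IoU-drop metric can be computed (this is our own minimal sketch, not the paper's official implementation; the function names and the exact ΔIoU definition as a factual-minus-counterfactual difference are assumptions):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def delta_iou(pred_factual: np.ndarray, gt_factual: np.ndarray,
              pred_counterfactual: np.ndarray, gt_counterfactual: np.ndarray) -> float:
    """Drop in mask quality when moving from the factual input to its
    counterfactual edit. A large positive drop indicates the model's
    segmentation breaks under the scene edit (hallucination sensitivity)."""
    return iou(pred_factual, gt_factual) - iou(pred_counterfactual, gt_counterfactual)
```

In this sketch, ΔIoUvisual would use predictions on the original versus the visually edited image, while ΔIoUtextual would swap the query label c for c′ instead; the benchmark's precise metric definitions are given in the paper.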

Qualitative Results

HalluSegBench exposes the hallucination severity of different reasoning-based segmentation models. We present qualitative examples illustrating the predictions of the benchmarked models across the four query-image combinations, alongside the corresponding ground-truth masks.

Image 2

Here, c = “full grown sheep” and c′ = “a cow”.

Image 3

Here, c = “front cow” and c′ = “front pig”.

BibTeX

@article{li2025hallusegbench,
    title={HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation},
    author={Li, Xinzhuo and Juvekar, Adheesh and Liu, Xingyou and Wahed, Muntasir and Nguyen, Kiet A and Lourentzou, Ismini},
    journal={arXiv preprint arXiv:2506.21546},
    year={2025}
}