Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Authors: Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah
Date: 2026-03-26
Paper ID: arxiv:2603.25711

Summary

This paper addresses the structural vulnerability of Multimodal Diffusion Large Language Models (MDLLMs) to hallucination. It attributes the problem to an objective mismatch: tokens are ranked solely by textual likelihood, with no visual verification. The authors propose VISAGE, a training-free decoding framework that calibrates this objective at inference time, estimating the proxy discrepancy from the spatial entropy of cross-attention distributions. VISAGE penalizes spatially uniform attention and enforces a localization consensus across attention heads, which favors visually grounded token commitments. On hallucination-sensitive benchmarks, VISAGE improves grounding resilience, including relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench over baseline decoding methods.
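
To make the spatial-entropy signal concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that a cross-attention map can be flattened into one non-negative weight per image patch, and the function name `spatial_entropy` and the normalization by the log of the patch count are illustrative choices that do not come from the paper.

```python
import math

def spatial_entropy(attn_weights):
    """Normalized Shannon entropy of one attention map over image patches.

    attn_weights: non-negative weights, one per patch (assumed layout).
    Returns a value in [0, 1]: 0 means all mass on one patch,
    1 means perfectly uniform (spatially diffuse) attention.
    """
    total = sum(attn_weights)
    if total <= 0 or len(attn_weights) < 2:
        raise ValueError("need at least two patches with positive total weight")
    probs = [w / total for w in attn_weights]
    # Zero-probability patches contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))

peaked = spatial_entropy([0.97, 0.01, 0.01, 0.01])   # attention focused on one patch
uniform = spatial_entropy([0.25, 0.25, 0.25, 0.25])  # attention spread evenly
```

Under this sketch, the evenly spread map scores higher than the focused one, which is the kind of spatially uniform attention the summary says VISAGE penalizes.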

Key Contributions

  • Identified multimodal hallucination in MDLLMs as a localized optimization error arising from an objective mismatch between language-only token ranking and visual grounding requirements.
  • Introduced VISAGE, a training-free inference-time framework that quantifies the proxy discrepancy using the spatial entropy of cross-attention distributions to enforce localization consensus across attention heads.
  • Provided an analytical stability guarantee showing that VISAGE maintains a bounded objective loss even with estimation errors during inference.
  • Achieved significant robustness gains, including an 8.59% relative improvement on MMMU-val and 7.75% on HallusionBench compared to baseline decoding methods.
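
The idea of calibrating a language-only ranking with an attention-based penalty can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the `penalty_weight` coefficient, the subtraction of the penalty from the log-likelihood, and the use of the mean per-head entropy as a stand-in for "localization consensus". The paper's actual scoring rule may differ.

```python
import math

def head_entropy(weights):
    """Normalized spatial entropy of one head's attention over image patches."""
    total = sum(weights)
    probs = [w / total for w in weights]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def calibrated_scores(log_probs, attn_per_position, penalty_weight=1.0):
    """Re-rank masked positions by text confidence minus a grounding penalty.

    log_probs: {position: log-likelihood of the top candidate token}
    attn_per_position: {position: list of per-head attention maps}
    penalty_weight: hypothetical trade-off coefficient (not from the paper)
    """
    scores = {}
    for pos, lp in log_probs.items():
        heads = attn_per_position[pos]
        # Assumed consensus rule: average the entropy across heads, so a
        # position scores well only when most heads localize on the image.
        mean_entropy = sum(head_entropy(h) for h in heads) / len(heads)
        scores[pos] = lp - penalty_weight * mean_entropy
    return scores

log_probs = {0: -0.10, 1: -0.30}  # position 0 is more likely under text alone
attn = {
    0: [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],  # diffuse heads
    1: [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]],  # localized heads
}
scores = calibrated_scores(log_probs, attn)
text_only_choice = max(log_probs, key=log_probs.get)
calibrated_choice = max(scores, key=scores.get)
```

In this toy setup the text-only ranking would commit position 0 first, while the calibrated ranking prefers position 1, whose heads concentrate on a single image region.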

Limitations

VISAGE operates only at inference time (it is training-free), and its performance depends on the quality of the estimated cross-attention maps, although the analytical stability guarantee suggests robustness to estimation errors.

Key Concepts

  • VISAGE Framework: A training-free decoding framework that estimates and calibrates the objective mismatch in Multimodal Diffusion Large Language Models (MDLLMs) by quantifying the spatial entropy of cross-attention distributions.

Datasets

  • MMMU-val
  • HallusionBench

Metadata & Links

url: https://arxiv.org/abs/2603.25711
paper_id: 2603.25711
paper_source: arxiv
domain: multimodal
tags: multimodal, vision-language-model, llm, attention-mechanism, emergent-abilities, evaluation, alignment
architectures: decoder-only
datasets: MMMU-val, HallusionBench
skill: GeneralMLSkill
created_at: 2026-03-27T09:09:32Z