Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs
Authors: Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah
Date: 2026-03-26
Paper ID: arxiv:2603.25711
Summary
This paper addresses the structural vulnerability of Multimodal Diffusion Large Language Models (MDLLMs) to hallucination. It attributes the problem to an objective mismatch: tokens are ranked by textual likelihood alone, with no visual verification. The authors propose VISAGE, a training-free decoding framework that calibrates this objective at inference time by estimating the discrepancy from the spatial entropy of cross-attention distributions. VISAGE penalizes spatially uniform attention and requires a localization consensus across attention heads, so that visually grounded token commitments are favored. On the hallucination-sensitive benchmarks MMMU-val and HallusionBench, VISAGE reports relative improvements of 8.59% and 7.75%, respectively, over baseline decoding methods.
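To make the entropy signal concrete, the following is a minimal sketch of how the spatial entropy of an attention distribution over image patches could be computed and averaged across heads. It is an illustration under stated assumptions, not the paper's implementation: the function names `spatial_entropy` and `mean_head_entropy`, the normalization by the log of the patch count, and the use of a plain mean as the cross-head consensus are all hypothetical choices that this summary does not confirm.

```python
import math


def spatial_entropy(attn_weights):
    """Normalized Shannon entropy of one head's attention over image patches.

    Returns a value in [0, 1]: 1.0 for perfectly uniform attention and
    0.0 when all mass sits on a single patch.
    (Normalizing by log(num_patches) is an assumption of this sketch.)
    """
    total = sum(attn_weights)
    if total <= 0 or len(attn_weights) < 2:
        raise ValueError("need at least two patches with positive total mass")
    probs = [w / total for w in attn_weights]
    # Terms with p == 0 contribute nothing to the entropy and are skipped.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def mean_head_entropy(per_head_weights):
    """Average normalized entropy across heads.

    A plain mean is only one possible notion of "consensus" (assumption).
    """
    return sum(spatial_entropy(h) for h in per_head_weights) / len(per_head_weights)


# Toy attention maps over four patches (illustrative values, not from the paper).
uniform = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly: no localization
peaked = [0.97, 0.01, 0.01, 0.01]    # attention concentrated on one patch

print(spatial_entropy(uniform))
print(spatial_entropy(peaked))
print(mean_head_entropy([uniform, peaked]))
```

In this sketch the uniform map reaches the maximum normalized entropy while the peaked map scores far lower, which is the property a penalty on spatially uniform attention would exploit.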
Key Contributions
- Identified multimodal hallucination in MDLLMs as a localized optimization error arising from an objective mismatch between language-only token ranking and visual grounding requirements.
- Introduced VISAGE, a training-free inference-time framework that quantifies the proxy discrepancy using the spatial entropy of cross-attention distributions to enforce localization consensus across attention heads.
- Provided an analytical stability guarantee showing that VISAGE maintains a bounded objective loss even with estimation errors during inference.
- Achieved significant robustness gains, including an 8.59% relative improvement on MMMU-val and 7.75% on HallusionBench compared to baseline decoding methods.
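For readers comparing the reported gains with absolute score differences, a relative improvement is conventionally measured against the baseline score. Assuming the paper follows this standard definition (the summary does not state it), and using $s_{\text{base}}$ and $s_{\text{VISAGE}}$ as hypothetical symbols for the two benchmark scores:

```latex
\[
\Delta_{\text{rel}}
  = \frac{s_{\text{VISAGE}} - s_{\text{base}}}{s_{\text{base}}} \times 100\%
\]
```

Under this reading, 8.59% is a fraction of the baseline score rather than a gain of 8.59 percentage points.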
Limitations
VISAGE operates only at inference time (it is training-free), so its performance depends on the quality of the cross-attention maps from which it estimates the discrepancy. The analytical stability guarantee suggests some robustness to errors in that estimate.
Key Concepts
- VISAGE Framework: A training-free decoding framework that estimates and calibrates the objective mismatch in Multimodal Diffusion Large Language Models (MDLLMs) by quantifying the spatial entropy of cross-attention distributions.
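As a rough illustration of what calibrating the ranking could look like during decoding, the sketch below re-ranks candidate token commitments by subtracting an entropy penalty from their textual confidence. The linear penalty, the `penalty_weight` parameter, the function name `calibrated_ranking`, and the toy values are all assumptions made for this example; the summary does not specify the paper's actual calibration rule.

```python
def calibrated_ranking(candidates, penalty_weight=1.0):
    """Rank masked positions by textual confidence minus an entropy penalty.

    candidates: list of (position, confidence, normalized_entropy) tuples,
    where confidence and normalized_entropy both lie in [0, 1].
    Returns the positions ordered from most to least preferred.
    (The linear penalty is a hypothetical choice for this sketch.)
    """
    scored = [
        (confidence - penalty_weight * entropy, position)
        for position, confidence, entropy in candidates
    ]
    scored.sort(reverse=True)  # highest calibrated score first
    return [position for _, position in scored]


# Toy candidates (illustrative values only):
# position 3 is textually confident but its attention is nearly uniform,
# position 7 is slightly less confident but well localized.
candidates = [
    (3, 0.90, 0.95),
    (7, 0.80, 0.20),
    (11, 0.60, 0.50),
]

print(calibrated_ranking(candidates, penalty_weight=0.0))  # text-only ranking
print(calibrated_ranking(candidates, penalty_weight=0.5))  # entropy-calibrated ranking
```

With the penalty disabled, position 3 is committed first on textual confidence alone; with `penalty_weight=0.5`, the well-localized position 7 moves ahead of it.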
Metadata & Links
- url: https://arxiv.org/abs/2603.25711
- paper_id: 2603.25711
- paper_source: arxiv
- domain: multimodal
- tags: multimodal, vision-language-model, llm, attention-mechanism, emergent-abilities, evaluation, alignment
- architectures: decoder-only
- datasets: MMMU-val, HallusionBench
- skill: GeneralMLSkill
- created_at: 2026-03-27T09:09:32Z