Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Authors: Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah
Date: 2026-03-26
Paper ID: arxiv:2603.25711

Summary

This paper addresses the structural vulnerability of Multimodal Diffusion Large Language Models (MDLLMs) to hallucination, attributing it to an objective mismatch: tokens are ranked by textual likelihood alone, with no visual verification. The authors propose VISAGE, a training-free decoding framework that calibrates this objective at inference time by estimating the discrepancy from the spatial entropy of cross-attention distributions. VISAGE penalizes spatially uniform attention and enforces a localization consensus across attention heads, so that visually grounded tokens are favored when the model commits to them. On the hallucination-sensitive benchmarks MMMU-val and HallusionBench, VISAGE reports relative improvements of 8.59% and 7.75%, respectively, over baseline decoding methods.
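
For readers unfamiliar with the term, the spatial entropy of an attention map can be read as the Shannon entropy of one head's attention weights over the image patches. The notation below (head index h, patch index i, patch count N, weights a_{h,i}) is assumed here for illustration and is not taken from the paper, whose exact formulation may differ.

```latex
H_h = -\sum_{i=1}^{N} a_{h,i} \log a_{h,i},
\qquad \sum_{i=1}^{N} a_{h,i} = 1,
\qquad 0 \le H_h \le \log N .
```

The lower bound is reached when all attention mass sits on a single patch (fully localized), and the upper bound when attention is uniform over all N patches, which is the case the summary describes as penalized.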

Key Contributions

Identified multimodal hallucination in MDLLMs as a localized optimization error arising from an objective mismatch between language-only token ranking and visual grounding requirements.
Introduced VISAGE, a training-free inference-time framework that quantifies the proxy discrepancy using the spatial entropy of cross-attention distributions to enforce localization consensus across attention heads.
Provided an analytical stability guarantee showing that VISAGE maintains a bounded objective loss even with estimation errors during inference.
Achieved significant robustness gains, including an 8.59% relative improvement on MMMU-val and 7.75% on HallusionBench compared to baseline decoding methods.
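
The idea of penalizing spatially uniform attention can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the penalty weight `lam`, the use of normalized Shannon entropy, and the choice of a simple mean across heads as the "consensus" are all hypothetical.

```python
import numpy as np

def spatial_entropy(attn, eps=1e-12):
    """Normalized Shannon entropy of attention over image patches.

    attn: array of shape (..., num_patches) with non-negative weights.
    Returns values close to 0 for localized maps and close to 1 for uniform maps.
    """
    attn = np.asarray(attn, dtype=np.float64)
    p = attn / (attn.sum(axis=-1, keepdims=True) + eps)
    h = -(p * np.log(p + eps)).sum(axis=-1)
    return h / np.log(p.shape[-1])

def rerank_scores(log_probs, cross_attn, lam=1.0):
    """Subtract an entropy penalty from each candidate's textual score.

    log_probs:  (num_candidates,) textual log-likelihoods.
    cross_attn: (num_candidates, num_heads, num_patches) attention maps.
    lam:        hypothetical penalty weight (an assumption, not from the paper).
    """
    per_head = spatial_entropy(cross_attn)   # (num_candidates, num_heads)
    consensus = per_head.mean(axis=-1)       # assumed aggregation: mean over heads
    return np.asarray(log_probs, dtype=np.float64) - lam * consensus

# Toy example: two candidate tokens, two heads, sixteen patches.
num_patches = 16
uniform = np.full((2, num_patches), 1.0 / num_patches)  # diffuse attention
peaked = np.zeros((2, num_patches))
peaked[:, 3] = 1.0                                      # localized attention
scores = rerank_scores([-1.0, -1.2], np.stack([uniform, peaked]))
best = int(np.argmax(scores))  # index of the candidate preferred after the penalty
```

In this toy setup the second candidate has a slightly lower textual likelihood but sharply localized attention, so it outranks the first once the penalty is applied.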

Limitations

Because VISAGE is training-free and applied only at inference time, its performance depends on the quality of the cross-attention map estimates, although the analytical guarantee suggests robustness to estimation error.

Open Questions & Future Work

Temporal consistency in video MDLLMs

Key Concepts

VISAGE Framework: A training-free decoding framework that estimates and calibrates the objective mismatch in Multimodal Diffusion Large Language Models (MDLLMs) by quantifying the spatial entropy of cross-attention distributions.

Datasets

MMMU-val and HallusionBench (the evaluation benchmarks named in the Summary)
