TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models

Authors: Jiaying Zhou, Zhihao Zhan, Ruifeng Zhai, Qinhan Lyu, Hao Liu, Keze Wang, Liang Lin, Guangrun Wang
Date: 2026-03-25
Paper ID: arxiv:2603.24584

Summary

This paper addresses reliability degradation in Vision-Language-Action (VLA) models in cluttered scenes, where instance-level grounding failures lead the policy to act on the wrong object. It introduces Target-Agnostic Guidance (TAG), an inference-time mechanism inspired by classifier-free guidance. TAG computes a residual steering signal as the difference between the policy’s action prediction on the original observation and its prediction on an object-erased observation. This signal strengthens the evidence related to the target object, improving grounding accuracy without modifying the core VLA policy architecture. Evaluations on benchmarks such as LIBERO and VLABench indicate that TAG improves robustness under clutter and reduces erroneous actions on the wrong object instance.
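The guidance step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the combination rule `a_full + guidance_scale * (a_full - a_erased)` is an assumption borrowed from the usual classifier-free guidance form, and `tag_guided_action`, `guidance_scale`, and `toy_policy` are hypothetical names that do not come from the paper.

```python
from typing import Callable, List, Sequence


def tag_guided_action(
    policy: Callable[[object], Sequence[float]],
    observation: object,
    erased_observation: object,
    guidance_scale: float = 1.0,  # hypothetical knob, not taken from the paper
) -> List[float]:
    """Combine two policy calls into one guided action (illustrative only)."""
    a_full = policy(observation)           # prediction with the target visible
    a_erased = policy(erased_observation)  # prediction with the target erased
    # Residual steering signal: the part of the action attributable to the target.
    residual = [f - e for f, e in zip(a_full, a_erased)]
    # Assumed CFG-style combination; the paper's exact rule may differ.
    return [f + guidance_scale * r for f, r in zip(a_full, residual)]


def toy_policy(obs: dict) -> List[float]:
    """Toy stand-in for a VLA policy that is biased toward a distractor."""
    target, distractor = obs["target"], obs["distractor"]
    if target is None:  # target erased: only the distractor drives the action
        return list(distractor)
    return [0.5 * t + 0.5 * d for t, d in zip(target, distractor)]


obs = {"target": [1.0, 0.0], "distractor": [0.0, 1.0]}
erased = {"target": None, "distractor": [0.0, 1.0]}
guided = tag_guided_action(toy_policy, obs, erased, guidance_scale=1.0)
print(guided)
```

In this toy setup the unguided action sits halfway between the target and the distractor, and subtracting the erased prediction pushes it back toward the target. With `guidance_scale=0.0` the function returns the unmodified policy output, so the base policy is recovered as a special case.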

Key Contributions

  • Proposes Target-Agnostic Guidance (TAG), an inference-time mechanism that uses the difference between predictions on the original and object-erased observations to reduce distractor bias in VLA policies.
  • Demonstrates that TAG consistently improves the robustness of VLA models to scene clutter and reduces near-miss and wrong-object execution errors across manipulation benchmarks.
  • Presents TAG as a policy-agnostic method that requires minimal architectural or training modifications to existing Vision-Language-Action models.

Limitations

The primary limitation is that TAG relies on an object-erased observation, so it needs a reliable mechanism (such as an object mask) to generate the counterfactual input. This can add overhead or introduce mask-related errors.
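To make the dependence on the mask concrete, here is a minimal sketch of one way an object-erased observation could be produced from a binary mask. The mean-fill strategy, the nested-list image representation, and the name `erase_object` are assumptions for illustration; the source does not state which erasing method the paper uses.

```python
from typing import List

Image = List[List[float]]  # grayscale H x W image (hypothetical representation)
Mask = List[List[bool]]    # True where the target object is


def erase_object(image: Image, mask: Mask) -> Image:
    """Replace masked pixels with the mean of the unmasked (background) pixels."""
    background = [
        pixel
        for row, mask_row in zip(image, mask)
        for pixel, masked in zip(row, mask_row)
        if not masked
    ]
    if not background:
        raise ValueError("mask covers the whole image; nothing to fill from")
    fill = sum(background) / len(background)
    # Build a new image so the original observation is left untouched.
    return [
        [fill if masked else pixel for pixel, masked in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


image = [[2.0, 2.0], [2.0, 9.0]]
mask = [[False, False], [False, True]]
print(erase_object(image, mask))
```

Any error in the mask carries straight through to the erased image: a mask that misses part of the object leaves target evidence in the counterfactual input, and a mask that is too large removes background context.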

Key Concepts

  • Target-Agnostic Guidance: An inference-time guidance mechanism for VLA models that uses the difference between predictions on the original observation and an object-erased observation to steer actions toward the intended target.

Metadata & Links

url
https://arxiv.org/abs/2603.24584
paper_id
2603.24584
paper_source
arxiv
domain
robotics
tags
vision-language-model, robotics, agent, guidance, object-detection, vision-language-action-model, tool-use
architectures
datasets
LIBERO, LIBERO-Plus, VLABench
skill
GeneralMLSkill
created_at
2026-03-26T06:26:20Z