Combine multiple VFM features
Background: The effectiveness of representation alignment in generative models can depend on which characteristics the chosen Vision Foundation Model (VFM) emphasizes, because that VFM's features serve as the alignment target during training.
Question / Future Work: Future work should investigate combining features from multiple VFMs as alignment targets, rather than relying on a single VFM, to obtain a more robust and comprehensive alignment signal.
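One way to make the question concrete is a weighted multi-target alignment loss. This is a minimal sketch that assumes a REPA-style objective: the mean negative patch-wise cosine similarity between the generator's projected hidden states and frozen VFM features. It also assumes one projection head per VFM, since different VFMs produce features of different dimensions. The function names, the `vfm_a`/`vfm_b` labels, and the uniform default weights are hypothetical illustrations, not a method proposed in this note.

```python
import math
from typing import Dict, List, Optional, Sequence

Patch = Sequence[float]  # one patch's feature vector


def cosine_similarity(a: Patch, b: Patch, eps: float = 1e-8) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature dimensions must match")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def single_vfm_loss(projected: List[Patch], target: List[Patch]) -> float:
    """Mean negative cosine similarity over patches (assumed REPA-style)."""
    if len(projected) != len(target) or not projected:
        raise ValueError("need the same non-zero number of patches")
    sims = [cosine_similarity(p, t) for p, t in zip(projected, target)]
    return -sum(sims) / len(sims)


def multi_vfm_loss(
    projected: Dict[str, List[Patch]],
    targets: Dict[str, List[Patch]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of per-VFM alignment losses (hypothetical scheme).

    projected[name] holds the generator features after that VFM's own
    projection head; targets[name] holds the frozen VFM features.
    Weights default to uniform and are normalized to sum to 1.
    """
    names = sorted(targets)
    if set(projected) != set(names):
        raise ValueError("projected and targets must cover the same VFMs")
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(
        weights[name] / total * single_vfm_loss(projected[name], targets[name])
        for name in names
    )


# Usage: vfm_a's targets are perfectly aligned, vfm_b's are orthogonal.
projected = {
    "vfm_a": [[1.0, 0.0], [0.0, 1.0]],
    "vfm_b": [[1.0, 0.0, 0.0]],
}
targets = {
    "vfm_a": [[2.0, 0.0], [0.0, 3.0]],
    "vfm_b": [[0.0, 1.0, 0.0]],
}
print(multi_vfm_loss(projected, targets))  # approximately -0.5
```

A weighted average keeps each VFM's contribution on the same cosine scale, so the weights directly express how much each model's features should steer alignment. Concatenating all VFM features into a single target would be another option, but it would couple the VFMs through one projection head.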
Metadata & Links
- created_at: 2026-03-27T06:07:04Z