One View Is Enough! Monocular Training for In-the-Wild Novel View Generation
Authors: Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard
Date: 2026-03-24
Paper ID: arxiv:2603.23488
Summary
This paper introduces OVIE, a framework for monocular novel view synthesis trained entirely on large-scale, unpaired internet images, challenging the assumption that multi-view supervision is necessary. During training, a monocular depth estimator lifts each source image into 3D space; a camera transformation is then applied and the result is projected to create a pseudo-target view that serves as supervision. Because this synthetic supervision contains disocclusions, OVIE uses a masked training objective that restricts the loss to regions supported by both the source and pseudo-target projections. The resulting model significantly outperforms prior methods in a zero-shot setting and is substantially faster at inference, where it requires no explicit 3D structure or depth estimation.
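The lift, transform, and project steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the pinhole intrinsics `K`, the shared-intrinsics assumption, the function name `reproject_pixels`, and the convention that `R` and `t` map source-camera coordinates to target-camera coordinates are all assumptions made for illustration. The sketch only computes where each source pixel lands in the target view; turning those correspondences into a pseudo-target image and a validity mask would additionally need a splatting or rasterization step that is not shown here.

```python
import numpy as np


def reproject_pixels(depth, K, R, t):
    """Map every source pixel to its location in a hypothetical target camera.

    depth: (H, W) per-pixel depth, e.g. from a monocular estimator (assumed > 0)
    K:     (3, 3) pinhole intrinsics, assumed identical for both views
    R, t:  (3, 3) rotation and (3,) translation taking source-camera
           coordinates to target-camera coordinates (illustrative convention)
    Returns (H, W, 2) target pixel coordinates and (H, W) target-view depth.
    """
    H, W = depth.shape
    # u indexes columns, v indexes rows; both arrays have shape (H, W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Lift: X = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    points = rays * depth[..., None]

    # Apply the camera transformation: X' = R X + t
    points_t = points @ R.T + t

    # Project back to pixels: [u', v', 1]^T ~ K X'
    proj = points_t @ K.T
    z = proj[..., 2]
    uv = proj[..., :2] / z[..., None]  # assumes z > 0 (points in front of camera)
    return uv, z
```

As a sanity check, a constant depth of 2 with focal length 100 and a translation of 0.1 along x shifts every pixel by f * t_x / z = 5 pixels horizontally, which matches the usual disparity relation.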
Key Contributions
- Introduction of OVIE, a novel view synthesis method trained exclusively on single, unpaired, in-the-wild images, eliminating the need for multi-view supervision.
- Leveraging a monocular depth estimator only as a geometric scaffold during training to generate pseudo-target views via 3D lifting and projection.
- Development of a masked training formulation to handle view-dependent artifacts like disocclusions by restricting loss computation to geometrically valid regions.
- Achieving state-of-the-art zero-shot performance relative to prior methods trained with multi-view supervision, while running 600x faster at inference.
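The masked training formulation listed above can be sketched as a loss that ignores pixels outside the geometrically valid region. This is a minimal illustration, not the paper's objective: the choice of an L1 penalty, the function name `masked_l1_loss`, and the assumption that a per-pixel validity mask is already available are all hypothetical.

```python
import numpy as np


def masked_l1_loss(pred, pseudo_target, valid_mask, eps=1e-8):
    """Mean absolute error computed only over pixels flagged as valid.

    pred, pseudo_target: (H, W, C) images
    valid_mask:          (H, W) boolean or 0/1 array; True/1 marks pixels
                         supported by the warp (assumed to be given)
    eps:                 small constant guarding against an all-zero mask
    """
    mask = valid_mask.astype(np.float64)[..., None]  # (H, W, 1), broadcasts over C
    abs_err = np.abs(pred - pseudo_target) * mask     # zero out invalid pixels
    n_valid = mask.sum() * pred.shape[-1]             # valid pixels times channels
    return abs_err.sum() / (n_valid + eps)
```

Normalizing by the number of valid pixels rather than by the full image size keeps the loss scale comparable across samples with different amounts of disocclusion.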
Limitations
The method relies on an external depth estimator during training, which introduces dependence on the accuracy and generalizability of that scaffold. Performance might be limited by the uncurated nature of the massive training data.
Key Concepts
- OVIE: A model for monocular novel view generation trained entirely on unpaired internet images without requiring explicit 3D representations or depth estimation at inference time.
Datasets
Metadata & Links
- url: https://arxiv.org/abs/2603.23488
- paper_id: 2603.23488
- paper_source: arxiv
- domain: computer-vision
- tags: computer-vision, self-supervised-learning, image-generation, object-detection, robustness
- architectures:
- datasets: internet images (uncurated)
- skill: GeneralMLSkill
- created_at: 2026-03-25T21:18:05Z