
One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Authors: Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard
Date: 2026-03-24
Paper ID: arxiv:2603.23488

Summary

This paper introduces OVIE, a framework for monocular novel view synthesis trained entirely on large-scale, unpaired internet images, challenging the assumption that multi-view supervision is necessary. During training, a monocular depth estimator lifts a source image into 3D space; a camera transformation is then applied and the result is projected back to an image, producing a pseudo-target view that serves as supervision. Because this synthetic supervision contains disocclusions, OVIE uses a masked training objective that restricts the loss to regions supported by both the source and pseudo-target projections. The resulting model significantly outperforms prior methods in a zero-shot setting and is substantially faster at inference, where it requires neither an explicit 3D structure nor depth estimation.
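The lift, transform, and project step described above can be illustrated with a pinhole camera model. The following is a minimal sketch, not the authors' implementation: the function name `lift_and_reproject`, the intrinsics matrix `K`, and the rotation/translation pair `(R, t)` are assumptions introduced here for illustration, and the summary does not specify how OVIE parameterizes the camera or renders the warped image.

```python
import numpy as np

def lift_and_reproject(depth, K, R, t):
    """Hypothetical helper: lift source pixels to 3D, move them by (R, t), project again.

    depth: (h, w) per-pixel depth from a monocular estimator (assumed given).
    K:     (3, 3) pinhole intrinsics (assumed known or chosen).
    R, t:  (3, 3) rotation and (3,) translation from source to target camera.
    Returns target pixel coordinates (h, w, 2) and target-frame depth (h, w).
    """
    h, w = depth.shape
    # Homogeneous pixel grid, shape (3, h*w); u is the column index, v the row index.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)]).astype(np.float64)
    # Unproject: X = depth * K^-1 * pixel.
    pts_src = (np.linalg.inv(K) @ pix) * depth.ravel()
    # Rigid transform into the target camera frame.
    pts_tgt = R @ pts_src + t.reshape(3, 1)
    # Perspective projection back to pixel coordinates.
    proj = K @ pts_tgt
    z = proj[2]
    uv = proj[:2] / z
    return uv.T.reshape(h, w, 2), z.reshape(h, w)
```

With an identity transform every pixel maps back onto itself; a sideways translation shifts each pixel by an amount inversely proportional to its depth, which is what opens up disoccluded regions in the pseudo-target view.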

Key Contributions

Introduction of OVIE, a novel view synthesis method trained exclusively on single, unpaired, in-the-wild images, eliminating the need for multi-view supervision.
Leveraging a monocular depth estimator only as a geometric scaffold during training to generate pseudo-target views via 3D lifting and projection.
Development of a masked training formulation to handle view-dependent artifacts like disocclusions by restricting loss computation to geometrically valid regions.
Achieving state-of-the-art zero-shot performance, outperforming prior methods trained with multi-view supervision, while being 600x faster at inference.
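The masked training formulation in the list above can be pictured as a reconstruction loss averaged only over geometrically valid pixels. The sketch below is an illustration under assumptions: it uses a plain L1 error, and the name `masked_l1` and the boolean `valid` mask are hypothetical. The summary does not state which loss OVIE actually uses or how its validity mask is computed.

```python
import numpy as np

def masked_l1(pred, target, valid, eps=1e-8):
    """Hypothetical masked loss: mean absolute error over valid pixels only.

    pred, target: (h, w, c) predicted view and pseudo-target view.
    valid:        (h, w) boolean mask, True where the pseudo-target is
                  supported by the warped source (assumed precomputed).
    """
    # Add a channel axis so the mask broadcasts over colour channels.
    weights = valid.astype(pred.dtype)[..., None]
    abs_err = np.abs(pred - target) * weights
    # Normalise by the number of valid values; eps guards an all-invalid mask.
    return abs_err.sum() / (weights.sum() * pred.shape[-1] + eps)
```

Pixels outside the mask, such as disoccluded regions where the pseudo-target has no reliable content, contribute nothing to the loss, so the model is not penalized for whatever it generates there.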

Limitations

The method relies on an external depth estimator during training, so the quality of its supervision depends on the accuracy and generalizability of that scaffold. Performance might also be limited by the uncurated nature of the massive training data.

Key Concepts

OVIE: A model for monocular novel view generation trained entirely on unpaired internet images without requiring explicit 3D representations or depth estimation at inference time.

Datasets

internet images (uncurated)
