Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation

Authors: Brian Chao, Lior Yariv, Howard Xiao, Gordon Wetzstein
Date: 2026-03-24
Paper ID: arxiv:2603.23491

Summary

This paper introduces Foveated Diffusion, which addresses the computational bottleneck of high-resolution image and video generation by exploiting a property of human vision: acuity is highest around the gaze point (the fovea) and falls off toward the periphery. The method uses a foveation mask to allocate model tokens non-uniformly, with higher token density in the foveal region and lower density in the periphery. A novel mechanism constructs these mixed-resolution tokens from high-resolution data, which allows existing diffusion models to be post-trained for this setting without sacrificing content fidelity. The evaluation indicates that the results are perceptually equivalent to full-resolution generation while substantially reducing token count and inference time.
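
To make the token-saving argument concrete, the following is a minimal sketch of how a foveation mask could translate into a token budget. It is not the paper's implementation: the circular mask, the function names (`foveation_mask`, `count_tokens`), the `fovea_radius` parameter, and the rule of merging each 2x2 block of peripheral patches into one token are all assumptions chosen for illustration.

```python
import math


def foveation_mask(grid_h, grid_w, gaze_row, gaze_col, fovea_radius):
    """Mark each patch True if it lies within fovea_radius (in patches) of the gaze.

    Hypothetical helper: a circular fovea is an assumption, not the paper's mask.
    """
    return [
        [math.hypot(r - gaze_row, c - gaze_col) <= fovea_radius for c in range(grid_w)]
        for r in range(grid_h)
    ]


def count_tokens(mask, block=2):
    """Count tokens when peripheral block x block groups collapse to one token.

    A block that contains any foveal patch keeps one token per patch.
    Assumes the grid height and width are divisible by `block`.
    """
    h, w = len(mask), len(mask[0])
    tokens = 0
    for r0 in range(0, h, block):
        for c0 in range(0, w, block):
            cells = [mask[r][c]
                     for r in range(r0, r0 + block)
                     for c in range(c0, c0 + block)]
            tokens += len(cells) if any(cells) else 1
    return tokens


if __name__ == "__main__":
    mask = foveation_mask(8, 8, gaze_row=3.5, gaze_col=3.5, fovea_radius=1.0)
    print(count_tokens(mask), "tokens instead of", 8 * 8)
```

Under these assumptions, an 8x8 patch grid with the gaze at the centre and a one-patch radius keeps 28 tokens instead of 64. Because attention cost grows faster than linearly with token count, even a modest reduction of this kind can shorten generation time noticeably.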

Key Contributions

Developed a mixed-resolution token construction mechanism that allows for non-uniform token density based on a foveation mask derived from gaze location.
Demonstrated that Foveated Diffusion generates content perceptually indistinguishable from full-resolution models while achieving significant reductions in token count and generation time.
Created a principled post-training procedure to adapt existing base diffusion models to the mixed-resolution token setting while preserving content consistency.
Validated the efficiency and perceptual quality improvements through extensive analysis and a user study confirming the practical scalability of foveation in generative models.
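
The first contribution, building mixed-resolution tokens from high-resolution data, can be sketched as follows. This summary does not describe the paper's actual construction, so the code below is only one plausible reading: it assumes scalar patch values, a boolean foveation mask, and average pooling of peripheral 2x2 blocks. The function name `build_mixed_tokens` and the token tuple layout are hypothetical.

```python
def build_mixed_tokens(patches, mask, block=2):
    """Build a mixed-resolution token list from a full-resolution patch grid.

    patches: 2D list of patch values (scalars here; real tokens would be vectors).
    mask:    2D list of booleans, True where a patch lies in the foveal region.
    Returns tuples (row, col, size, value), where size is the side length of the
    region the token covers, in patches.
    Assumes the grid height and width are divisible by `block`.
    """
    h, w = len(patches), len(patches[0])
    tokens = []
    for r0 in range(0, h, block):
        for c0 in range(0, w, block):
            cells = [(r, c)
                     for r in range(r0, r0 + block)
                     for c in range(c0, c0 + block)]
            if any(mask[r][c] for r, c in cells):
                # Foveal block: keep every patch as its own full-resolution token.
                tokens.extend((r, c, 1, patches[r][c]) for r, c in cells)
            else:
                # Peripheral block: average-pool the patches into one coarse token.
                mean = sum(patches[r][c] for r, c in cells) / len(cells)
                tokens.append((r0, c0, block, mean))
    return tokens


if __name__ == "__main__":
    patches = [[r * 4 + c for c in range(4)] for r in range(4)]
    mask = [[(r, c) == (0, 0) for c in range(4)] for r in range(4)]
    for token in build_mixed_tokens(patches, mask):
        print(token)
```

Storing the position and size with each token is a design choice of this sketch: it keeps enough information to place coarse and fine tokens back onto a common image grid after generation.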

Limitations

The efficiency gain is contingent upon the availability and accuracy of the gaze location (foveal region estimate).

Open Questions & Future Work

Key Concepts

Foveated Diffusion: A method for efficient image and video generation that leverages the non-uniform detail perception of human vision by allocating model tokens based on a predicted gaze location.
