Single-view novel view synthesis, the task of generating images from new
viewpoints given a single reference image, is an important but challenging
problem in computer vision. Recently, Denoising Diffusion Probabilistic Models
(DDPMs) have become popular in this area due to their strong ability to
generate high-fidelity images. However, current diffusion-based methods directly rely on
camera pose matrices as viewing conditions, globally and implicitly introducing
3D constraints. These methods may suffer from inconsistency among images
generated from different viewpoints, especially in regions with intricate
textures and structures. In this work, we present Light Field Diffusion (LFD),
a conditional diffusion-based model for single-view novel view synthesis.
Unlike previous methods that employ camera pose matrices, LFD transforms the
camera view information into a light field encoding and combines it with the
reference image. This design introduces local pixel-wise constraints within the
diffusion model, thereby encouraging better multi-view consistency.
Experiments on several datasets show that our LFD can efficiently generate
high-fidelity images and maintain better 3D consistency even in intricate
regions. Our method generates higher-quality images than NeRF-based models,
and it achieves sample quality comparable to other diffusion-based models with
only one-third of the model size.