Free3D: consistent novel view synthesis without 3D representation
We introduce Free3D, a simple, accurate method for
monocular open-set novel view synthesis (NVS). Similar to
Zero-1-to-3, we start from a pre-trained 2D image generator for generalization, and fine-tune it for NVS. Compared
to other works that took a similar approach, we obtain significant improvements without resorting to an explicit 3D representation, which is slow and memory-intensive, and
without training an additional network for 3D reconstruction. Our key contribution is to improve the way the target
camera pose is encoded in the network, which we do by
introducing a new ray conditioning normalization (RCN)
layer. The latter injects pose information into the underlying 2D image generator by telling each pixel its viewing direction. We further improve multi-view consistency by using lightweight multi-view attention layers and by sharing generation noise between the different views. We train
Free3D on the Objaverse dataset and demonstrate excellent
generalization to new categories in new datasets, including
OmniObject3D and GSO. The project page is available at
https://chuanxiaz.com/free3d/
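Below is a minimal PyTorch-style sketch of the ray-conditioning idea described above: each pixel's feature vector is modulated by a scale and shift predicted from its viewing-ray encoding. The class and parameter names are hypothetical and this is not the authors' implementation; it only illustrates how per-pixel pose information could be injected through a normalization layer.

```python
import torch
import torch.nn as nn

class RayConditionedNorm(nn.Module):
    """Hypothetical RCN-style layer: per-pixel modulation from ray encodings."""

    def __init__(self, num_features: int, ray_dim: int = 6, hidden: int = 64):
        super().__init__()
        # Parameter-free normalization; all modulation comes from the rays.
        self.norm = nn.GroupNorm(32, num_features, affine=False)
        self.mlp = nn.Sequential(
            nn.Conv2d(ray_dim, hidden, 1), nn.SiLU(),
            nn.Conv2d(hidden, 2 * num_features, 1),  # per-pixel scale and shift
        )

    def forward(self, x: torch.Tensor, rays: torch.Tensor) -> torch.Tensor:
        # x:    (B, C, H, W) feature map inside the 2D image generator
        # rays: (B, ray_dim, H, W) per-pixel encoding of the target viewing ray
        scale, shift = self.mlp(rays).chunk(2, dim=1)
        return self.norm(x) * (1 + scale) + shift
```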
Pluralistic Image Completion
Most image completion methods produce only one result for each masked input,
although there may be many reasonable possibilities. In this paper, we present
an approach for pluralistic image completion -- the task of generating
multiple and diverse plausible solutions for image completion. A major
challenge faced by learning-based approaches is that there is usually only one ground-truth training instance per label. As such, sampling from conditional VAEs
still leads to minimal diversity. To overcome this, we propose a novel and
probabilistically principled framework with two parallel paths. One is a
reconstructive path that utilizes the single given ground truth to obtain a prior distribution of the missing parts and rebuilds the original image from this
distribution. The other is a generative path for which the conditional prior is
coupled to the distribution obtained in the reconstructive path. Both are
supported by GANs. We also introduce a new short+long term attention layer that
exploits distant relations among decoder and encoder features, improving
appearance consistency. When tested on datasets with buildings (Paris), faces
(CelebA-HQ), and natural images (ImageNet), our method not only generates higher-quality completion results, but also produces multiple diverse plausible outputs.
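The coupling between the two paths can be pictured as a KL term that pulls the generative path's conditional prior toward the distribution inferred by the reconstructive path. The sketch below, with hypothetical tensor names and a diagonal-Gaussian assumption, is illustrative only and not the authors' code.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians.
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()

# rec_mu, rec_logvar: distribution inferred by the reconstructive path
#                     (it sees the ground-truth missing region).
# gen_mu, gen_logvar: conditional prior from the generative path
#                     (it sees only the masked input).
# coupling_loss = gaussian_kl(rec_mu, rec_logvar, gen_mu, gen_logvar)
```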
T2Net: Synthetic-to-Realistic Translation for Solving Single-Image Depth Estimation Tasks
Current methods for single-image depth estimation use training datasets with
real image-depth pairs or stereo pairs, which are not easy to acquire. We
propose a framework, trained on synthetic image-depth pairs and unpaired real
images, that comprises an image translation network for enhancing realism of
input images, followed by a depth prediction network. A key idea is having the
first network act as a wide-spectrum input translator, taking in either
synthetic or real images, and ideally producing minimally modified realistic
images. This is done via a reconstruction loss when the training input is real, and a GAN loss when it is synthetic, removing the need for heuristic
self-regularization. The second network is trained with a task loss on synthetic image-depth pairs, plus an additional GAN loss to unify real and synthetic feature
distributions. Importantly, the framework can be trained end-to-end, leading to
good results, even surpassing early deep-learning methods that use real paired data.
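The wide-spectrum translator idea can be summarized as a per-batch loss switch: real inputs are penalized for changing, synthetic inputs are penalized for looking unrealistic. The sketch below uses hypothetical names (G, D_img) and simplified losses; it is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def translator_loss(G, D_img, x, is_real: bool):
    """One forward pass of the shared translator with the domain-dependent loss."""
    y = G(x)  # translated image, ideally realistic and minimally modified
    if is_real:
        # Real inputs should pass through nearly unchanged (reconstruction loss).
        return F.l1_loss(y, x)
    # Synthetic inputs should fool the image discriminator (GAN loss).
    logits = D_img(y)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```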
IPO-LDM: Depth-aided 360-degree Indoor RGB Panorama Outpainting via Latent Diffusion Model
Generating complete 360-degree panoramas from narrow-field-of-view images remains an open research problem, as omnidirectional RGB data is not readily available. Existing GAN-based approaches struggle to produce high-quality output and generalize poorly over different mask types. In this paper,
we present our 360-degree indoor RGB panorama outpainting model using latent
diffusion models (LDM), called IPO-LDM. We introduce a new bi-modal latent
diffusion structure that utilizes both RGB and depth panoramic data during
training, yet works surprisingly well for outpainting ordinary depth-free RGB images at inference time. We further propose a novel technique of introducing
progressive camera rotations during each diffusion denoising step, which leads
to substantial improvement in achieving panorama wraparound consistency.
Results show that our IPO-LDM not only significantly outperforms
state-of-the-art methods on RGB panorama outpainting, but can also produce
multiple and diverse well-structured results for different types of masks.
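One way to picture the progressive camera rotations is as a circular shift of the equirectangular latent along its horizontal axis before each denoising update, so the wraparound seam never stays in the same place. The sketch below uses hypothetical names and a fixed shift size; the paper's exact rotation schedule may differ.

```python
import torch

def denoise_with_rotation(latent, timesteps, denoise_step, shift_per_step: int = 8):
    """Run a denoising loop, rotating the panorama latent before every step."""
    offset = 0
    for t in timesteps:
        # Circularly shift the equirectangular latent: a horizontal camera rotation.
        latent = torch.roll(latent, shifts=shift_per_step, dims=-1)
        offset += shift_per_step
        latent = denoise_step(latent, t)  # one LDM denoising update (assumed callable)
    # Undo the accumulated rotation so the result is in the original frame.
    return torch.roll(latent, shifts=-offset, dims=-1)
```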
What Does Stable Diffusion Know about the 3D Scene?
Recent advances in generative models like Stable Diffusion enable the
generation of highly photo-realistic images. Our objective in this paper is to
probe the diffusion network to determine to what extent it 'understands'
different properties of the 3D scene depicted in an image. To this end, we make
the following contributions: (i) We introduce a protocol to evaluate whether a
network models a number of physical 'properties' of the 3D scene by probing for
explicit features that represent these properties. The probes are applied on
datasets of real images with annotations for the property. (ii) We apply this
protocol to properties covering scene geometry, scene material, support
relations, lighting, and view-dependent measures. (iii) We find that Stable Diffusion is good at a number of properties, including scene geometry, support relations, shadows, and depth, but performs less well on occlusion. (iv) We also
apply the probes to other models trained at large-scale, including DINO and
CLIP, and find their performance inferior to that of Stable Diffusion.
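In spirit, the protocol amounts to freezing the generative network, extracting its intermediate features on annotated real images, and fitting a simple probe per property. A minimal sketch with hypothetical names is given below; the paper's exact feature-extraction and probe design may differ.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Linear read-out on frozen features for one 3D-scene property."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim) pooled activations from the frozen network.
        return self.fc(feats)

# Probe accuracy on held-out annotated images is then used as the measure of
# how well the frozen model represents that property.
```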
MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation
Although two-stage Vector Quantized (VQ) generative models allow for
synthesizing high-fidelity and high-resolution images, their quantization
operator encodes similar patches within an image into the same index, resulting in repeated artifacts across similar adjacent regions when decoded with existing decoder architectures. To address this issue, we propose to incorporate spatially conditional normalization to modulate the quantized vectors, inserting spatially variant information into the embedded index maps and encouraging the decoder to generate more photorealistic images. Moreover, we use multichannel
quantization to increase the recombination capability of the discrete codes
without increasing the cost of the model or the codebook. Additionally, to generate
discrete tokens at the second stage, we adopt a Masked Generative Image
Transformer (MaskGIT) to learn an underlying prior distribution in the
compressed latent space, which is much faster than the conventional
autoregressive model. Experiments on two benchmark datasets demonstrate that
our proposed modulated VQGAN is able to greatly improve the reconstructed image
quality as well as provide high-fidelity image generation.
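A SPADE-style reading of the spatially conditional normalization is sketched below: decoder features are normalized and then modulated per location by scale and shift maps predicted from the quantized embedding map, so identical code indices can still decode to spatially varying content. The names and the exact conditioning signal are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatiallyModulatedNorm(nn.Module):
    """SPADE-style normalization driven by the quantized embedding map."""

    def __init__(self, num_features: int, code_dim: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(code_dim, hidden, 3, padding=1), nn.ReLU(),
        )
        self.to_scale = nn.Conv2d(hidden, num_features, 3, padding=1)
        self.to_shift = nn.Conv2d(hidden, num_features, 3, padding=1)

    def forward(self, x: torch.Tensor, z_q: torch.Tensor) -> torch.Tensor:
        # x:   (B, C, H, W) decoder features
        # z_q: (B, code_dim, H, W) quantized embedding map, resized to (H, W)
        h = self.shared(z_q)
        return self.norm(x) * (1 + self.to_scale(h)) + self.to_shift(h)
```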
One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls
It is well known that many open-released foundational diffusion models have
difficulty in generating images that substantially depart from average
brightness, despite such images being present in the training data. This is due
to an inconsistency: while denoising starts from pure Gaussian noise during
inference, the training noise schedule retains residual data even in the final timestep distribution, due to difficulties in numerical conditioning in the mainstream formulation, leading to unintended bias during inference. To
mitigate this issue, certain ε-prediction models are combined with an
ad-hoc offset-noise methodology. In parallel, some contemporary models have
adopted zero-terminal SNR noise schedules together with v-prediction, which necessitate major alterations to pre-trained models. However, such changes risk destabilizing a large number of
community-driven applications anchored on these pre-trained models. In light of
this, our investigation revisits the fundamental causes, leading to our
proposal of an innovative and principled remedy, called One More Step (OMS). By
integrating a compact network and incorporating an additional simple yet
effective step during inference, OMS elevates image fidelity and harmonizes the
dichotomy between training and inference, while preserving original model
parameters. Once trained, various pre-trained diffusion models with the same
latent domain can share the same OMS module. Project page: https://jabir-zheng.github.io/OneMoreStep/; demo page: https://huggingface.co/spaces/h1t/oms_sdxl_lc
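Conceptually, OMS adds one learned step in front of the ordinary sampling loop: a compact network maps the true zero-SNR Gaussian start toward the distribution the frozen pre-trained model expects at its first timestep, after which the unmodified sampler takes over. The sketch below uses hypothetical function names and omits noise schedules and guidance.

```python
import torch

@torch.no_grad()
def sample_with_oms(oms_net, base_sampler, shape, prompt_emb):
    """One extra learned step in front of an otherwise unchanged sampler."""
    x = torch.randn(shape)               # true zero-SNR Gaussian starting point
    x = oms_net(x, prompt_emb)           # compact OMS module bridges the train/test gap
    return base_sampler(x, prompt_emb)   # frozen pre-trained diffusion sampler
```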