Attentive VQ-VAE
We present a novel approach to enhance the capabilities of VQ-VAE models
through the integration of a Residual Encoder and a Residual Pixel Attention
layer, named Attentive Residual Encoder (AREN). The objective of our research
is to improve the performance of VQ-VAE while maintaining practical parameter
levels. The AREN encoder is designed to operate effectively at multiple levels,
accommodating diverse architectural complexities. The key innovation is the
integration of an inter-pixel self-attention mechanism into the AREN encoder.
This approach allows us to efficiently capture and utilize contextual
information across latent vectors. In addition, our model uses extra
encoding levels to further enhance its representational power. Our
attention layer employs a minimal parameter approach, ensuring that latent
vectors are modified only when pertinent information from other pixels is
available. Experimental results demonstrate that our proposed modifications
lead to significant improvements in data representation and generation, making
VQ-VAEs even more suitable for a wide range of applications such as those presented here.
Comment: 5 pages, 4 figures, 2 tables, 1 pseudo-code
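As an illustration of the inter-pixel attention idea described above, the following PyTorch sketch shows a lightweight self-attention layer over an encoder's latent grid, with a zero-initialised gate so that latent vectors are modified only when attention contributes information from other pixels. The class and parameter names are assumptions for illustration, not the paper's exact AREN layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAttention(nn.Module):
    """Hypothetical inter-pixel self-attention over a latent grid."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        # 1x1 convolutions keep the parameter count small, in the spirit
        # of the minimal-parameter attention described in the abstract.
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        # Zero-initialised gate: the layer starts as an identity mapping.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, z):                                 # z: (B, C, H, W)
        b, c, h, w = z.shape
        q = self.query(z).flatten(2).transpose(1, 2)      # (B, HW, C/r)
        k = self.key(z).flatten(2)                        # (B, C/r, HW)
        v = self.value(z).flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = F.softmax(q @ k / (q.size(-1) ** 0.5), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return z + self.gate * out                        # gated residual update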
MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation
Although two-stage Vector Quantized (VQ) generative models allow for
synthesizing high-fidelity and high-resolution images, their quantization
operator encodes similar patches within an image into the same index, resulting
in repeated artifacts across similar adjacent regions with existing decoder
architectures. To address this issue, we propose to incorporate the spatially
conditional normalization to modulate the quantized vectors so as to insert
spatially variant information to the embedded index maps, encouraging the
decoder to generate more photorealistic images. Moreover, we use multichannel
quantization to increase the recombination capability of the discrete codes
without increasing the cost of the model or the codebook. Additionally, to generate
discrete tokens at the second stage, we adopt a Masked Generative Image
Transformer (MaskGIT) to learn an underlying prior distribution in the
compressed latent space, which is much faster than the conventional
autoregressive model. Experiments on two benchmark datasets demonstrate that
our proposed modulated VQGAN is able to greatly improve the reconstructed image
quality as well as provide high-fidelity image generation.
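The spatially conditional normalization described above can be sketched as a SPADE-style layer: decoder features are normalized and then modulated by per-location scale and shift maps predicted from the quantized latent map, which injects spatially variant information. Class and argument names here are illustrative assumptions, not the paper's exact MoVQ layer.

import torch.nn as nn
import torch.nn.functional as F

class SpatiallyConditionalNorm(nn.Module):
    def __init__(self, feat_channels, code_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.to_gamma = nn.Conv2d(code_channels, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(code_channels, feat_channels, 3, padding=1)

    def forward(self, h, z_q):
        # h: decoder features (B, C, H, W); z_q: quantized latent map,
        # resized to h's spatial size so each location receives its own
        # modulation instead of a single index-wide embedding.
        z_q = F.interpolate(z_q, size=h.shape[-2:], mode="nearest")
        return self.norm(h) * (1 + self.to_gamma(z_q)) + self.to_beta(z_q)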
Binary Latent Diffusion
In this paper, we show that a binary latent space can be explored for compact
yet expressive image representations. We model the bi-directional mappings
between an image and the corresponding latent binary representation by training
an auto-encoder with a Bernoulli encoding distribution. On the one hand, the
binary latent space provides a compact discrete image representation of which
the distribution can be modeled more efficiently than pixels or continuous
latent representations. On the other hand, we now represent each image patch as
a binary vector instead of an index into a learned codebook as in discrete image
representations with vector quantization. In this way, we obtain binary latent
representations that allow for better image quality and high-resolution image
representations without any multi-stage hierarchy in the latent space. In this
binary latent space, images can now be generated effectively using a binary
latent diffusion model tailored specifically for modeling the prior over the
binary image representations. We present both conditional and unconditional
image generation experiments with multiple datasets, and show that the proposed
method performs comparably to state-of-the-art methods while dramatically
improving the sampling efficiency to as few as 16 steps without using any
test-time acceleration. The proposed framework can also be seamlessly scaled to
high-resolution image generation without resorting to latent
hierarchy or multi-stage refinements.
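A minimal sketch of the Bernoulli encoding step, assuming a standard straight-through estimator to keep the encoder trainable through the hard binary sampling; the function name and tensor shapes are illustrative, not the paper's exact implementation.

import torch

def binary_encode(logits):
    # logits: raw encoder outputs, e.g. of shape (B, D, H, W).
    probs = torch.sigmoid(logits)
    codes = torch.bernoulli(probs)          # hard 0/1 latent codes
    # Straight-through: the forward pass uses the binary sample, the
    # backward pass uses gradients of the underlying probabilities.
    return codes + probs - probs.detach()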
TextCraft: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Text
Language is one of the primary means by which we describe the 3D world around
us. While rapid progress has been made in text-to-2D-image synthesis, similar
progress in text-to-3D-shape synthesis has been hindered by the lack of paired
(text, shape) data. Moreover, extant methods for text-to-shape generation have
limited shape diversity and fidelity. We introduce TextCraft, a method to
address these limitations by producing high-fidelity and diverse 3D shapes
without the need for (text, shape) pairs for training. TextCraft achieves this
by using CLIP together with a multi-resolution approach: it first generates in a
low-dimensional latent space and then upscales to a higher resolution,
improving the fidelity of the generated shape. To improve shape diversity, we
use a discrete latent space which is modelled using a bidirectional transformer
conditioned on the interchangeable image-text embedding space induced by CLIP.
Moreover, we present a novel variant of classifier-free guidance, which further
improves the accuracy-diversity trade-off. Finally, we perform extensive
experiments that demonstrate that TextCraft outperforms state-of-the-art
baselines.
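For reference, standard classifier-free guidance, which the guidance variant mentioned above builds on, can be written as a simple interpolation between conditional and unconditional token logits; the paper's actual variant is not reproduced here, and guidance_scale is an illustrative parameter.

import torch

def guided_logits(cond_logits: torch.Tensor,
                  uncond_logits: torch.Tensor,
                  guidance_scale: float = 3.0) -> torch.Tensor:
    # Larger scales push samples toward the text condition (accuracy),
    # smaller scales preserve diversity: the accuracy-diversity trade-off.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)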