Diffusion-based Image Translation using Disentangled Style and Content Representation
Diffusion-based image translation guided by semantic texts or a single target
image has enabled flexible style transfer that is not limited to specific
domains. Unfortunately, due to the stochastic nature of diffusion models, it is
often difficult to maintain the original content of the image during the
reverse diffusion. To address this, here we present a novel diffusion-based
unsupervised image translation method using disentangled style and content
representation.
Specifically, inspired by the splicing Vision Transformer, we extract the
intermediate keys of the multi-head self-attention layers of a ViT model and
use them as a content preservation loss. Image-guided style transfer is then
performed by matching the [CLS] classification tokens of the denoised samples
and the target image, while an additional CLIP loss is used for text-driven
style transfer. To further accelerate the semantic change during the reverse
diffusion, we also propose a novel semantic divergence loss and resampling
strategy. Our experimental results show that the proposed method outperforms
state-of-the-art baseline models in both text-guided and image-guided
translation tasks.
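As a rough illustration of how these guidance terms could combine, here is a
minimal PyTorch sketch; the `vit.get_keys` / `vit.get_cls` feature hooks, the
loss weights, and the use of plain MSE for key matching are all illustrative
assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def translation_loss(vit, clip_model, x_denoised, x_source,
                     x_style=None, text_emb=None,
                     w_content=1.0, w_style=1.0, w_clip=1.0):
    # Content preservation: match intermediate multi-head self-attention
    # keys of the denoised sample to those of the source image (hypothetical
    # `get_keys` hook; MSE stands in for the paper's key-matching loss).
    loss = w_content * F.mse_loss(vit.get_keys(x_denoised),
                                  vit.get_keys(x_source))
    if x_style is not None:
        # Image-guided style: match [CLS] tokens of denoised and target images.
        loss += w_style * F.mse_loss(vit.get_cls(x_denoised),
                                     vit.get_cls(x_style))
    if text_emb is not None:
        # Text-guided style: pull the CLIP image embedding toward the prompt.
        img_emb = clip_model.encode_image(x_denoised)
        loss += w_clip * (1 - F.cosine_similarity(img_emb, text_emb).mean())
    return loss
```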
DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation
One key challenge of exemplar-guided image generation lies in establishing
fine-grained correspondences between the input and guided images. Despite
promising results, prior approaches have relied on either estimating dense
attention to compute per-point matching, which is limited to coarse scales due
to the quadratic memory cost, or fixing the number of correspondences to achieve
linear complexity, which lacks flexibility. In this paper, we propose a dynamic
sparse attention based Transformer model, termed Dynamic Sparse Transformer
(DynaST), to achieve fine-level matching with favorable efficiency. The heart
of our approach is a novel dynamic-attention unit, dedicated to handling the
variation in the optimal number of tokens each position should attend to.
Specifically, DynaST leverages the multi-layer nature of the Transformer and
applies the dynamic attention scheme in a cascaded manner to refine
matching results and synthesize visually-pleasing outputs. In addition, we
introduce a unified training objective for DynaST, making it a versatile
reference-based image translation framework for both supervised and
unsupervised scenarios. Extensive experiments on three applications:
pose-guided person image generation, edge-based face synthesis, and undistorted
image style transfer, demonstrate that DynaST achieves superior performance in
local details, outperforming the state of the art while reducing the
computational cost significantly. Our code is available at
https://github.com/Huage001/DynaST
Comment: ECCV 2022
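A toy PyTorch sketch of the per-position dynamic sparsity; the relative
thresholding rule and the mask-passing interface below are illustrative
assumptions, not DynaST's exact design.

```python
import torch
import torch.nn.functional as F

def dynamic_sparse_attention(q, k, v, prev_mask=None, tau=0.1):
    # q, k, v: (B, N, d). Each query keeps only keys whose attention weight
    # is within a factor `tau` of its best match, so the number of attended
    # tokens varies per position (the "dynamic" part).
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if prev_mask is not None:
        # Cascaded refinement: re-score only the correspondences kept by
        # the previous layer.
        scores = scores.masked_fill(~prev_mask, float('-inf'))
    probs = F.softmax(scores, dim=-1)
    keep = probs >= tau * probs.amax(dim=-1, keepdim=True)  # >=1 key per query
    attn = F.softmax(scores.masked_fill(~keep, float('-inf')), dim=-1)
    return attn @ v, keep  # pass `keep` as `prev_mask` to the next layer
```

Stacking such layers refines the sparse correspondence set across the cascade
while keeping the cost well below dense attention.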
Hierarchy Composition GAN for High-fidelity Image Synthesis
Despite the rapid progress of generative adversarial networks (GANs) in image
synthesis in recent years, existing approaches work in either the geometry
domain or the appearance domain alone, which often introduces various
synthesis artifacts. This paper presents an innovative Hierarchical
Composition GAN (HIC-GAN) that incorporates image synthesis in the geometry and
appearance domains into an end-to-end trainable network and achieves superior
synthesis realism in both domains simultaneously. We design a hierarchical
composition mechanism that learns realistic composition geometry and handles
occlusions when multiple foreground objects are involved in image composition.
In addition, we introduce a novel attention mask mechanism that guides the
adaptation of foreground object appearance and also provides better training
references for learning in the geometry domain. Extensive experiments on scene
text image synthesis, portrait editing and indoor rendering tasks show that the
proposed HIC-GAN achieves superior synthesis performance both qualitatively and
quantitatively.
Comment: 11 pages, 8 figures
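One way to picture composition in the geometry and appearance domains jointly
is a spatial-transformer warp followed by attention-weighted alpha blending.
The sketch below uses assumed interfaces (RGBA foregrounds, one affine `theta`
and one attention mask per object), not HIC-GAN's actual architecture.

```python
import torch
import torch.nn.functional as F

def hierarchical_compose(background, foregrounds, thetas, attn_masks):
    # background: (B, 3, H, W); each foreground: (B, 4, h, w) RGBA;
    # each theta: (B, 2, 3) affine; each attn mask: (B, 1, H, W) in [0, 1].
    B, _, H, W = background.shape
    canvas = background
    for fg, theta, attn in zip(foregrounds, thetas, attn_masks):
        # Geometry: warp the foreground into the background frame.
        grid = F.affine_grid(theta, (B, 4, H, W), align_corners=False)
        warped = F.grid_sample(fg, grid, align_corners=False)
        rgb, alpha = warped[:, :3], warped[:, 3:4]
        # Appearance: the attention mask modulates how strongly the object
        # overrides the background; front-to-back blending handles occlusion.
        a = alpha * attn
        canvas = a * rgb + (1 - a) * canvas
    return canvas
```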