DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation
One key challenge of exemplar-guided image generation lies in establishing
fine-grained correspondences between input and guided images. Prior approaches,
despite the promising results, have relied on either estimating dense attention
to compute per-point matching, which is limited to only coarse scales due to
the quadratic memory cost, or fixing the number of correspondences to achieve
linear complexity, which lacks flexibility. In this paper, we propose a dynamic
sparse attention based Transformer model, termed Dynamic Sparse Transformer
(DynaST), to achieve fine-level matching with favorable efficiency. The heart
of our approach is a novel dynamic-attention unit, dedicated to covering the
variation on the optimal number of tokens one position should focus on.
Specifically, DynaST leverages the multi-layer nature of Transformer structure,
and performs the dynamic attention scheme in a cascaded manner to refine
matching results and synthesize visually-pleasing outputs. In addition, we
introduce a unified training objective for DynaST, making it a versatile
reference-based image translation framework for both supervised and
unsupervised scenarios. Extensive experiments on three applications,
pose-guided person image generation, edge-based face synthesis, and undistorted
image style transfer, demonstrate that DynaST achieves superior performance in
local details, outperforming the state of the art while reducing the
computational cost significantly. Our code is available at
https://github.com/Huage001/DynaST. Comment: ECCV 2022
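The idea of a per-position, variable number of attended tokens can be illustrated with a minimal numpy sketch. This is a hypothetical toy, not the authors' DynaST unit: here each query keeps only the keys whose softmax weight exceeds a fraction of that query's strongest weight (the `threshold` heuristic is an assumption for illustration), so the number of surviving correspondences varies dynamically per position.

```python
import numpy as np

def dynamic_sparse_attention(q, k, v, threshold=0.5):
    """Toy dynamic sparse attention: each query attends only to keys whose
    softmax weight is at least `threshold` times that query's top weight,
    so the number of attended tokens varies per position."""
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (Nq, Nk) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax per query
    # Dynamic sparsity: drop tokens far below each query's strongest match.
    mask = weights >= threshold * weights.max(axis=-1, keepdims=True)
    sparse = np.where(mask, weights, 0.0)
    sparse /= sparse.sum(axis=-1, keepdims=True)      # renormalise survivors
    return sparse @ v, mask.sum(axis=-1)              # output, tokens kept

rng = np.random.default_rng(0)
out, kept = dynamic_sparse_attention(rng.normal(size=(4, 8)),
                                     rng.normal(size=(16, 8)),
                                     rng.normal(size=(16, 8)))
```

Because the threshold is relative to each row's maximum, at least one key always survives per query, while the cascaded refinement over Transformer layers described in the abstract would repeat such a step at progressively finer scales.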
ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
While language-guided image manipulation has made remarkable progress, the
challenge of how to instruct the manipulation process faithfully reflecting
human intentions persists. An accurate and comprehensive description of a
manipulation task using natural language is laborious and sometimes even
impossible, primarily due to the inherent uncertainty and ambiguity present in
linguistic expressions. Is it feasible to accomplish image manipulation without
resorting to external cross-modal language information? If this possibility
exists, the inherent modality gap would be effortlessly eliminated. In this
paper, we propose a novel manipulation methodology, dubbed ImageBrush, that
learns visual instructions for more accurate image editing. Our key idea is to
employ a pair of transformation images as visual instructions, which not only
precisely captures human intention but also facilitates accessibility in
real-world scenarios. Capturing visual instructions is particularly challenging
because it involves extracting the underlying intentions solely from visual
demonstrations and then applying this operation to a new image. To address this
challenge, we formulate visual instruction learning as a diffusion-based
inpainting problem, where the contextual information is fully exploited through
an iterative process of generation. A visual prompting encoder is carefully
devised to enhance the model's capacity in uncovering human intent behind the
visual instructions. Extensive experiments show that our method generates
engaging manipulation results conforming to the transformations entailed in
demonstrations. Moreover, our model exhibits robust generalization capabilities
on various downstream tasks such as pose transfer, image translation and video
inpainting.
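The inpainting formulation of visual instructions can be sketched as a simple layout step. The 2x2 grid arrangement below is an assumption for illustration (a common in-context layout), not a confirmed detail of ImageBrush: the example pair and the query are tiled onto one canvas, and the empty quadrant is the region a diffusion inpainter would fill with the analogously edited query.

```python
import numpy as np

def make_incontext_canvas(src, edited, query):
    """Tile a transformation example (src -> edited) and a query image into
    a 2x2 grid, leaving one quadrant masked for diffusion inpainting."""
    h, w, c = src.shape
    canvas = np.zeros((2 * h, 2 * w, c), dtype=src.dtype)
    canvas[:h, :w] = src          # top-left: example input
    canvas[:h, w:] = edited      # top-right: example after the edit
    canvas[h:, :w] = query       # bottom-left: new image to edit
    mask = np.zeros((2 * h, 2 * w), dtype=bool)
    mask[h:, w:] = True          # bottom-right: region to be generated
    return canvas, mask

canvas, mask = make_incontext_canvas(np.ones((8, 8, 3)),
                                     np.zeros((8, 8, 3)),
                                     np.ones((8, 8, 3)))
```

Framed this way, "extracting the underlying intention" reduces to the inpainter exploiting the visible three quadrants as context, which is what the iterative generation process in the abstract refers to.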
Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation
Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
the past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research.
SAIR: Learning Semantic-aware Implicit Representation
Implicit representation of an image can map arbitrary coordinates in the
continuous domain to their corresponding color values, presenting a powerful
capability for image reconstruction. Nevertheless, existing implicit
representation approaches only focus on building continuous appearance mapping,
ignoring the continuities of the semantic information across pixels. As a
result, they can hardly achieve desired reconstruction results when the
semantic information within input images is corrupted, for example, when a large
region is missing. To address this issue, we propose to learn a semantic-aware
implicit representation (SAIR); that is, we make the implicit representation of
each pixel rely on both its appearance and semantic information (e.g., which
object does the pixel belong to). To this end, we propose a framework with two
modules: (1) building a semantic implicit representation (SIR) for a corrupted
image in which large regions are missing. Given an arbitrary coordinate in the
continuous domain, we can obtain its text-aligned embedding indicating the
object the pixel belongs to. (2) building an appearance implicit representation
(AIR) based on the SIR. Given an arbitrary coordinate in the continuous domain,
we can reconstruct its color whether or not the pixel is missing in the input.
We validate the novel semantic-aware implicit representation method on the
image inpainting task, and the extensive experiments demonstrate that our
method surpasses state-of-the-art approaches by a significant margin.
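The two-module idea, a semantic mapping (SIR) feeding an appearance mapping (AIR), can be sketched with toy random-weight functions. Everything here is illustrative: the weights, feature sizes, and activations are assumptions, standing in for the paper's learned networks and text-aligned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
W_sem = rng.normal(size=(2, 16))   # toy SIR weights (illustrative only)
W_col = rng.normal(size=(18, 3))   # toy AIR weights (illustrative only)

def semantic_embedding(coords):
    # SIR stand-in: continuous (x, y) -> semantic feature of the object
    # at that location (the paper predicts text-aligned embeddings).
    return np.tanh(coords @ W_sem)

def color_at(coords):
    # AIR stand-in: condition the colour prediction on both the coordinate
    # and its semantic embedding, so colour can stay plausible even at
    # coordinates whose pixels are missing from the input.
    feats = np.concatenate([coords, semantic_embedding(coords)], axis=-1)
    return 1.0 / (1.0 + np.exp(-feats @ W_col))   # RGB values in (0, 1)

coords = rng.uniform(0.0, 1.0, size=(5, 2))  # arbitrary continuous points
rgb = color_at(coords)                        # (5, 3) colour predictions
```

The key structural point survives even in this toy: the colour query never sees raw pixels, only the coordinate and its semantic feature, which is why a corrupted region does not prevent reconstruction at that coordinate.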