LISA: Localized Image Stylization with Audio via Implicit Neural Representation
We present a novel framework, Localized Image Stylization with Audio (LISA)
which performs audio-driven localized image stylization. Sound often provides
information about the specific context of the scene and is closely related to a
certain part of the scene or object. However, existing image stylization works
have focused on stylizing the entire image using an image or text input.
Stylizing a particular part of the image based on audio input is natural but
challenging. In this work, we propose a framework in which a user provides one
audio input to localize the sound source in the input image and another to
locally stylize the target object or scene. LISA first produces a delicate
localization map with an audio-visual localization network by leveraging CLIP
embedding space. We then utilize implicit neural representation (INR) along
with the predicted localization map to stylize the target object or scene based
on sound information. The proposed INR can manipulate the localized pixel
values to be semantically consistent with the provided audio input. Through a
series of experiments, we show that the proposed framework outperforms
other audio-guided stylization methods. Moreover, LISA constructs concise
localization maps and naturally manipulates the target object or scene in
accordance with the given audio input.
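
The idea of restricting stylization to a localized region can be illustrated
with a simple mask blend. This is only a sketch under stated assumptions, not
the LISA method: the abstract describes an INR that manipulates the localized
pixel values, whereas the code below simply mixes an already stylized image
with the original using a soft localization map. The function name
`blend_localized` and the flat list-of-pixels layout are hypothetical choices
made for illustration.

```python
def blend_localized(original, stylized, loc_map):
    """Blend per pixel: out = m * stylized + (1 - m) * original.

    original, stylized: lists of (r, g, b) tuples with values in [0, 1].
    loc_map: list of floats, one per pixel; 1.0 means "fully inside the
    localized region", 0.0 means "leave the pixel untouched".
    Hypothetical helper for illustration, not part of LISA.
    """
    if not (len(original) == len(stylized) == len(loc_map)):
        raise ValueError("all inputs must have the same number of pixels")
    out = []
    for orig_px, sty_px, m in zip(original, stylized, loc_map):
        m = min(1.0, max(0.0, m))  # clamp the map value to [0, 1]
        out.append(tuple(m * s + (1.0 - m) * o
                         for o, s in zip(orig_px, sty_px)))
    return out


original = [(0.0, 0.0, 0.0), (0.2, 0.4, 0.6), (1.0, 1.0, 1.0)]
stylized = [(1.0, 0.0, 0.0), (0.6, 0.8, 1.0), (0.0, 0.0, 0.0)]
loc_map = [0.0, 0.5, 1.0]
result = blend_localized(original, stylized, loc_map)
```

With a map value of 0 the original pixel is kept, with 1 the stylized pixel
is used, and intermediate values give a proportional mix, so a soft map
produces smooth transitions at the region boundary.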
Manipulating Attributes of Natural Scenes via Hallucination
In this study, we explore building a two-stage framework for enabling users
to directly manipulate high-level attributes of a natural scene. The key to our
approach is a deep generative network which can hallucinate images of a scene
as if they were taken in a different season (e.g. during winter), weather
condition (e.g. on a cloudy day) or time of day (e.g. at sunset). Once the
scene is hallucinated with the given attributes, the corresponding look is then
transferred to the input image while keeping the semantic details intact,
giving a photo-realistic manipulation result. As the proposed framework
hallucinates what the scene will look like, it does not require any reference
style image as commonly utilized in most of the appearance or style transfer
approaches. Moreover, it allows a given scene to be manipulated simultaneously
according to a diverse set of transient attributes within a single model,
eliminating the need to train a separate network for each translation task.
Our comprehensive set of qualitative and quantitative results demonstrates the
effectiveness of our approach against the competing methods.
Comment: Accepted for publication in ACM Transactions on Graphic
Text2Scene: Text-driven Indoor Scene Stylization with Part-aware Details
We propose Text2Scene, a method to automatically create realistic textures
for virtual scenes composed of multiple objects. Guided by a reference image
and text descriptions, our pipeline adds detailed texture on labeled 3D
geometries in the room such that the generated colors respect the hierarchical
structure or semantic parts that are often composed of similar materials.
Instead of applying flat stylization on the entire scene in a single step, we
obtain weak semantic cues from geometric segmentation, which are further
clarified by assigning initial colors to segmented parts. Then we add texture
details for individual objects such that their projections on image space
exhibit a feature embedding aligned with the embedding of the input. The
decomposition makes the entire pipeline tractable with a moderate amount of
computational resources and memory. As our framework utilizes the existing
resources of image and text embedding, it does not require dedicated datasets
with high-quality textures designed by skillful artists. To the best of our
knowledge, it is the first practical and scalable approach that can create
detailed and realistic textures of the desired style that maintain structural
context for scenes with multiple objects.
Comment: Accepted to CVPR 202
ARF-Plus: Controlling Perceptual Factors in Artistic Radiance Fields for 3D Scene Stylization
Radiance field style transfer is an emerging field that has recently
gained popularity as a means of 3D scene stylization, thanks to the outstanding
performance of neural radiance fields in 3D reconstruction and view synthesis.
Motivated by existing concepts in 2D image style transfer, we highlight a
research gap in radiance field style transfer: the lack of sufficient
perceptual controllability. In this paper, we present ARF-Plus, a 3D neural
style transfer framework offering manageable control over perceptual factors, to
systematically explore the perceptual controllability in 3D scene stylization.
Four distinct types of controls - color preservation control, (style pattern)
scale control, spatial (selective stylization area) control, and depth
enhancement control - are proposed and integrated into this framework. Results
from real-world datasets, both quantitative and qualitative, show that the four
types of controls in our ARF-Plus framework successfully accomplish their
corresponding perceptual controls when stylizing 3D scenes. These techniques
work well for individual style inputs as well as for the simultaneous
application of multiple styles within a scene. This allows customized
modifications of stylization effects and flexible merging of the strengths of
different styles, ultimately enabling the creation of novel and eye-catching
stylistic effects on 3D scenes.
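
As an illustration of what a color preservation control can mean, the sketch
below uses luminance-only transfer, a technique known from 2D image style
transfer: the stylized result supplies the luminance and the content image
supplies the chrominance. This is an assumption made for illustration; the
abstract does not state how ARF-Plus implements its color preservation
control. The function name `preserve_color` and the flat list-of-pixels
layout are hypothetical.

```python
import colorsys


def preserve_color(content_rgb, stylized_rgb):
    """Keep the content image's colors while taking brightness from the
    stylized image.

    Both inputs are lists of (r, g, b) tuples with values in [0, 1].
    Each pixel is converted to YIQ; the output uses Y (luminance) from
    the stylized pixel and I, Q (chrominance) from the content pixel.
    Hypothetical helper for illustration, not the ARF-Plus method.
    """
    if len(content_rgb) != len(stylized_rgb):
        raise ValueError("images must have the same number of pixels")
    out = []
    for content_px, stylized_px in zip(content_rgb, stylized_rgb):
        _, i_chroma, q_chroma = colorsys.rgb_to_yiq(*content_px)
        luminance, _, _ = colorsys.rgb_to_yiq(*stylized_px)
        # yiq_to_rgb clamps each channel to [0, 1].
        out.append(colorsys.yiq_to_rgb(luminance, i_chroma, q_chroma))
    return out


content = [(0.6, 0.4, 0.4), (0.5, 0.5, 0.5)]
stylized = [(0.5, 0.5, 0.5), (0.2, 0.2, 0.2)]
recolored = preserve_color(content, stylized)
```

In this example the first output pixel keeps the reddish tint of the content
pixel but takes on the brightness of the gray stylized pixel.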
GenText: Unsupervised Artistic Text Generation via Decoupled Font and Texture Manipulation
Automatic artistic text generation is an emerging topic that is receiving
increasing attention due to its wide applications. Artistic text can be
divided into three components: content, font, and texture.
Existing artistic text generation models usually focus on manipulating one
of these components, which is a sub-optimal solution for
controllable general artistic text generation. To remedy this issue, we propose
a novel approach, namely GenText, to achieve general artistic text style
transfer by separately transferring the font and texture styles from different
source images to the target images in an unsupervised manner. Specifically, our
work incorporates three stages (stylization, destylization, and font
transfer) into a unified platform with a single powerful
encoder network and two separate style generator networks, one for font
transfer, the other for stylization and destylization. The destylization stage
first extracts the font style of the font reference image, then the font
transfer stage generates the target content with the desired font style.
Finally, the stylization stage renders the resulting font image with respect to
the texture style in the reference image. Moreover, considering the difficulty
of acquiring paired artistic text images, our model is designed under
the unsupervised setting, where all stages can be effectively optimized from
unpaired data. Qualitative and quantitative evaluations on artistic
text benchmarks demonstrate the superior performance of our proposed
model. The code with models will become publicly available in the future.