Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
Large text-to-image diffusion models have exhibited impressive proficiency in
generating high-quality images. However, when applying these models to the video
domain, ensuring temporal consistency across video frames remains a formidable
challenge. This paper proposes a novel zero-shot text-guided video-to-video
translation framework to adapt image models to videos. The framework includes
two parts: key frame translation and full video translation. The first part
uses an adapted diffusion model to generate key frames, with hierarchical
cross-frame constraints applied to enforce coherence in shapes, textures and
colors. The second part propagates the key frames to other frames with
temporal-aware patch matching and frame blending. Our framework achieves global
style and local texture temporal consistency at a low cost (without re-training
or optimization). The adaptation is compatible with existing image diffusion
techniques, allowing our framework to take advantage of them, such as
customizing a specific subject with LoRA, and introducing extra spatial
guidance with ControlNet. Extensive experimental results demonstrate the
effectiveness of our proposed framework over existing methods in rendering
high-quality and temporally-coherent videos.
Comment: Accepted to SIGGRAPH Asia 2023. Project page:
https://www.mmlab-ntu.com/project/rerender
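The second stage described in this abstract propagates translated key frames to the frames between them. As a rough illustration only, the sketch below cross-fades two key frames according to temporal distance. It leaves out the temporal-aware patch matching the abstract mentions, and the function name `blend_between_keyframes` and the linear weighting are assumptions made for this example, not the authors' method.

```python
def blend_between_keyframes(key_a, key_b, idx_a, idx_b, idx):
    """Cross-fade two key frames to approximate an in-between frame.

    key_a, key_b: 2D lists of floats (grayscale frames of equal size).
    idx_a, idx_b: frame indices of the two key frames (idx_a < idx_b).
    idx: index of the frame to synthesize, with idx_a <= idx <= idx_b.
    Hypothetical helper: linear weighting is an assumption, not the
    blending scheme used in the paper.
    """
    if idx_a >= idx_b or not (idx_a <= idx <= idx_b):
        raise ValueError("expected idx_a <= idx <= idx_b and idx_a < idx_b")
    # Weight of the later key frame grows linearly with temporal distance.
    w = (idx - idx_a) / (idx_b - idx_a)
    return [
        [(1 - w) * a + w * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(key_a, key_b)
    ]
```

For example, with key frames at indices 0 and 4, frame 1 takes three quarters of its value from the first key frame and one quarter from the second.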
Volumetric cloud generation using a Chinese brush calligraphy style
Includes bibliographical references.
Clouds are an important feature of any real or simulated environment in which the sky is visible. Their amorphous, ever-changing and illuminated features make the sky vivid and beautiful. However, these features increase both the complexity of real time rendering and modelling. It is difficult to design and build volumetric clouds in an easy and intuitive way, particularly if the interface is intended for artists rather than programmers. We propose a novel modelling system motivated by an ancient painting style, Chinese Landscape Painting, to address this problem. With the use of only one brush and one colour, an artist can paint a vivid and detailed landscape efficiently. In this research, we develop three emulations of a Chinese brush: a skeleton-based brush, a 2D texture footprint and a dynamic 3D footprint, all driven by the motion and pressure of a stylus pen. We propose a hybrid mapping to generate both the body and surface of volumetric clouds from the brush footprints. Our interface integrates these components along with 3D canvas control and GPU-based volumetric rendering into an interactive cloud modelling system. Our cloud modelling system is able to create various types of clouds occurring in nature. User tests indicate that our brush calligraphy approach is preferred to conventional volumetric cloud modelling and that it produces convincing 3D cloud formations in an intuitive and interactive fashion. While traditional modelling systems focus on surface generation of 3D objects, our brush calligraphy technique constructs the interior structure. This forms the basis of a new modelling style for objects with amorphous shape.
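The abstract describes brush footprints driven by stylus motion and pressure that fill the interior of a cloud volume. The sketch below is a loose analogue under stated assumptions: it stamps a spherical footprint whose radius scales with pressure into a voxel density grid. The function `stamp_stroke`, the linear density falloff, and the `max_radius` parameter are all hypothetical; the thesis's hybrid mapping is not specified in the abstract and is not reproduced here.

```python
def stamp_stroke(grid_size, samples, max_radius=3.0):
    """Stamp pressure-scaled spherical footprints into a density grid.

    grid_size: edge length of the cubic voxel grid.
    samples: list of (x, y, z, pressure) stylus samples, pressure in [0, 1].
    Returns grid[x][y][z] with densities in [0, 1].
    Hypothetical sketch: footprint shape and falloff are assumptions.
    """
    grid = [[[0.0] * grid_size for _ in range(grid_size)]
            for _ in range(grid_size)]
    for sx, sy, sz, pressure in samples:
        radius = max_radius * pressure
        if radius <= 0:
            continue  # no pressure, no footprint
        reach = int(radius) + 1
        for x in range(max(0, int(sx) - reach), min(grid_size, int(sx) + reach + 1)):
            for y in range(max(0, int(sy) - reach), min(grid_size, int(sy) + reach + 1)):
                for z in range(max(0, int(sz) - reach), min(grid_size, int(sz) + reach + 1)):
                    d = ((x - sx) ** 2 + (y - sy) ** 2 + (z - sz) ** 2) ** 0.5
                    if d < radius:
                        # Density falls off linearly from the footprint centre;
                        # overlapping stamps keep the larger value.
                        grid[x][y][z] = max(grid[x][y][z], 1.0 - d / radius)
    return grid
```

Keeping the maximum of overlapping stamps, rather than summing them, is a design choice made here so that slow strokes do not saturate the volume.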
Controllable Multi-domain Semantic Artwork Synthesis
We present a novel framework for multi-domain synthesis of artwork from
semantic layouts. One of the main limitations of this challenging task is the
lack of publicly available segmentation datasets for art synthesis. To address
this problem, we propose a dataset, which we call ArtSem, that contains 40,000
images of artwork from 4 different domains with their corresponding semantic
label maps. We generate the dataset by first extracting semantic maps from
landscape photography and then using a proposed conditional Generative Adversarial
Network (GAN)-based approach to generate high-quality artwork from the semantic
maps without necessitating paired training data. Furthermore, we propose an
artwork synthesis model that uses domain-dependent variational encoders for
high-quality multi-domain synthesis. The model is improved and complemented
with a simple but effective normalization method, based on jointly normalizing the
semantic and style information, which we call Spatially STyle-Adaptive
Normalization (SSTAN). In contrast to previous methods that only take semantic
layout as input, our model is able to learn a joint representation of both
style and semantic information, which leads to better generation quality for
synthesizing artistic images. Results indicate that our model learns to
separate the domains in the latent space, and thus, by identifying the
hyperplanes that separate the different domains, we can also perform
fine-grained control of the synthesized artwork. By combining our proposed
dataset and approach, we are able to generate user-controllable artwork that is
of higher quality than existing
Comment: 15 pages, accepted by CVMJ, to appea
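The abstract states that identifying the hyperplanes separating domains in latent space enables fine-grained control. One common way to use such a hyperplane, shown here as a minimal sketch and not as the paper's procedure, is to move a latent code along the hyperplane's unit normal. The function names, the list-based vectors, and the `bias` parameter are assumptions for illustration.

```python
def signed_distance(z, normal, bias=0.0):
    """Signed distance of latent code z from the hyperplane normal . z + bias = 0."""
    norm = sum(n * n for n in normal) ** 0.5
    if norm == 0:
        raise ValueError("hyperplane normal must be non-zero")
    return (sum(zi * ni for zi, ni in zip(z, normal)) + bias) / norm


def shift_across_hyperplane(z, normal, step):
    """Move latent code z by `step` units along the hyperplane's unit normal.

    Hypothetical helper: a positive step pushes z toward the domain on the
    positive side of the hyperplane, a negative step toward the other side.
    """
    norm = sum(n * n for n in normal) ** 0.5
    if norm == 0:
        raise ValueError("hyperplane normal must be non-zero")
    return [zi + step * ni / norm for zi, ni in zip(z, normal)]
```

In this sketch the step size plays the role of the control knob: small steps keep the code near its original domain, larger steps carry it across the boundary.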
Semantic Photo Manipulation with a Generative Image Prior
Despite the recent success of GANs in synthesizing images conditioned on
inputs such as a user sketch, text, or semantic labels, manipulating the
high-level attributes of an existing natural photograph with GANs is
challenging for two reasons. First, it is hard for GANs to precisely reproduce
an input image. Second, after manipulation, the newly synthesized pixels often
do not fit the original image. In this paper, we address these issues by
adapting the image prior learned by GANs to image statistics of an individual
image. Our method can accurately reconstruct the input image and synthesize new
content, consistent with the appearance of the input image. We demonstrate our
interactive system on several semantic image editing tasks, including
synthesizing new objects consistent with background, removing unwanted objects,
and changing the appearance of an object. Quantitative and qualitative
comparisons against several existing methods demonstrate the effectiveness of
our method.
Comment: SIGGRAPH 201
DiffUTE: Universal Text Editing Diffusion Model
Diffusion-model-based language-guided image editing has achieved great
success recently. However, existing state-of-the-art diffusion models struggle
with rendering correct text and text style during generation. To tackle this
problem, we propose a universal self-supervised text editing diffusion model
(DiffUTE), which aims to replace or modify words in the source image with
other words while maintaining a realistic appearance. Specifically, we build
our model on a diffusion model and carefully modify the network structure to
enable the model to draw multilingual characters with the help of glyph and
position information. Moreover, we design a self-supervised learning framework
to leverage large amounts of web data to improve the representation ability of
the model. Experimental results show that our method achieves impressive
performance and enables controllable editing on in-the-wild images with high
fidelity. Our code will be available at
https://github.com/chenhaoxing/DiffUTE