A Survey on Generative Diffusion Model
Deep learning shows excellent potential in generation tasks thanks to its
deep latent representations. Generative models are a class of models that can
generate observations randomly with respect to certain implied parameters.
Recently, the diffusion model has emerged as a powerful class of generative
models owing to its strong generation ability, and great achievements have
already been reached. Beyond computer vision, speech generation,
bioinformatics, and natural language processing, more applications remain to
be explored in this field. However, the diffusion model has inherent
drawbacks: a slow generation process, restriction to a single data type, low
likelihood, and the inability to perform dimension reduction. These
limitations have led to many enhanced works. This survey summarizes the field
of diffusion models. We first state the main problem through two landmark
works, DDPM and DSM, and a unifying landmark work, Score SDE. Then, we present
improved techniques for existing problems in the diffusion-based model field,
covering model speed-up, data structure diversification, likelihood
optimization, and dimension reduction. For existing models, we also provide a
benchmark of FID score, IS, and NLL at specific NFE. Moreover, applications of
diffusion models are introduced, including computer vision, sequence modeling,
audio, and AI for science. Finally, we summarize this field together with its
limitations and further directions. A summary of existing well-classified
methods is available on our
GitHub: https://github.com/chq1155/A-Survey-on-Generative-Diffusion-Model
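To make the DDPM formulation mentioned above concrete, the following minimal sketch shows the standard noise-prediction training loss; `eps_model` and the linear beta schedule are illustrative assumptions rather than code from any of the surveyed works.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, T=1000):
    """Minimal DDPM noise-prediction loss (illustrative sketch, not survey code).

    eps_model(x_t, t) is any network that predicts the noise injected at step t.
    """
    # Linear beta schedule and cumulative products of (1 - beta)
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                          # random timestep per sample
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))

    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward process q(x_t | x_0)

    # The network is trained to recover the injected noise
    return F.mse_loss(eps_model(x_t, t), noise)
```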
HiFi-123: Towards High-fidelity One Image to 3D Content Generation
Recent advances in text-to-image diffusion models have enabled 3D generation
from a single image. However, current image-to-3D methods often produce
suboptimal results for novel views, with blurred textures and deviations from
the reference image, limiting their practical applications. In this paper, we
introduce HiFi-123, a method designed for high-fidelity and multi-view
consistent 3D generation. Our contributions are twofold: First, we propose a
reference-guided novel view enhancement technique that substantially reduces
the quality gap between synthesized and reference views. Second, capitalizing
on the novel view enhancement, we present a novel reference-guided state
distillation loss. When incorporated into the optimization-based image-to-3D
pipeline, our method significantly improves 3D generation quality, achieving
state-of-the-art performance. Comprehensive evaluations demonstrate the
effectiveness of our approach over existing methods, both qualitatively and
quantitatively.
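The optimization-based image-to-3D pipeline that the reference-guided loss plugs into typically follows a score-distillation recipe. The sketch below shows one generic distillation update under that assumption; `render_fn`, `eps_model`, and `alphas_cumprod` are hypothetical stand-ins for a differentiable renderer, a frozen diffusion noise predictor, and its noise schedule. This is not HiFi-123's reference-guided state distillation itself.

```python
import torch

def distillation_step(render_fn, params, eps_model, alphas_cumprod, w=1.0):
    # Generic score-distillation update (illustrative sketch; not HiFi-123's RGSD loss).
    x = render_fn(params)                                  # rendered view; grads flow back into params
    t = torch.randint(20, 980, (1,))                       # random diffusion timestep
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(x)
    x_t = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * noise  # noise the rendering

    with torch.no_grad():
        eps_pred = eps_model(x_t, t)                       # frozen 2D diffusion prior

    # Score-distillation gradient (the diffusion U-Net Jacobian is dropped, as usual)
    grad = w * (eps_pred - noise)
    x.backward(gradient=grad)                              # pushes the 3D parameters toward the prior
```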
Manipulating Attributes of Natural Scenes via Hallucination
In this study, we explore building a two-stage framework for enabling users
to directly manipulate high-level attributes of a natural scene. The key to our
approach is a deep generative network which can hallucinate images of a scene
as if they were taken at a different season (e.g. during winter), weather
condition (e.g. on a cloudy day) or time of day (e.g. at sunset). Once the
scene is hallucinated with the given attributes, the corresponding look is then
transferred to the input image while keeping the semantic details intact,
giving a photo-realistic manipulation result. As the proposed framework
hallucinates what the scene will look like, it does not require any reference
style image, as commonly utilized in most appearance or style transfer
approaches. Moreover, it allows a given scene to be manipulated simultaneously
according to a diverse set of transient attributes within a single model,
eliminating the need to train multiple networks for each translation task.
Our comprehensive set of qualitative and quantitative results demonstrates the
effectiveness of our approach against the competing methods.
Comment: Accepted for publication in ACM Transactions on Graphics
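Read end to end, the two stages compose as a simple inference pipeline; the sketch below only illustrates that flow, with `hallucinate` and `transfer_style` as hypothetical stand-ins for the generative network and the look-transfer step described above.

```python
def manipulate_scene(input_image, target_attributes, hallucinate, transfer_style):
    """Illustrative two-stage sketch; hallucinate and transfer_style are hypothetical stand-ins."""
    # Stage 1: generate a version of the scene under the requested transient
    # attributes (e.g. a different season, weather condition, or time of day).
    hallucinated = hallucinate(input_image, target_attributes)
    # Stage 2: transfer the hallucinated look onto the input image while
    # keeping its semantic details intact.
    return transfer_style(content=input_image, style=hallucinated)
```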
Semantically Guided Depth Upsampling
We present a novel method for accurate and efficient upsampling of sparse
depth data, guided by high-resolution imagery. Our approach goes beyond the use
of intensity cues only and additionally exploits object boundary cues through
structured edge detection and semantic scene labeling for guidance. Both cues
are combined within a geodesic distance measure that allows for
boundary-preserving depth interpolation while utilizing local context. We
model the observed scene structure by locally planar elements and formulate the
upsampling task as a global energy minimization problem. Our method determines
globally consistent solutions and preserves fine details and sharp depth
boundaries. In our experiments on several public datasets at different levels
of application, we demonstrate superior performance of our approach over the
state-of-the-art, even for very sparse measurements.
Comment: German Conference on Pattern Recognition 2016 (Oral)
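A much simplified way to see how a geodesic distance can combine intensity, edge, and semantic cues is a nearest-seed propagation from the sparse measurements. The sketch below illustrates that idea only; the weights are arbitrary assumptions, and the locally planar model and global energy minimization of the actual method are omitted.

```python
import heapq
import numpy as np

def geodesic_depth_upsampling(sparse_depth, intensity, edges, labels,
                              w_i=1.0, w_e=2.0, w_s=4.0):
    """Assign each pixel the depth of its geodesically nearest sparse measurement.

    Simplified illustrative sketch, not the published method: the step cost
    combines intensity differences, edge strength, and semantic-label changes.
    sparse_depth uses NaN for missing values; all inputs are HxW arrays.
    """
    h, w = sparse_depth.shape
    dist = np.full((h, w), np.inf)
    depth = np.full((h, w), np.nan)

    heap = []
    for (y, x) in zip(*np.nonzero(~np.isnan(sparse_depth))):
        dist[y, x] = 0.0
        depth[y, x] = sparse_depth[y, x]
        heapq.heappush(heap, (0.0, y, x))

    # Dijkstra over the 4-connected pixel grid
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y, x]:
            continue
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w):
                continue
            # Crossing strong image edges or label boundaries is expensive,
            # so the interpolation respects object boundaries.
            cost = (w_i * abs(float(intensity[ny, nx]) - float(intensity[y, x]))
                    + w_e * float(edges[ny, nx])
                    + w_s * float(labels[ny, nx] != labels[y, x]) + 1e-3)
            nd = d + cost
            if nd < dist[ny, nx]:
                dist[ny, nx] = nd
                depth[ny, nx] = depth[y, x]
                heapq.heappush(heap, (nd, ny, nx))
    return depth
```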
Pose-Guided High-Resolution Appearance Transfer via Progressive Training
We propose a novel pose-guided appearance transfer network for transferring a
given reference appearance to a target pose in unprecedented image resolution
(1024 * 1024), given respectively an image of the reference and target person.
No 3D model is used. Instead, our network utilizes dense local descriptors
including local perceptual loss and local discriminators to refine details,
which is trained progressively in a coarse-to-fine manner to produce
high-resolution output that faithfully preserves the complex appearance of
garment textures and geometry, while seamlessly hallucinating the transferred
appearances, including those with dis-occlusion. Our progressive encoder-decoder
architecture can learn the reference appearance inherent in the input image at
multiple scales. Extensive experimental results on the Human3.6M dataset, the
DeepFashion dataset, and our dataset collected from YouTube show that our model
produces high-quality images, which can be further utilized in useful
applications such as garment transfer between people and pose-guided human
video generation.
Comment: 10 pages, 10 figures, 2 tables
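The coarse-to-fine schedule described above can be summarized as a staged training loop. This is a rough sketch under assumed interfaces (`model.grow_to`, per-resolution loaders, and a combined `losses` callable are hypothetical), not the paper's implementation.

```python
def progressive_training(model, loaders, losses, optimizer, steps_per_stage=10_000):
    """Illustrative coarse-to-fine training loop; all interfaces are assumed."""
    # loaders maps a resolution to its data loader, e.g. {256: ..., 512: ..., 1024: ...}
    for resolution, loader in loaders.items():
        model.grow_to(resolution)                    # activate layers for this scale
        for step, (reference, target_pose, target) in enumerate(loader):
            if step >= steps_per_stage:
                break
            output = model(reference, target_pose)
            # losses() is assumed to combine local perceptual terms with
            # local (patch) discriminator terms at the current resolution
            loss = losses(output, target, resolution)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```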
VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild
We present VideoReTalking, a new system to edit the faces of a real-world
talking head video according to input audio, producing a high-quality and
lip-syncing output video even with a different emotion. Our system disentangles
this objective into three sequential tasks: (1) face video generation with a
canonical expression; (2) audio-driven lip-sync; and (3) face enhancement for
improving photo-realism. Given a talking-head video, we first modify the
expression of each frame according to the same expression template using the
expression editing network, resulting in a video with the canonical expression.
This video, together with the given audio, is then fed into the lip-sync
network to generate a lip-syncing video. Finally, we improve the photo-realism
of the synthesized faces through an identity-aware face enhancement network and
post-processing. We use learning-based approaches for all three steps, and all
our modules can be applied in a sequential pipeline without any user
intervention. Furthermore, our system is a generic approach that does not need
to be retrained for a specific person. Evaluations on two widely-used datasets
and in-the-wild examples demonstrate the superiority of our framework over
other state-of-the-art methods in terms of lip-sync accuracy and visual
quality.
Comment: Accepted by SIGGRAPH Asia 2022 Conference Proceedings. Project page:
https://vinthony.github.io/video-retalking
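Since the three stages run sequentially without user intervention, the whole system can be viewed as a simple pipeline. The sketch below mirrors that structure only; the three callables are hypothetical stand-ins for the expression-editing, lip-sync, and identity-aware enhancement networks.

```python
def video_retalking(frames, audio, edit_expression, lip_sync, enhance_face):
    """Illustrative three-stage pipeline; the callables are hypothetical stand-ins."""
    # (1) Re-render every frame with the canonical expression template
    canonical = [edit_expression(frame) for frame in frames]
    # (2) Drive lip motion from the input audio
    synced = lip_sync(canonical, audio)
    # (3) Restore photo-realism of the synthesized face regions
    return [enhance_face(frame) for frame in synced]
```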