179 research outputs found
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
A central problem in machine learning involves modeling complex datasets
using highly flexible families of probability distributions in which learning,
sampling, inference, and evaluation are still analytically or computationally
tractable. Here, we develop an approach that simultaneously achieves both
flexibility and tractability. The essential idea, inspired by non-equilibrium
statistical physics, is to systematically and slowly destroy structure in a
data distribution through an iterative forward diffusion process. We then learn
a reverse diffusion process that restores structure in data, yielding a highly
flexible and tractable generative model of the data. This approach allows us to
rapidly learn, sample from, and evaluate probabilities in deep generative
models with thousands of layers or time steps, as well as to compute
conditional and posterior probabilities under the learned model. We
additionally release an open-source reference implementation of the algorithm.
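The forward process the abstract describes can be sketched in a few lines of numpy: repeatedly mix a small amount of Gaussian noise into the data until the original structure is gone. The linear beta schedule and toy data below are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def forward_diffuse(x0, betas, rng):
    """Iteratively destroy structure: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise."""
    x = x0.copy()
    trajectory = [x.copy()]
    for beta in betas:
        eps = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
        trajectory.append(x.copy())
    return trajectory

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # small per-step noise increments (assumed schedule)
rng = np.random.default_rng(0)
x0 = np.ones(2000)                   # toy "data": a strongly structured vector

traj = forward_diffuse(x0, betas, rng)

# After T steps almost no signal survives and the marginal is close to a
# standard Gaussian -- the fixed prior the learned reverse process starts from.
alpha_bar = np.prod(1.0 - betas)     # fraction of original signal variance retained
print(float(np.sqrt(alpha_bar)) < 0.01)  # → True
```

The learned part of the model is the reverse of this chain: a network trained to undo one small noising step at a time, which is tractable precisely because each step is a small Gaussian perturbation.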
Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model
The generation of co-speech gestures for digital humans is an emerging area
in the field of virtual human creation. Prior research has made progress by
using acoustic and semantic information as input and adopting classification
methods to identify the speaker's identity and emotion for driving co-speech gesture
generation. However, this endeavour still faces significant challenges. These
challenges go beyond the intricate interplay between co-speech gestures, speech
acoustics, and semantics; they also encompass the complexities associated with
personality, emotion, and other obscure but important factors. This paper
introduces "diffmotion-v2," a speech-conditional diffusion-based and
non-autoregressive transformer-based generative model with WavLM pre-trained
model. It can produce individual and stylized full-body co-speech gestures only
using raw speech audio, eliminating the need for complex multimodal processing
and manual annotation. Firstly, considering that speech audio not only
contains acoustic and semantic features but also conveys personality traits,
emotions, and more subtle information related to accompanying gestures, we
pioneer the adaptation of WavLM, a large-scale pre-trained model, to extract
low-level and high-level audio information. Secondly, we introduce an adaptive
layer norm architecture in the transformer-based layer to learn the
relationship between speech information and accompanying gestures. Extensive
subjective evaluation experiments are conducted on the Trinity, ZEGGS, and BEAT
datasets to confirm the WavLM and the model's ability to synthesize natural
co-speech gestures with various styles.Comment: 10 pages, 5 figures, 1 tabl
A note on the evaluation of generative models
Probabilistic generative models can be used for compression, denoising,
inpainting, texture synthesis, semi-supervised learning, unsupervised feature
learning, and other tasks. Given this wide range of applications, it is not
surprising that a lot of heterogeneity exists in the way these models are
formulated, trained, and evaluated. As a consequence, direct comparison between
models is often difficult. This article reviews mostly known but often
underappreciated properties relating to the evaluation and interpretation of
generative models with a focus on image models. In particular, we show that
three of the currently most commonly used criteria---average log-likelihood,
Parzen window estimates, and visual fidelity of samples---are largely
independent of each other when the data is high-dimensional. Good performance
with respect to one criterion therefore need not imply good performance with
respect to the other criteria. Our results show that extrapolation from one
criterion to another is not warranted and generative models need to be
evaluated directly with respect to the application(s) they were intended for.
In addition, we provide examples demonstrating that Parzen window estimates
should generally be avoided.
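The Parzen window criterion the note warns about is easy to state: fit an isotropic Gaussian kernel density to model samples and report the average log-likelihood of held-out data under it. The sketch below (my own minimal version, not the note's code) shows that even when model samples and test data come from the exact same distribution, the Parzen estimate in high dimensions falls well short of the true log-likelihood.

```python
import numpy as np

def parzen_log_likelihood(samples, test_points, sigma):
    """Average log-likelihood of test points under an isotropic Gaussian
    Parzen window fit to model samples."""
    d = samples.shape[1]
    # log p(x) = logsumexp_i[-||x - s_i||^2 / (2 sigma^2)] - log N - (d/2) log(2 pi sigma^2)
    diffs = test_points[:, None, :] - samples[None, :, :]
    sq = (diffs ** 2).sum(-1) / (2.0 * sigma ** 2)
    m = (-sq).max(axis=1, keepdims=True)            # stabilized logsumexp
    lse = m[:, 0] + np.log(np.exp(-sq - m).sum(axis=1))
    const = np.log(len(samples)) + 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    return float(np.mean(lse - const))

rng = np.random.default_rng(0)
d = 50                                    # high-dimensional, as in the note's setting
samples = rng.standard_normal((500, d))   # "model" samples from N(0, I)
test = rng.standard_normal((200, d))      # held-out data from the same N(0, I)

# True expected log-likelihood of N(0, I) data under N(0, I):
true_ll = -0.5 * d * (1 + np.log(2 * np.pi))
parzen_ll = parzen_log_likelihood(samples, test, sigma=1.0)
print(parzen_ll < true_ll)  # → True: Parzen underestimates in high dimensions
```

The gap comes from the curse of dimensionality: with any feasible number of samples, every test point is far from every kernel center, so the estimate is dominated by kernel bandwidth rather than model quality.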
Generative Image Modeling Using Spatial LSTMs
Modeling the distribution of natural images is challenging, partly because of
strong statistical dependencies which can extend over hundreds of pixels.
Recurrent neural networks have been successful in capturing long-range
dependencies in a number of problems but only recently have found their way
into generative image models. We here introduce a recurrent image model based
on multi-dimensional long short-term memory units which are particularly suited
for image modeling due to their spatial structure. Our model scales to images
of arbitrary size and its likelihood is computationally tractable. We find that
it outperforms the state of the art in quantitative comparisons on several
image datasets and produces promising results when used for texture synthesis
and inpainting.
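The causal structure behind a multi-dimensional LSTM image model can be sketched directly: scan pixels in raster order, letting each position's state depend on its own pixel value plus the hidden states of the left and upper neighbours. The gate layout and weight names below are illustrative, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mdlstm_scan(image, W_in, U_left, U_up, b, n=8):
    """Raster-scan 2D LSTM sketch. Gate layout in z: input, forget-left,
    forget-up, output, candidate. Information flows only from "past" pixels,
    the causal structure needed for autoregressive image modeling."""
    H, W = image.shape
    h = np.zeros((H, W, n))
    c = np.zeros((H, W, n))
    for i in range(H):
        for j in range(W):
            h_l = h[i, j - 1] if j > 0 else np.zeros(n)
            h_u = h[i - 1, j] if i > 0 else np.zeros(n)
            c_l = c[i, j - 1] if j > 0 else np.zeros(n)
            c_u = c[i - 1, j] if i > 0 else np.zeros(n)
            z = image[i, j] * W_in + h_l @ U_left + h_u @ U_up + b
            i_g, f_l, f_u, o_g = (sigmoid(z[k * n:(k + 1) * n]) for k in range(4))
            g = np.tanh(z[4 * n:])
            c[i, j] = f_l * c_l + f_u * c_u + i_g * g   # two forget gates: left and up
            h[i, j] = o_g * np.tanh(c[i, j])
    return h

rng = np.random.default_rng(0)
n = 8
W_in = rng.standard_normal(5 * n) * 0.5
U_left = rng.standard_normal((n, 5 * n)) * 0.5
U_up = rng.standard_normal((n, 5 * n)) * 0.5
b = np.zeros(5 * n)

img = rng.standard_normal((6, 6))
h1 = mdlstm_scan(img, W_in, U_left, U_up, b, n)

# Causality check: perturbing pixel (3, 3) can only affect positions
# with row >= 3 AND column >= 3.
img2 = img.copy()
img2[3, 3] += 5.0
h2 = mdlstm_scan(img2, W_in, U_left, U_up, b, n)
print(np.allclose(h1[:3], h2[:3]))  # → True: earlier rows are unaffected
```

Because each pixel sees only its causal neighbourhood, the model's likelihood factorizes over pixels, which is what makes it tractable for images of arbitrary size.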
- …