Conditional WaveGAN
Generative models have been used successfully for image synthesis in recent
years, but little progress has been made for other modalities such as audio
and text. Recent works focus on generating audio from a generative model in
an unsupervised setting. We explore the possibility of using generative
models conditioned on class labels: concatenation-based conditioning and
conditional scaling were explored in this work, together with various
hyper-parameter tuning methods. In this paper we introduce the Conditional
WaveGAN (cWaveGAN). Find our implementation at
https://github.com/acheketa/cwavegan
Comment: Preprint
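A minimal sketch (PyTorch) of the two conditioning schemes named in the abstract, applied to a hidden activation of the generator; the module names and shapes are illustrative assumptions, not taken from the paper's code:

```python
import torch
import torch.nn as nn

class ConcatConditioning(nn.Module):
    """Concatenation-based conditioning: tile a label embedding along time
    and append it to the feature map as extra channels."""
    def __init__(self, num_classes, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)

    def forward(self, h, y):                 # h: (batch, channels, time)
        e = self.embed(y)                    # (batch, embed_dim)
        e = e.unsqueeze(-1).expand(-1, -1, h.size(-1))
        return torch.cat([h, e], dim=1)      # extra conditioning channels

class ConditionalScaling(nn.Module):
    """Conditional scaling: multiply feature maps by a label-dependent
    per-channel gain."""
    def __init__(self, num_classes, channels):
        super().__init__()
        self.scale = nn.Embedding(num_classes, channels)

    def forward(self, h, y):
        s = self.scale(y).unsqueeze(-1)      # (batch, channels, 1)
        return h * s                         # label-dependent gating
```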
Speech-Driven Facial Reenactment Using Conditional Generative Adversarial Networks
We present a novel approach to generating photo-realistic images of a face
with accurate lip sync, given an audio input. Using a recurrent neural
network, we predict mouth landmarks from audio features. We then exploit the
power of conditional generative adversarial networks to produce highly
realistic faces conditioned on a set of landmarks. Together, these two
networks are capable of producing a sequence of natural faces in sync with an
input audio track.
Comment: Submitted for ECCV 2018
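A hedged sketch of the two-stage pipeline this abstract describes: a recurrent network maps audio features to mouth landmarks, and a conditional generator then renders a face from those landmarks. Feature dimensions and module names are assumptions for illustration:

```python
import torch.nn as nn

class Audio2Landmarks(nn.Module):
    """Stage 1: predict per-frame mouth landmark coordinates from audio."""
    def __init__(self, n_audio_feats=26, n_coords=20 * 2, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_audio_feats, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_coords)

    def forward(self, audio_feats):          # (batch, frames, n_audio_feats)
        h, _ = self.rnn(audio_feats)
        return self.out(h)                   # (batch, frames, n_coords)

# Stage 2 (not shown): a conditional GAN generator takes a landmark map and
# produces the face image, while the discriminator scores (image, landmarks)
# pairs so realism is judged jointly with landmark consistency.
```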
Adversarial Audio Synthesis
Audio signals are sampled at high temporal resolutions, and learning to
synthesize audio requires capturing structure across a range of timescales.
Generative adversarial networks (GANs) have seen wide success at generating
images that are both locally and globally coherent, but they have seen little
application to audio generation. In this paper we introduce WaveGAN, a first
attempt at applying GANs to unsupervised synthesis of raw-waveform audio.
WaveGAN is capable of synthesizing one-second slices of audio waveforms with
global coherence, suitable for sound effect generation. Our experiments
demonstrate that, without labels, WaveGAN learns to produce intelligible words
when trained on a small-vocabulary speech dataset, and can also synthesize
audio from other domains such as drums, bird vocalizations, and piano. We
compare WaveGAN to a method that applies GANs designed for image generation
to image-like audio feature representations, finding both approaches to be
promising.
Comment: Published as a conference paper at ICLR 2019
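A minimal sketch of a WaveGAN-style generator: a latent vector is repeatedly upsampled by strided 1-D transposed convolutions into a raw one-second waveform (16384 samples at 16 kHz). The layer sizes below follow the paper's published configuration but should be treated as illustrative:

```python
import torch
import torch.nn as nn

class WaveGANGenerator(nn.Module):
    def __init__(self, z_dim=100, d=64):
        super().__init__()
        self.fc = nn.Linear(z_dim, 16 * d * 16)       # seed: (16d, 16)

        def up(cin, cout):                            # 4x temporal upsampling
            return nn.Sequential(
                nn.ConvTranspose1d(cin, cout, kernel_size=25, stride=4,
                                   padding=11, output_padding=1),
                nn.ReLU())

        self.net = nn.Sequential(
            up(16 * d, 8 * d), up(8 * d, 4 * d), up(4 * d, 2 * d),
            up(2 * d, d),
            nn.ConvTranspose1d(d, 1, kernel_size=25, stride=4,
                               padding=11, output_padding=1),
            nn.Tanh())                                # waveform in [-1, 1]

    def forward(self, z):
        h = self.fc(z).view(z.size(0), -1, 16)        # (batch, 16d, 16)
        return self.net(h)                            # (batch, 1, 16384)
```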
Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks
Speech is a rich biometric signal that contains information about the
identity, gender, and emotional state of the speaker. In this work, we
explore its potential to generate face images of a speaker by conditioning a
Generative Adversarial Network (GAN) on raw speech input. We propose a deep
neural network that is trained from scratch in an end-to-end fashion,
generating a face directly from the raw speech waveform without any
additional identity information (e.g., a reference image or one-hot
encoding). Our model is trained in a self-supervised manner by exploiting
the audio and visual signals naturally aligned in videos. To enable training
from video data, we present a novel dataset collected for this work, with
high-quality videos of YouTubers showing notable expressiveness in both the
speech and visual signals.
Comment: ICASSP 2019. Project website at https://imatge-upc.github.io/wav2pix
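A hedged sketch of the core idea: strided 1-D convolutions encode the raw waveform into a fixed-size speech embedding, which then conditions an image generator. All layer sizes are assumptions for illustration:

```python
import torch.nn as nn

# Raw waveform (batch, 1, samples) -> fixed-size speech embedding (batch, 128).
speech_encoder = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
    nn.Conv1d(32, 64, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
    nn.Conv1d(64, 128, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten())

# The GAN generator consumes this embedding (optionally with noise) and
# upsamples it to a face image; conditioning the discriminator on the same
# embedding forces identity cues in the voice to match the generated face.
```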
Bandwidth Extension on Raw Audio via Generative Adversarial Networks
Neural network-based methods have recently demonstrated state-of-the-art
results on image synthesis and super-resolution tasks, in particular by using
variants of generative adversarial networks (GANs) with supervised feature
losses. Nevertheless, previous feature loss formulations rely on the
availability of large auxiliary classifier networks, and labeled datasets that
enable such classifiers to be trained. Furthermore, there has been
comparatively little work to explore the applicability of GAN-based methods to
domains other than images and video. In this work we explore a GAN-based method
for audio processing, and develop a convolutional neural network architecture
to perform audio super-resolution. In addition to several new architectural
building blocks for audio processing, a key component of our approach is the
use of an autoencoder-based loss that enables training in the GAN framework,
with feature losses derived from unlabeled data. We explore the impact of our
architectural choices, and demonstrate significant improvements over previous
works in terms of both objective and perceptual quality.
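A minimal sketch of the autoencoder-based feature loss this abstract describes: an autoencoder is pretrained on unlabeled audio, and the super-resolution generator is trained to match the encoder's activations on the reference audio. Function and variable names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def feature_loss(encoder, reference_audio, generated_audio):
    """L2 distance between encoder activations of reference and generated
    audio; the encoder comes from an autoencoder trained on unlabeled data,
    so no labels or auxiliary classifier are required."""
    with torch.no_grad():
        target_feats = encoder(reference_audio)   # fixed targets
    gen_feats = encoder(generated_audio)          # gradients reach generator
    return F.mse_loss(gen_feats, target_feats)

# Typical generator objective (weights are hyper-parameters):
#   loss_G = loss_adversarial + lambda_feat * feature_loss(enc, hr, sr)
```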
Video-to-Video Translation for Visual Speech Synthesis
Despite remarkable success in image-to-image translation, which celebrates
the advancements of generative adversarial networks (GANs), very few
attempts are known for video-domain translation. We study the task of
video-to-video translation in the context of visual speech generation, where
the goal is to transform an input video of any spoken word into an output
video of a different word. This is a multi-domain translation, where each
word forms a domain of videos uttering this word. Adapting the
state-of-the-art image-to-image translation model (StarGAN) to this setting
falls short with a large vocabulary size. Instead, we propose to use
character encodings of the words and design a novel character-based GAN
architecture for video-to-video translation called Visual Speech GAN
(ViSpGAN). We are the first to demonstrate video-to-video translation with a
vocabulary of 500 words.
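A hedged sketch of character-based conditioning: rather than one domain label per word, which does not scale to 500 words, each target word is encoded as a fixed-size matrix of one-hot characters that conditions the generator. The alphabet and padding scheme below are assumptions:

```python
import torch

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
MAX_LEN = 12                                  # assumed maximum word length

def encode_word(word: str) -> torch.Tensor:
    """One-hot character encoding, zero-padded to a fixed length."""
    code = torch.zeros(MAX_LEN, len(ALPHABET))
    for i, ch in enumerate(word.lower()[:MAX_LEN]):
        if ch in ALPHABET:                    # skip non-alphabetic characters
            code[i, ALPHABET.index(ch)] = 1.0
    return code                               # (MAX_LEN, 26) conditioning code

# Because words share characters, the generator can generalize across the
# vocabulary from this compact code instead of a per-word domain label.
```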
Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning
Talking face generation aims to synthesize a face video with precise lip
synchronization as well as a smooth transition of facial motion over the
entire video, given a speech clip and a facial image. Most existing methods
focus mainly on either disentangling the information in a single image or
learning temporal information between frames. However, cross-modality
coherence between audio and video information has not been well addressed
during synthesis. In this paper, we propose a novel arbitrary talking face
generation framework that discovers audio-visual coherence via the proposed
Asymmetric Mutual Information Estimator (AMIE). In addition, we propose a
Dynamic Attention (DA) block that selectively focuses on the lip area of the
input image during the training stage, to further enhance lip
synchronization. Experimental results on the benchmark LRW and GRID datasets
surpass the state-of-the-art methods on prevalent metrics, with robust
high-resolution synthesis across gender and
pose variations.
Comment: IJCAI-2020
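The paper's AMIE is its own estimator; as a generic stand-in for the idea of maximizing audio-visual mutual information, here is a MINE-style estimator based on the Donsker-Varadhan bound. This is an illustrative sketch, not the paper's exact formulation:

```python
import math
import torch
import torch.nn as nn

class MIEstimator(nn.Module):
    """Neural lower bound on I(audio; visual), MINE-style."""
    def __init__(self, audio_dim, visual_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, a, v):
        joint = self.net(torch.cat([a, v], dim=1))     # aligned pairs
        v_shuffled = v[torch.randperm(v.size(0))]      # break the alignment
        marginal = self.net(torch.cat([a, v_shuffled], dim=1))
        # Donsker-Varadhan: I(A;V) >= E_joint[T] - log E_marginal[exp(T)]
        log_mean_exp = (torch.logsumexp(marginal, dim=0)
                        - math.log(marginal.size(0)))
        return joint.mean() - log_mean_exp.squeeze()
```

Maximizing this estimate during training would encourage the synthesized video to stay coherent with the input audio.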
Neural separation of observed and unobserved distributions
Separating mixed distributions is a long-standing challenge for machine
learning and signal processing. Most current methods either rely on making
strong assumptions about the source distributions or on having training
samples of each source in the mixture. In this work, we introduce a new
method, Neural Egg Separation, to tackle the scenario of extracting a signal
from an unobserved distribution additively mixed with a signal from an
observed distribution. Our method iteratively learns to separate the known
distribution from progressively finer estimates of the unknown distribution.
In some settings, Neural Egg Separation is sensitive to initialization; we
therefore introduce Latent Mixture Masking, which ensures a good
initialization. Extensive experiments on audio and image separation tasks
show that our method outperforms current methods that use the same level of
supervision, and often achieves performance similar to full supervision.
Comment: ICML'19
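A hedged sketch of the iterative scheme the abstract describes: synthetic mixtures are formed from observed-source samples plus the current estimates of the unobserved source, a separator is trained on these synthetic pairs, and re-applying it to the real mixtures refines the estimates. The names and the single-gradient-step inner loop are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def neural_egg_separation(separator, optimizer, mixtures, observed, iters=10):
    x_est = mixtures.clone()                       # crude initial estimate
    for _ in range(iters):
        # Synthetic supervised pairs: observed samples + current estimates.
        idx = torch.randperm(observed.size(0))[:x_est.size(0)]
        synth_mix = x_est.detach() + observed[idx]
        pred = separator(synth_mix)
        loss = F.mse_loss(pred, x_est.detach())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Refine the unobserved-source estimates on the real mixtures.
        with torch.no_grad():
            x_est = separator(mixtures)
    return x_est
```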
Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks
A method for statistical parametric speech synthesis incorporating generative
adversarial networks (GANs) is proposed. Although powerful deep neural
network (DNN) techniques can be applied to artificially synthesize speech
waveforms, the synthetic speech quality is low compared with that of natural
speech. One of the issues causing the quality degradation is an
over-smoothing effect often observed in the generated speech parameters. The
GAN introduced in this paper consists of two neural networks: a
discriminator to distinguish natural and generated samples, and a generator
to deceive the discriminator. In the proposed framework incorporating the
GANs, the discriminator is trained to distinguish natural and generated
speech parameters, while the acoustic models are trained to minimize the
weighted sum of the conventional minimum generation loss and an adversarial
loss for deceiving the discriminator. Since the objective of the GANs is to
minimize the divergence (i.e., distribution difference) between the natural
and generated speech parameters, the proposed method effectively alleviates
the over-smoothing effect in the generated speech parameters. We evaluated
the effectiveness for text-to-speech and voice conversion, and found that
the proposed method can generate more natural spectral parameters and F0
than the conventional minimum generation error training algorithm regardless
of its hyper-parameter settings. Furthermore, we investigated the effect of
the divergence of various GANs, and found that a Wasserstein GAN minimizing
the Earth-Mover's distance works best in terms of improving synthetic speech
quality.
Comment: Preprint manuscript of IEEE/ACM Transactions on Audio, Speech and
Language Processing
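A short sketch of the proposed training objective for the acoustic model: a weighted sum of the conventional minimum generation error (MGE) loss and an adversarial loss for deceiving the discriminator. The weight and the function names are illustrative; the discriminator is assumed to output the probability that its input parameters are natural:

```python
import torch
import torch.nn.functional as F

def acoustic_model_loss(generated, natural, discriminator, w_adv=1.0):
    # Conventional minimum generation error between parameter trajectories.
    mge = F.mse_loss(generated, natural)
    # Adversarial term: push the discriminator to call generated parameters
    # natural, i.e., deceive it.
    d_out = discriminator(generated)               # probability in (0, 1)
    adv = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    return mge + w_adv * adv
```

The discriminator itself is trained with the usual real/fake objective on natural versus generated speech parameters.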
Identity-Preserving Realistic Talking Face Generation
Speech-driven facial animation is useful for a variety of applications such
as telepresence and chatbots. The necessary attributes of a realistic face
animation are (1) audio-visual synchronization, (2) identity preservation of
the target individual, (3) plausible mouth movements, and (4) the presence
of natural eye blinks. Existing methods mostly address audio-visual lip
synchronization, and a few recent works have addressed the synthesis of
natural eye blinks for overall video realism. In this paper, we propose a
method for identity-preserving realistic facial animation from speech. We
first generate person-independent facial landmarks from audio using
DeepSpeech features, for invariance to different voices, accents, etc. To
add realism, we impose eye blinks on the facial landmarks using unsupervised
learning, and retarget the person-independent landmarks to person-specific
landmarks to preserve the identity-related facial structure, which helps in
generating plausible mouth shapes for the target identity. Finally, we use
an LSGAN to generate the facial texture from the person-specific facial
landmarks, with an attention mechanism that helps preserve identity-related
texture. An extensive comparison of our proposed method with current
state-of-the-art methods demonstrates significant improvement in terms of
lip synchronization accuracy, image reconstruction quality, sharpness, and
identity preservation. A user study also reveals improved realism of our
animation results over the state-of-the-art methods. To the best of our
knowledge, this is the first work in speech-driven 2D facial animation that
simultaneously addresses all the
above-mentioned attributes of a realistic speech-driven face animation.
Comment: Accepted at IJCNN 2020
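A hedged sketch of the landmark retargeting step: motion predicted on person-independent landmarks is transferred onto the target identity's neutral landmark shape, preserving identity-related facial structure. The simple additive transfer below is an illustration, not the paper's exact procedure:

```python
import numpy as np

def retarget(pred_landmarks, generic_neutral, target_neutral):
    """Transfer per-frame displacements from the person-independent
    prediction onto the target identity's neutral landmark shape."""
    motion = pred_landmarks - generic_neutral   # person-independent motion
    return target_neutral + motion              # person-specific landmarks
```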