Listening while Speaking: Speech Chain by Deep Learning
Despite the close relationship between speech perception and production,
research in automatic speech recognition (ASR) and text-to-speech synthesis
(TTS) has progressed largely independently, with little mutual influence
between the two fields. In human communication, on the other hand, a
closed-loop speech chain mechanism with auditory feedback from the speaker's
mouth to her ear is crucial. In this paper, we take a step further and develop
a closed-loop speech chain model based on deep learning. The
sequence-to-sequence model in a closed-loop architecture allows us to train our
model on the concatenation of both labeled and unlabeled data. While ASR
transcribes the unlabeled speech features, TTS attempts to reconstruct the
original speech waveform based on the text from ASR. In the opposite direction,
ASR also attempts to reconstruct the original text transcription given the
synthesized speech. To the best of our knowledge, this is the first deep
learning model that integrates human speech perception and production
behaviors. Our experimental results show that the proposed approach
significantly improves performance over separate systems trained only on
labeled data.
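The closed-loop idea can be sketched with toy linear maps standing in for the paper's sequence-to-sequence ASR and TTS networks (the dimensions, synthetic data, and learning rate here are all illustrative, not the paper's setup): supervised steps use labeled (speech, text) pairs, while the chain step pushes TTS(ASR(x)) back toward the unlabeled speech x.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4                                 # toy speech-feature and text dims

true_tts = rng.normal(size=(d, k))          # ground-truth "production" map
texts = rng.normal(size=(20, k))            # labeled text
speech = texts @ true_tts.T                 # matching labeled speech
unlabeled = rng.normal(size=(100, k)) @ true_tts.T  # speech without transcripts

W_asr = rng.normal(size=(k, d)) * 0.1       # ASR: speech -> text
W_tts = rng.normal(size=(d, k)) * 0.1       # TTS: text -> speech

def chain_loss(X):
    """Reconstruction error of the speech chain TTS(ASR(x)) vs. x."""
    recon = (X @ W_asr.T) @ W_tts.T
    return float(np.mean((recon - X) ** 2))

lr = 1e-3
loss_before = chain_loss(unlabeled)
for _ in range(500):
    # Supervised updates on labeled (speech, text) pairs
    err_a = speech @ W_asr.T - texts
    W_asr -= lr * err_a.T @ speech / len(speech)
    err_t = texts @ W_tts.T - speech
    W_tts -= lr * err_t.T @ texts / len(texts)
    # Closed-loop update on unlabeled speech: ASR transcribes, TTS reconstructs
    y_hat = unlabeled @ W_asr.T
    err_c = y_hat @ W_tts.T - unlabeled
    W_tts -= lr * err_c.T @ y_hat / len(unlabeled)
    W_asr -= lr * (err_c @ W_tts).T @ unlabeled / len(unlabeled)
loss_after = chain_loss(unlabeled)
```

In the full model the analogous chain loss is backpropagated through both sequence-to-sequence networks; the linear version merely makes the loop explicit.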
Incremental Machine Speech Chain Towards Enabling Listening while Speaking in Real-time
Inspired by a human speech chain mechanism, a machine speech chain framework
based on deep learning was recently proposed for the semi-supervised
development of automatic speech recognition (ASR) and text-to-speech synthesis
(TTS) systems. However, the listening-while-speaking mechanism can operate only
after receiving entire input sequences. Thus, there is a significant delay when
encountering long utterances. By contrast, humans can listen to what they speak
in real time, and if their hearing is delayed, they are unable to
continue speaking. In this work, we propose an incremental machine speech chain
towards enabling a machine to listen while speaking in real time. Specifically,
we construct incremental ASR (ISR) and incremental TTS (ITTS) by letting both
systems improve together through a short-term loop. Our experimental results
reveal that our proposed framework is able to reduce delays due to long
utterances while keeping performance comparable to the basic non-incremental
machine speech chain. Comment: Accepted in INTERSPEECH 202
Neural Speech Synthesis with Transformer Network
Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2
achieve state-of-the-art performance, they still suffer from two problems: 1)
low efficiency during training and inference; 2) difficulty modeling long-range
dependencies with current recurrent neural networks (RNNs). Inspired by the
success of Transformer network in neural machine translation (NMT), in this
paper, we introduce and adapt the multi-head attention mechanism to replace the
RNN structures and also the original attention mechanism in Tacotron2. With the
help of multi-head self-attention, the hidden states in the encoder and decoder
are constructed in parallel, which improves the training efficiency. Meanwhile,
any two inputs at different time steps are connected directly by the
self-attention mechanism, which solves the long-range dependency problem effectively. Using
phoneme sequences as input, our Transformer TTS network generates mel
spectrograms, followed by a WaveNet vocoder to output the final audio results.
Experiments are conducted to test the efficiency and performance of our new
network. In terms of efficiency, our Transformer TTS network speeds up training
by about 4.25 times compared with Tacotron2. In terms of performance,
rigorous human tests show that our proposed model achieves state-of-the-art
performance (outperforms Tacotron2 with a gap of 0.048) and is very close to
human quality (4.39 vs. 4.44 in MOS).
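A NumPy sketch of the multi-head self-attention that replaces the RNN structures (shapes, head count, and initialization are illustrative, not the paper's configuration). The (T, T) score matrix connects every pair of time steps directly, and all positions are processed in parallel rather than sequentially:

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Scaled dot-product attention over all positions at once."""
    T, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # (T, d) each
    heads = []
    for h in range(n_heads):
        q = Q[:, h * dh:(h + 1) * dh]
        k = K[:, h * dh:(h + 1) * dh]
        v = V[:, h * dh:(h + 1) * dh]
        scores = q @ k.T / np.sqrt(dh)           # (T, T): every pair connected
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)       # row-wise softmax
        heads.append(w @ v)
    return np.concatenate(heads, axis=-1) @ Wo   # (T, d)

rng = np.random.default_rng(0)
T, d, H = 6, 16, 4
X = rng.normal(size=(T, d))
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4)]
Y = multi_head_self_attention(X, *Ws, n_heads=H)
```

Because each attention row spans all T positions, no information has to propagate step by step as in an RNN, which is what removes the long-range dependency bottleneck.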
Face Video Generation from a Single Image and Landmarks
In this paper we are concerned with the challenging problem of producing a
full image sequence of a deformable face given only an image and generic facial
motions encoded by a set of sparse landmarks. To this end we build upon recent
breakthroughs in image-to-image translation such as pix2pix, CycleGAN and
StarGAN, which train Deep Convolutional Neural Networks (DCNNs) to map
aligned pairs of images between different domains (i.e., having different
labels), and propose a new architecture that is driven not by labels but by
spatial maps, namely facial landmarks. In particular, we propose MotionGAN,
which transforms an input face image into a new one according to a heatmap of
target landmarks. We show that it is possible to create very realistic face
videos using a single image and a set of target landmarks. Furthermore, our
method can be used to edit a facial image with arbitrary motions according to
landmarks (e.g., expression, speech, etc.). This provides much more flexibility
to face editing, expression transfer, facial video creation, etc. than models
based on discrete expressions, audio, or action units.
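A small sketch of how sparse landmarks might be rendered into the kind of spatial heatmap such a model conditions on; the Gaussian rendering and the `sigma` value are common practice assumed here, not details taken from the paper:

```python
import numpy as np

def landmark_heatmaps(landmarks, size, sigma=2.0):
    """Render one Gaussian heatmap channel per (x, y) landmark."""
    ys, xs = np.mgrid[0:size, 0:size]            # pixel coordinate grids
    maps = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            for x, y in landmarks]
    return np.stack(maps)                        # (num_landmarks, size, size)

# Two hypothetical landmarks on a 64x64 grid
hm = landmark_heatmaps([(16, 16), (40, 24)], size=64)
```

Each channel peaks at its landmark location, giving the generator a dense spatial target instead of a discrete label.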
Optimization under Uncertainty in the Era of Big Data and Deep Learning: When Machine Learning Meets Mathematical Programming
This paper reviews recent advances in the field of optimization under
uncertainty via a modern data lens, highlights key research challenges and
promise of data-driven optimization that organically integrates machine
learning and mathematical programming for decision-making under uncertainty,
and identifies potential research opportunities. A brief review of classical
mathematical programming techniques for hedging against uncertainty is first
presented, along with their wide spectrum of applications in Process Systems
Engineering. A comprehensive review and classification of the relevant
publications on data-driven distributionally robust optimization, data-driven
chance-constrained programming, data-driven robust optimization, and data-driven
scenario-based optimization is then presented. This paper also identifies
fertile avenues for future research that focuses on a closed-loop data-driven
optimization framework, which allows the feedback from mathematical programming
to machine learning, as well as scenario-based optimization leveraging the
power of deep learning techniques. Perspectives on online learning-based
data-driven multistage optimization with a learning-while-optimizing scheme are
also presented.
Aerospace Medicine and Biology. A continuing bibliography with indexes
This bibliography lists 244 reports, articles, and other documents introduced into the NASA scientific and technical information system in February 1981. Aerospace medicine and aerobiology topics are included, as are listings for physiological factors, astronaut performance, control theory, artificial intelligence, and cybernetics.
Auto-conditioned Recurrent Mixture Density Networks for Learning Generalizable Robot Skills
Personal robots assisting humans must perform complex manipulation tasks that
are typically difficult to specify in traditional motion planning pipelines,
where multiple objectives must be met and the high-level context be taken into
consideration. Learning from demonstration (LfD) provides a promising way to
learn such complex manipulation skills even from non-technical users.
However, it is challenging for existing LfD methods to efficiently learn skills
that can generalize to task specifications that are not covered by
demonstrations. In this paper, we introduce a state transition model (STM) that
generates joint-space trajectories by imitating motions from expert behavior.
Given a few demonstrations, we show in real robot experiments that the learned
STM can quickly generalize to unseen tasks and synthesize motions having longer
time horizons than the expert trajectories. Compared to conventional motion
planners, our approach enables the robot to accomplish complex behaviors from
high-level instructions without laborious hand-engineering of planning
objectives, while being able to adapt to changing goals during the skill
execution. In conjunction with a trajectory optimizer, our STM can construct a
high-quality skeleton of a trajectory that can be further improved in
smoothness and precision. In combination with a learned inverse dynamics model,
we additionally present results where the STM is used as a high-level planner.
A video of our experiments is available at https://youtu.be/85DX9Ojq-90. Comment: Submitted to IROS 201
Analysis-by-synthesis by learning to invert generative black boxes
For learning meaningful representations of data, a rich source of prior knowledge may come in the form of a generative black box, e.g. a graphics program that generates realistic facial images. We consider the problem of learning the inverse of a given generative model from data. The problem is non-trivial because it is difficult to create labelled training cases by hand, and the generative mapping is a black box in the sense that there is no analytic expression for its gradient. We describe a way of training a feedforward neural network that starts with just one labelled training example and uses the generative black box to “breed” more training data. As learning proceeds, the training set evolves and the labels that the network assigns to unlabelled training data converge to their correct values. We demonstrate our approach by learning to invert a generative model of eyes and an active appearance model of faces.
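A deliberately tiny caricature of the breeding loop, assuming a 1-D black box `z**3 + z` and a polynomial fit standing in for the feedforward network (none of this is the paper's actual model): starting from a single labelled example, perturbed codes are pushed through the black box, which mints exactly-labelled (output, code) pairs for free because the box itself supplies the label.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(z):
    """Stand-in generative model; we may only call it, never differentiate it."""
    return z ** 3 + z

codes = np.array([0.0])              # one hand-labelled seed code

for _ in range(40):
    # "Breed" new cases: perturb known codes; rendering them through the
    # black box yields perfectly labelled training pairs.
    bred = np.clip(codes + rng.normal(scale=0.3, size=len(codes)), -2, 2)
    codes = np.concatenate([codes, bred])
    if len(codes) > 400:             # keep the training set bounded
        codes = rng.choice(codes, size=400, replace=False)

outputs = black_box(codes)
# A degree-5 polynomial plays the role of the feedforward inverse network.
coeffs = np.polyfit(outputs, codes, deg=5)
residual = np.mean(np.abs(np.polyval(coeffs, outputs) - codes))
```

The paper's version additionally lets the network's own labels on unlabelled data evolve toward correctness; this sketch keeps only the breeding mechanism.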
DiffWave: A Versatile Diffusion Model for Audio Synthesis
In this work, we propose DiffWave, a versatile diffusion probabilistic model
for conditional and unconditional waveform generation. The model is
non-autoregressive, and converts a white-noise signal into a structured
waveform through a Markov chain with a constant number of steps at synthesis
time. It is efficiently trained by optimizing a variant of the variational
bound on the data likelihood. DiffWave produces high-fidelity audio in various waveform
generation tasks, including neural vocoding conditioned on mel spectrogram,
class-conditional generation, and unconditional generation. We demonstrate that
DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44
versus 4.43), while synthesizing orders of magnitude faster. In particular, it
significantly outperforms autoregressive and GAN-based waveform models in the
challenging unconditional generation task in terms of audio quality and sample
diversity in various automatic and human evaluations. Comment: ICLR 2021 (oral)
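The constant-step Markov chain can be sketched as follows, with an oracle noise predictor (computed from the known clean signal) standing in for the trained network; the schedule, step count, and toy signal are illustrative only, and the point is the mechanics of the forward and reverse chains:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.2, T)        # toy variance schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = np.sin(np.linspace(0, 8 * np.pi, 256))   # stand-in "waveform"

def q_sample(x0, t):
    """Forward process: jump straight to noising step t in closed form."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def eps_hat(xt, t):
    """Oracle noise predictor; a trained network would replace this."""
    return (xt - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])

# Reverse Markov chain: a fixed number of denoising steps back to a waveform
xt = q_sample(x0, T - 1)
for t in range(T - 1, -1, -1):
    xt = (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat(xt, t)) \
         / np.sqrt(alphas[t])
recon = xt
```

With the oracle predictor the deterministic reverse chain recovers the clean signal exactly; training replaces the oracle with a network that predicts the noise from the noisy input and the step index.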
Maybe Deep Neural Networks are the Best Choice for Modeling Source Code
Statistical language modeling techniques have successfully been applied to
source code, yielding a variety of new software development tools, such as
tools for code suggestion and improving readability. A major issue with these
techniques is that code introduces new vocabulary at a far higher rate than
natural language, as new identifier names proliferate. But traditional language
models limit the vocabulary to a fixed set of common words. For code, this
strong assumption has been shown to have a significant negative effect on
predictive performance. However, an open-vocabulary version of neural network
language models for code has not yet been introduced in the literature. We present
a new open-vocabulary neural language model for code that is not limited to a
fixed vocabulary of identifier names. We employ a segmentation into subword
units, subsequences of tokens chosen based on a compression criterion,
following previous work in machine translation. Our network achieves best in
class performance, outperforming even the state-of-the-art methods of
Hellendoorn and Devanbu that are designed specifically to model code.
Furthermore, we present a simple method for dynamically adapting the model to a
new test project, resulting in increased performance. We showcase our
methodology on code corpora in three different languages of over a billion
tokens each, hundreds of times larger than in previous work. To our knowledge,
this is the largest neural language model for code that has been reported.
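The compression-criterion segmentation is in the spirit of byte-pair encoding; a self-contained sketch on an identifier-like toy corpus (the corpus and merge count are illustrative, not the paper's data):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Greedy byte-pair encoding: repeatedly merge the most frequent
    adjacent symbol pair, yielding subword units shared across identifiers."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])   # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Identifier-like corpus: shared fragments such as "et" emerge as merge rules
corpus = ["getName", "setName", "getValue", "setValue", "getName"]
merges = learn_bpe(corpus, 8)
```

Because units are learned from frequency rather than fixed in advance, a previously unseen identifier can still be segmented into known subwords, which is what keeps the vocabulary open.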