2,529 research outputs found
Audio-to-Visual Speech Conversion using Deep Neural Networks
We study the problem of mapping from acoustic to visual speech, with the goal of automatically generating accurate, perceptually natural speech animation from an audio speech signal. We present a sliding-window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations, and perform a thorough analysis of our results.
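The overlap-averaging step the abstract describes can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, window layout, and hop parameter are assumptions made for the example.

```python
import numpy as np

def overlap_average(window_preds, hop):
    """Average overlapping per-window predictions into one continuous
    trajectory (hypothetical sketch of the smoothing step).

    window_preds: shape (n_windows, win_len, n_features), one visual-feature
                  window predicted per acoustic input window.
    hop: frame shift between consecutive windows.
    """
    n_windows, win_len, n_feat = window_preds.shape
    total = (n_windows - 1) * hop + win_len
    acc = np.zeros((total, n_feat))     # running sum of predictions
    counts = np.zeros((total, 1))       # how many windows cover each frame
    for i, win in enumerate(window_preds):
        start = i * hop
        acc[start:start + win_len] += win
        counts[start:start + win_len] += 1
    return acc / counts                 # per-frame average

# toy usage: 3 windows of length 4, hop 2, a single visual feature
preds = np.ones((3, 4, 1))
traj = overlap_average(preds, hop=2)    # shape (8, 1)
```

Because interior frames are covered by several windows, dividing by the per-frame count yields a smoothly varying trajectory rather than hard window boundaries.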
Parallel and Limited Data Voice Conversion Using Stochastic Variational Deep Kernel Learning
Voice conversion is typically treated as an engineering problem with
limited training data. Deep learning approaches, although extensively
researched in recent years, rely on massive amounts of data, which hinders
their practical applicability. Statistical methods, on the other hand, are
effective with limited data but struggle to model complex mapping
functions. This paper proposes a voice conversion method that works with
limited data, based on stochastic variational deep kernel learning (SVDKL).
SVDKL combines the expressive capability of deep neural networks with the
high flexibility of the Gaussian process as a Bayesian, non-parametric
method. Combining a conventional kernel with a deep neural network makes it
possible to estimate non-smooth, more complex functions. Furthermore, the
model's sparse variational Gaussian process solves the scalability problem
and, unlike the exact Gaussian process, allows a global mapping function to
be learned over the entire acoustic space. A key aspect of the proposed
scheme is that the model parameters are trained by marginal likelihood
optimization, which accounts for both data fit and model complexity.
Accounting for model complexity increases resistance to overfitting and so
reduces the amount of training data required. To evaluate the proposed
scheme, we examined the model's performance with approximately 80 seconds
of training data. The results indicate that our method achieves a higher
mean opinion score, smaller spectral distortion, and better preference test
results than the compared methods.
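The deep-kernel idea the abstract relies on, warping inputs with a neural network before applying a conventional kernel, can be sketched as below. This is a toy NumPy illustration with made-up random weights; the paper's actual SVDKL model additionally uses a sparse variational Gaussian process and marginal-likelihood training, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer feature extractor g(x): the deep network maps
# inputs into a learned feature space before the kernel is evaluated.
W1, b1 = rng.standard_normal((8, 2)), np.zeros(8)
W2, b2 = rng.standard_normal((3, 8)), np.zeros(3)

def features(x):
    h = np.tanh(x @ W1.T + b1)
    return h @ W2.T + b2

def deep_rbf_kernel(X1, X2, lengthscale=1.0):
    """Deep kernel: a conventional RBF kernel applied to learned features,
    k(x, x') = exp(-||g(x) - g(x')||^2 / (2 * lengthscale^2))."""
    F1, F2 = features(X1), features(X2)
    d2 = ((F1[:, None, :] - F2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

X = rng.standard_normal((5, 2))
K = deep_rbf_kernel(X, X)   # 5x5 symmetric kernel matrix, ones on the diagonal
```

Because the RBF kernel acts on `g(x)` rather than on `x` directly, the composite kernel can capture non-smooth, more complex similarity structure than a stationary kernel on the raw inputs.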