230,521 research outputs found
Stress and Emotion Classification Using Jitter and Shimmer Features
In this paper, we evaluate the use of appended jitter and shimmer speech features for the classification of human speaking styles and of animal vocalization arousal levels. Jitter and shimmer features are extracted from the fundamental frequency contour and appended to baseline spectral features, specifically Mel-frequency cepstral coefficients (MFCCs) for human speech and Greenwood function cepstral coefficients (GFCCs) for animal vocalizations. Hidden Markov models (HMMs) with Gaussian mixture model (GMM) state distributions are used for classification. The appended jitter and shimmer features increase classification accuracy on several illustrative datasets, including the SUSAS dataset of human speaking styles as well as African elephant and Rhesus monkey vocalizations labeled by arousal level.
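As a rough sketch of the voice-quality features this abstract describes (not the authors' implementation), local jitter and shimmer can be computed from per-cycle pitch periods and peak amplitudes, assumed here to come from an external pitch tracker, and appended to a spectral feature vector:

```python
import numpy as np

def jitter_shimmer(periods, amplitudes):
    """Local jitter and shimmer from per-cycle measurements.

    periods    -- durations of consecutive glottal cycles (seconds)
    amplitudes -- peak amplitude of each cycle
    Both inputs are assumed to come from a pitch tracker (hypothetical here).
    """
    periods = np.asarray(periods, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    # Local jitter: mean absolute difference between consecutive
    # periods, normalized by the mean period.
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    # Local shimmer: the same measure applied to cycle amplitudes.
    shimmer = np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
    return jitter, shimmer

def append_voice_quality(spectral_frame, jitter, shimmer):
    # Append the two voice-quality features to a baseline
    # spectral feature vector (e.g. MFCCs or GFCCs).
    return np.concatenate([spectral_frame, [jitter, shimmer]])
```

A perfectly periodic, constant-amplitude signal would yield zero jitter and shimmer; deviations grow with cycle-to-cycle irregularity, which is what makes these features informative for stress and arousal.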
Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control
Different people have different facial expressions while speaking
emotionally. A realistic facial animation system should consider such
identity-specific speaking styles and facial idiosyncrasies to achieve a
high degree of naturalness and plausibility. Existing approaches to
personalized speech-driven 3D facial animation either use one-hot identity
labels or rely on person-specific models, which limits their scalability. We
present a personalized speech-driven expressive 3D facial animation synthesis
framework that models identity-specific facial motion as latent representations
(called styles) and synthesizes novel animations given a speech input with
the target style for various emotion categories. Our framework is trained in an
end-to-end fashion and has a non-autoregressive encoder-decoder architecture
with three main components: an expression encoder, a speech encoder, and an
expression decoder. Since expressive facial motion includes both
identity-specific style and speech-related content information, the expression
encoder first disentangles facial motion sequences into style and content
representations. Both the speech encoder and the expression decoder then take
the extracted style information as input to update transformer layer weights
during training. Our speech encoder also extracts phoneme label and duration
information to achieve better synchrony within the non-autoregressive synthesis
mechanism. Through detailed experiments, we demonstrate that our approach
produces temporally coherent facial expressions from input speech while
preserving the speaking styles of the target identities.
Comment: 8 page
Capture, Learning, and Synthesis of 3D Speaking Styles
Audio-driven 3D facial animation has been widely explored, but achieving
realistic, human-like performance is still unsolved. This is due to the lack of
available 3D datasets, models, and standard evaluation metrics. To address
this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans
captured at 60 fps and synchronized audio from 12 speakers. We then train a
neural network on our dataset that factors identity from facial motion. The
learned model, VOCA (Voice Operated Character Animation), takes any speech
signal as input - even speech in languages other than English - and
realistically animates a wide range of adult faces. Conditioning on subject
labels during training allows the model to learn a variety of realistic
speaking styles. VOCA also provides animator controls to alter speaking style,
identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball
rotations) during animation. To our knowledge, VOCA is the only realistic 3D
facial animation model that is readily applicable to unseen subjects without
retargeting. This makes VOCA suitable for tasks like in-game video, virtual
reality avatars, or any scenario in which the speaker, speech, or language is
not known in advance. We make the dataset and model available for research
purposes at http://voca.is.tue.mpg.de.
Comment: To appear in CVPR 201
Cross-Lingual Neural Network Speech Synthesis Based on Multiple Embeddings
The paper presents a novel architecture and method for speech synthesis in multiple languages, in the voices of multiple speakers, and in multiple speaking styles, even when speech from a particular speaker in the target language was not present in the training data. The method is based on applying neural network embeddings to combinations of speaker and style IDs, and also to phones in particular phonetic contexts, without any prior linguistic knowledge of their phonetic properties. This enables the network not only to efficiently capture similarities and differences between speakers and speaking styles, but also to establish appropriate relationships between phones belonging to different languages, and ultimately to produce synthetic speech in the voice of a certain speaker in a language that he/she has never spoken. The validity of the proposed approach has been confirmed through experiments with models trained on speech corpora of American English and Mexican Spanish. It has also been shown that the proposed approach supports the use of neural vocoders, i.e. that they are able to produce synthesized speech of good quality even in languages that they were not trained on.
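The combination of embeddings this abstract describes can be sketched as simple table lookups whose outputs are concatenated into one frontend vector; the table names and sizes below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned embedding tables; dimensions are illustrative only.
speaker_emb = rng.normal(size=(10, 16))   # 10 speakers
style_emb   = rng.normal(size=(4, 8))     # 4 speaking styles
phone_emb   = rng.normal(size=(60, 32))   # shared cross-lingual phone inventory

def frontend_vector(speaker_id, style_id, phone_in_context_id):
    # Concatenate the three embeddings into a single input vector.
    # Because the phone embedding space is shared across languages,
    # an unseen (speaker, language) pair can still be synthesized:
    # the speaker embedding supplies the voice, while cross-lingual
    # phone similarities are captured in the phone table.
    return np.concatenate([speaker_emb[speaker_id],
                           style_emb[style_id],
                           phone_emb[phone_in_context_id]])
```

In training, these tables would be learned jointly with the synthesis network rather than sampled randomly as here.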
Expressive Modulation of Neutral Visual Speech
The need for animated graphical models of the human face is commonplace in
the movies, video games and television industries, appearing in everything from
low budget advertisements and free mobile apps, to Hollywood blockbusters
costing hundreds of millions of dollars. Generative statistical models of
animation attempt to address some of the drawbacks of industry standard
practices such as labour intensity and creative inflexibility.
This work describes one such method for transforming speech animation curves
between different expressive styles. Beginning with the assumption that
expressive speech animation is a mix of two components, a high-frequency
speech component (the content) and a much lower-frequency expressive
component (the style), we use Independent Component Analysis (ICA) to
identify and manipulate these components independently of one another. Next
we learn how the energy for different speaking styles is distributed in terms of
the low-dimensional independent components model. Transforming the
speaking style involves projecting new animation curves into the lowdimensional
ICA space, redistributing the energy in the independent
components, and finally reconstructing the animation curves by inverting the
projection.
We show that a single ICA model can be used to separate multiple expressive
styles into their component parts. Subjective evaluations show that viewers can
reliably identify the expressive style generated using our approach, and that
they have difficulty distinguishing transformed animated expressive speech from
the equivalent ground truth.
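The project-rescale-reconstruct loop above can be sketched in a few lines, assuming a pre-learned ICA unmixing matrix `W` and per-style component gains; these names and the use of a pseudo-inverse are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def transform_style(curves, W, gains):
    """Project animation curves into ICA space, redistribute component
    energy, and reconstruct.

    curves -- (T, d) matrix of animation curves over T frames
    W      -- (k, d) unmixing matrix from a previously fitted ICA model
    gains  -- (k,) per-component energy scaling learned for the target style
    """
    sources = curves @ W.T        # project into the independent components
    sources = sources * gains     # redistribute energy across components
    A = np.linalg.pinv(W)         # (pseudo-)inverse of the projection
    return sources @ A.T          # reconstruct the animation curves
```

With unit gains the transformation is (up to the pseudo-inverse) an identity; style transfer comes entirely from how the gains reweight the low-frequency expressive components relative to the high-frequency speech content.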
Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data
While many recent any-to-any voice conversion models succeed in transferring
some target speech's style information to the converted speech, they still lack
the ability to faithfully reproduce the speaking style of the target speaker.
In this work, we propose a novel method to extract rich style information from
target utterances and to efficiently transfer it to source speech content
without requiring text transcriptions or speaker labeling. Our proposed
approach introduces an attention mechanism that utilizes a self-supervised
learning (SSL) model to collect the speaking styles of a target speaker, each
corresponding to different phonetic content. The styles are represented
as a set of embeddings called a stylebook. In the next step, the stylebook is
attended with the source speech's phonetic content to determine the final
target style for each source content. Finally, content information extracted
from the source speech and content-dependent target style embeddings are fed
into a diffusion-based decoder to generate the converted speech
mel-spectrogram. Experimental results show that our proposed method, combined
with a diffusion-based generative model, achieves better speaker similarity in
any-to-any voice conversion tasks than baseline models, while the increase in
computational complexity with longer utterances is suppressed.
Comment: 5 pages, 2 figures, 2 tables, submitted to ICASSP 202
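The "attend the stylebook with source content" step can be sketched as standard scaled dot-product attention, with source content embeddings as queries and stylebook entries as both keys and values; the shapes and function names here are assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def content_dependent_style(content, stylebook):
    """Attend source content over a target speaker's stylebook.

    content   -- (T, d) phonetic content embeddings of the source speech
    stylebook -- (K, d) style embeddings collected from the target speaker
    Returns a (T, d) content-dependent style embedding per source frame.
    """
    d = content.shape[-1]
    scores = content @ stylebook.T / np.sqrt(d)   # (T, K) content-style similarity
    weights = softmax(scores, axis=-1)            # attention over stylebook entries
    return weights @ stylebook                    # blended target style per frame
```

Each source frame thus receives a style vector weighted toward the stylebook entries whose phonetic content matches it, rather than one fixed speaker embedding for the whole utterance.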
MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling
In addition to conveying the linguistic content from source speech to
converted speech, maintaining the speaking style of source speech also plays an
important role in the voice conversion (VC) task, which is essential in many
scenarios with highly expressive source speech, such as dubbing and data
augmentation. Previous work generally took explicit prosodic features or
fixed-length style embedding extracted from source speech to model the speaking
style of source speech, which is insufficient to achieve comprehensive style
modeling and target speaker timbre preservation. Inspired by the multi-scale
nature of speaking style in human speech, a multi-scale style modeling method
for the VC task, referred to as MSM-VC, is proposed in this paper. MSM-VC
models the speaking style of source speech at different levels. To effectively
convey the speaking style while preventing timbre leakage from source speech to
converted speech, each level's style is modeled by a specific representation.
Specifically, prosodic features, pre-trained ASR model's bottleneck features,
and features extracted by a model trained with a self-supervised strategy are
adopted to model the frame, local, and global-level styles, respectively.
Besides, to balance the performance of source style modeling and target speaker
timbre preservation, an explicit constraint module consisting of a pre-trained
speech emotion recognition model and a speaker classifier is introduced to
MSM-VC. This explicit constraint module also makes it possible to simulate the
style transfer inference process during training, which improves
disentanglement and alleviates the mismatch between training and inference.
Experiments performed on a highly expressive speech corpus demonstrate that
MSM-VC is superior to state-of-the-art VC methods at modeling source speech
style while maintaining good speech quality and speaker similarity.
Comment: This work was submitted on April 10, 2022 and accepted on August 29,
202
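One simple way to realize the frame/local/global fusion that this abstract describes is to upsample everything to the frame rate and concatenate; this is a hedged sketch under that assumption, not the MSM-VC architecture itself:

```python
import numpy as np

def fuse_multiscale_styles(frame_style, local_style, global_style):
    """Combine frame-, local-, and global-level style features.

    frame_style  -- (T, d1) per-frame prosodic features
    local_style  -- (T, d2) ASR bottleneck features, assumed already
                    upsampled to the frame rate
    global_style -- (d3,)   utterance-level self-supervised embedding
    Returns a (T, d1 + d2 + d3) fused style sequence.
    """
    T = frame_style.shape[0]
    # Broadcast the single global vector across all T frames.
    g = np.broadcast_to(global_style, (T, global_style.shape[0]))
    return np.concatenate([frame_style, local_style, g], axis=-1)
```

A real system would more likely inject each level at a different point in the decoder; concatenation is just the most direct illustration of modeling style at three scales simultaneously.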