Articulatory features for speech-driven head motion synthesis
This study investigates the use of articulatory features for speech-driven head motion synthesis, as opposed to the prosodic features, such as F0 and energy, that have mainly been used in the literature. In the proposed approach, multi-stream HMMs are trained jointly on synchronous streams of speech and head motion data. Articulatory features can be regarded as an intermediate parametrisation of speech that is expected to have a close link with head movement. Head and articulatory movements acquired by EMA were recorded synchronously with speech. The measured articulatory data were compared to articulatory trajectories predicted from speech using an HMM-based inversion mapping system trained in a semi-supervised fashion. Canonical correlation analysis (CCA) on a data set of free speech from 12 people shows that articulatory features are more strongly correlated with head rotation than prosodic and/or cepstral speech features. It is also shown that head motion synthesised using articulatory features correlates more highly with the original head motion than head motion synthesised from prosodic features alone. Index Terms: head motion synthesis, articulatory features, canonical correlation analysis, acoustic-to-articulatory mapping
A speaker adaptive DNN training approach for speaker-independent acoustic inversion
We address the speaker-independent acoustic inversion (AI) problem, also referred to as acoustic-to-articulatory mapping. The scarce availability of multi-speaker articulatory data makes it difficult to learn a mapping that generalizes from a limited number of training speakers and reliably reconstructs the articulatory movements of unseen speakers. In this paper, we propose a Multi-task Learning (MTL)-based approach that explicitly separates the modeling of each training speaker's AI peculiarities from the modeling of AI characteristics shared by all speakers. Our approach stems from the well-known Regularized MTL approach and extends it to feed-forward deep neural networks (DNNs). Given multiple training speakers, we learn for each an acoustic-to-articulatory mapping represented by a DNN. Then, through an iterative procedure, we search for a canonical speaker-independent DNN that is "similar" to all speaker-dependent DNNs. The degree of similarity is controlled by a regularization parameter. We report experiments on the University of Wisconsin X-ray Microbeam Database under different training/testing settings. The results indicate that our MTL-trained canonical DNN largely outperforms a standardly trained (i.e., single-task learning-based) speaker-independent DNN.
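The regularized-MTL idea can be sketched with linear models standing in for the paper's DNNs: each speaker's mapping w_s is pulled toward a shared canonical mapping w_bar by a penalty lam * ||w_s - w_bar||^2, and the canonical model is updated iteratively. The alternating update, the data, and all dimensions here are illustrative assumptions.

```python
# Toy regularized multi-task learning on synthetic speaker data.
import numpy as np

rng = np.random.default_rng(1)
d, n_speakers, n = 8, 4, 200
w_true = rng.standard_normal(d)                      # shared structure

tasks = []
for _ in range(n_speakers):
    X = rng.standard_normal((n, d))
    w_s = w_true + 0.1 * rng.standard_normal(d)      # per-speaker deviation
    y = X @ w_s + 0.05 * rng.standard_normal(n)
    tasks.append((X, y))

lam = 1.0
w_bar = np.zeros(d)
for _ in range(10):                                  # alternate speaker/canonical updates
    ws = []
    for X, y in tasks:
        # Closed-form minimiser of ||y - X w||^2 + lam * ||w - w_bar||^2
        A = X.T @ X + lam * np.eye(d)
        ws.append(np.linalg.solve(A, X.T @ y + lam * w_bar))
    w_bar = np.mean(ws, axis=0)                      # canonical model = task average

print(round(float(np.linalg.norm(w_bar - w_true)), 3))
```

The regularization parameter lam plays the role described in the abstract: larger values force the speaker-dependent models closer to the canonical one.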
Head Motion Analysis and Synthesis over Different Tasks
Abstract. It is known that subjects vary in their head movements. This paper presents an analysis of this variation across different tasks and speakers, and of its impact on head motion synthesis. Head and articulatory movements acquired by an ElectroMagnetic Articulograph (EMA), recorded synchronously with audio, were used. A data set of speech from 12 people recorded on different tasks confirms that head motion varies across tasks and speakers. Experimental results confirmed that the proposed models were capable of learning and synthesising task-dependent head motions from speech. A subjective evaluation of head motion synthesised using task-specific models shows that models trained on the matched task outperform mismatched ones, and that free-speech data yield models whose predicted motion participants prefer over models trained on read-speech data.
Head motion synthesis: evaluation and a template motion approach
The use of conversational agents has increased across the world. From providing automated
support for companies to being virtual psychologists they have moved from
an academic curiosity to an application with real world relevance. While many researchers
have focused on the content of the dialogue and synthetic speech to give the
agents a voice, more recently animating these characters has become a topic of interest.
An additional use for character animation technology is in the film and video game industry
where having characters animated without needing to pay for expensive labour
would save tremendous costs.
When animating characters there are many aspects to consider, for example the way
they walk. However, to truly assist with communication automated animation needs to
duplicate the body language used when speaking. In particular, conversational
agents are often animated only from the upper body, so head motion is one of
the keys to a believable agent. While certain linguistic uses are obvious, such as
nodding to indicate agreement, research has shown that head motion also aids the
understanding of speech. Additionally, head motion often carries emotional cues,
prosodic information, and other paralinguistic information.
In this thesis we will present our research into synthesising head motion using only
recorded speech as input. During this research we collected a large dataset of head
motion synchronised with speech, examined evaluation methodology, and developed a
synthesis system.
Our dataset is one of the larger ones available. From it we present some statistics
about head motion in general, including differences between read speech and
storytelling speech, and differences between speakers. From these we draw
conclusions about which types of source data will be the most interesting in head
motion research, and whether speaker-dependent models are needed for synthesis.
In our examination of head motion evaluation methodology we introduce Forced
Canonical Correlation Analysis (FCCA). FCCA distinguishes head-motion-shaped
noise from motion capture better than the standard objective evaluation methods
used in the literature. We have shown that for subjective testing it is best practice
to use a variation of MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA)
testing, adapted for head motion. Through experimentation we have developed
guidelines for the implementation of the test, and for constraints on its length.
Finally we present a new system for head motion synthesis. We make use of simple
templates of motion, automatically extracted from source data, that are warped to
suit the speech features. Our system uses clustering to pick the small motion units,
and a combined HMM- and GMM-based approach for determining the values of the
warping parameters at synthesis time. This results in highly natural-looking motion
that outperforms other state-of-the-art systems. Our system requires minimal human
intervention and produces believable motion. The key innovations were the new
methods for segmenting head motion, and a process similar to language modelling
for synthesising head motion.
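The template approach above can be sketched in miniature: cluster fixed-length head-motion segments into a small template inventory, then synthesise by concatenating templates warped in amplitude and duration. The cluster count, segment length, warping scheme, and synthetic corpus are all illustrative assumptions, not the thesis's actual design; in the thesis the warping parameters would come from a speech-driven HMM/GMM rather than being fixed by hand.

```python
# Toy template-based head motion synthesis on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
seg_len, n_segments, n_templates = 20, 300, 8

# Synthetic corpus of pitch-rotation segments (one value per frame).
corpus = np.cumsum(rng.standard_normal((n_segments, seg_len)), axis=1)

def kmeans(data, k, iters=20):
    """Plain k-means as a stand-in for the thesis's clustering step."""
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((data[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.stack([data[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return centers

templates = kmeans(corpus, n_templates)

def warp(template, amplitude, duration):
    """Scale a template in amplitude and resample it to a new duration."""
    t_old = np.linspace(0, 1, len(template))
    t_new = np.linspace(0, 1, duration)
    return amplitude * np.interp(t_new, t_old, template)

# Fixed warping parameters, purely for illustration.
motion = np.concatenate([warp(templates[3], 1.5, 30), warp(templates[5], 0.8, 25)])
print(motion.shape)  # (55,)
```

Concatenating warped templates in this way is what lets a small unit inventory cover a wide range of motion, analogously to how a language model composes a small vocabulary into varied sentences.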