5 research outputs found

    Articulatory features for speech-driven head motion synthesis

    This study investigates the use of articulatory features for speech-driven head motion synthesis, as opposed to the prosodic features such as F0 and energy that have mainly been used in the literature. In the proposed approach, multi-stream HMMs are trained jointly on the synchronous streams of speech and head motion data. Articulatory features can be regarded as an intermediate parametrisation of speech that is expected to have a close link with head movement. Measured head and articulatory movements acquired by EMA were recorded synchronously with speech. Measured articulatory data were compared with those predicted from speech using an HMM-based inversion mapping system trained in a semi-supervised fashion. Canonical correlation analysis (CCA) on a data set of free speech from 12 people shows that the articulatory features are more correlated with head rotation than prosodic and/or cepstral speech features. It is also shown that head motion synthesised using articulatory features gave higher correlations with the original head motion than when only prosodic features are used. Index Terms: head motion synthesis, articulatory features, canonical correlation analysis, acoustic-to-articulatory mapping
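
As a rough illustration of the CCA comparison described in this abstract, the sketch below computes canonical correlations between a head-rotation stream and two candidate feature streams. It is a minimal sketch on synthetic data, not the authors' pipeline: the function name `canonical_correlations`, the feature dimensions, and the random data are all assumptions made for the example, and the computation assumes both matrices have full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two feature streams (rows are frames).

    Assumes X and Y have the same number of rows and full column rank.
    """
    # Centre each stream so that correlations are not biased by the means.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centred streams.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# Synthetic stand-ins (hypothetical): 3-D head rotation, a 4-D stream that is
# linearly tied to it, and a 2-D stream that is independent of it.
rng = np.random.default_rng(0)
n_frames = 2000
head_rotation = rng.standard_normal((n_frames, 3))
linked_stream = (head_rotation @ rng.standard_normal((3, 4))
                 + 0.1 * rng.standard_normal((n_frames, 4)))
unlinked_stream = rng.standard_normal((n_frames, 2))

rho_linked = canonical_correlations(linked_stream, head_rotation)
rho_unlinked = canonical_correlations(unlinked_stream, head_rotation)
print("linked stream:  ", rho_linked)
print("unlinked stream:", rho_unlinked)
```

Comparing the leading canonical correlation of each stream against head rotation is one simple way to rank feature sets; which summary statistic the study itself reports is not stated in the abstract.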

    A speaker adaptive DNN training approach for speaker-independent acoustic inversion

    We address the speaker-independent acoustic inversion (AI) problem, also referred to as acoustic-to-articulatory mapping. The scarce availability of multi-speaker articulatory data makes it difficult to learn a mapping that generalizes from a limited number of training speakers and reliably reconstructs the articulatory movements of unseen speakers. In this paper, we propose a Multi-task Learning (MTL)-based approach that explicitly separates the modeling of each training speaker's AI peculiarities from the modeling of AI characteristics that are shared by all speakers. Our approach stems from the well-known Regularized MTL approach and extends it to feed-forward deep neural networks (DNNs). Given multiple training speakers, we learn for each an acoustic-to-articulatory mapping represented by a DNN. Then, through an iterative procedure, we search for a canonical speaker-independent DNN that is "similar" to all speaker-dependent DNNs. The degree of similarity is controlled by a regularization parameter. We report experiments on the University of Wisconsin X-ray Microbeam Database under different training/testing experimental settings. The results obtained indicate that our MTL-trained canonical DNN largely outperforms a standardly trained (i.e., single task learning-based) speaker-independent DNN.
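
The iterative search for a canonical model that stays "similar" to every speaker-dependent model can be sketched with linear maps in place of DNNs. This is a deliberately simplified stand-in, not the paper's method: the function `fit_regularized_mtl`, the squared-distance penalty, the closed-form ridge updates, and the synthetic speakers are all assumptions made for illustration.

```python
import numpy as np

def fit_regularized_mtl(Xs, Ys, lam=1.0, n_iter=20):
    """Alternate between speaker-dependent maps W_s and a canonical map W0.

    Minimises sum_s ||X_s W_s - Y_s||^2 + lam * ||W_s - W0||^2 by block
    coordinate descent. Linear maps stand in for the DNNs of the abstract.
    """
    d_in, d_out = Xs[0].shape[1], Ys[0].shape[1]
    W0 = np.zeros((d_in, d_out))
    Ws = []
    for _ in range(n_iter):
        Ws = []
        for X, Y in zip(Xs, Ys):
            # Ridge-style update that pulls this speaker's map towards W0.
            A = X.T @ X + lam * np.eye(d_in)
            B = X.T @ Y + lam * W0
            Ws.append(np.linalg.solve(A, B))
        # With the speaker maps fixed, the best canonical map is their mean.
        W0 = np.mean(Ws, axis=0)
    return W0, Ws

# Synthetic speakers (hypothetical): a shared mapping plus small per-speaker offsets.
rng = np.random.default_rng(0)
W_shared = rng.standard_normal((5, 3))
Xs, Ys = [], []
for _ in range(4):
    X = rng.standard_normal((200, 5))
    W_speaker = W_shared + 0.1 * rng.standard_normal((5, 3))
    Xs.append(X)
    Ys.append(X @ W_speaker + 0.01 * rng.standard_normal((200, 3)))

W0, Ws = fit_regularized_mtl(Xs, Ys, lam=1.0)
print("max deviation of canonical map from shared map:",
      np.abs(W0 - W_shared).max())
```

In this sketch `lam` plays the role of the regularization parameter: larger values tie the speaker-dependent maps more tightly to the canonical one, while smaller values let each speaker's map fit its own data more freely.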

    Head Motion Analysis and Synthesis over Different Tasks

    It is known that subjects vary in their head movements. This paper presents an analysis of this variation over different tasks and speakers and of its impact on head motion synthesis. Measured head and articulatory movements, acquired by an ElectroMagnetic Articulograph (EMA) and recorded synchronously with audio, were used. A data set of speech from 12 people recorded on different tasks confirms that head motion varies over tasks and speakers. Experimental results confirmed that the proposed models were capable of learning and synthesising task-dependent head motions from speech. Subjective evaluation of head motion synthesised with task models shows that models trained on the matched task perform better than mismatched ones, and that models trained on free speech data predict motion that participants prefer over motion from models trained on read speech data.

    Head motion synthesis: evaluation and a template motion approach

    The use of conversational agents has increased across the world. From providing automated support for companies to being virtual psychologists, they have moved from an academic curiosity to an application with real-world relevance. While many researchers have focused on the content of the dialogue and on synthetic speech to give the agents a voice, more recently animating these characters has become a topic of interest. An additional use for character animation technology is in the film and video game industry, where having characters animated without needing to pay for expensive labour would save tremendous costs. When animating characters there are many aspects to consider, for example the way they walk. However, to truly assist with communication, automated animation needs to duplicate the body language used when speaking. In particular, conversational agents are often only an animation of the upper parts of the body, so head motion is one of the keys to a believable agent. While certain linguistic functions are obvious, such as nodding to indicate agreement, research has shown that head motion also aids understanding of speech. Additionally, head motion often contains emotional cues, prosodic information, and other paralinguistic information. In this thesis we present our research into synthesising head motion using only recorded speech as input. During this research we collected a large dataset of head motion synchronised with speech, examined evaluation methodology, and developed a synthesis system. Our dataset is one of the larger ones available. From it we present some statistics about head motion in general, including differences between read speech and storytelling speech, and differences between speakers. From this we are able to draw some conclusions as to what type of source data will be the most interesting in head motion research, and whether speaker-dependent models are needed for synthesis.
In our examination of head motion evaluation methodology we introduce Forced Canonical Correlation Analysis (FCCA). FCCA shows the difference between head-motion-shaped noise and motion capture better than the standard methods for objective evaluation used in the literature. We have shown that for subjective testing it is best practice to use a variation of MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) based testing, adapted for head motion. Through experimentation we have developed guidelines for the implementation of the test and for the constraints on its length. Finally, we present a new system for head motion synthesis. We make use of simple templates of motion, automatically extracted from source data, that are warped to suit the speech features. Our system uses clustering to pick the small motion units, and a combined HMM- and GMM-based approach for determining the values of the warping parameters at synthesis time. This results in highly natural-looking motion that outperforms other state-of-the-art systems. Our system requires minimal human intervention and produces believable motion. The key innovations were the new methods for segmenting head motion and the creation of a process similar to language modelling for synthesising head motion.