108 research outputs found

    Articulatory features for speech-driven head motion synthesis

    Get PDF
    This study investigates the use of articulatory features for speech-driven head motion synthesis as opposed to prosody features such as F0 and energy that have been mainly used in the literature. In the proposed approach, multi-stream HMMs are trained jointly on the synchronous streams of speech and head motion data. Articulatory features can be regarded as an intermediate parametrisation of speech that are expected to have a close link with head movement. Measured head and articulatory movements acquired by EMA were synchronously recorded with speech. Measured articulatory data was compared to those predicted from speech using an HMM-based inversion mapping system trained in a semi-supervised fashion. Canonical correlation analysis (CCA) on a data set of free speech of 12 people shows that the articulatory features are more correlated with head rotation than prosodic and/or cepstral speech features. It is also shown that the synthesised head motion using articulatory features gave higher correlations with the original head motion than when only prosodic features are used. Index Terms: head motion synthesis, articulatory features, canonical correlation analysis, acoustic-to-articulatory mappin

    The University of Edinburgh Head-Motion and Audio Storytelling (UoE-HAS) Dataset

    Get PDF
    Abstract. In this paper we announce the release of a large dataset of storytelling monologue with motion capture for the head and body. Initial tests on the dataset indicate that head motion is more dependant on the speaker than the style of speech

    Prediction of Head Motion from Speech Waveforms with a Canonical-Correlation-Constrained Autoencoder

    Get PDF
    This study investigates the direct use of speech waveforms to predict head motion for speech-driven head-motion synthesis, whereas the use of spectral features such as MFCC as basic input features together with additional features such as energy and F0 is common in the literature. We show that, rather than combining different features that originate from waveforms, it is more effective to use waveforms directly predicting corresponding head motion. The challenge with the waveform-based approach is that waveforms contain a large amount of information irrelevant to predict head motion, which hinders the training of neural networks. To overcome the problem, we propose a canonical-correlation-constrained autoencoder (CCCAE), where hidden layers are trained to not only minimise the error but also maximise the canonical correlation with head motion. Compared with an MFCC-based system, the proposed system shows comparable performance in objective evaluation, and better performance in subject evaluation.Comment: head motion synthesis, speech-driven animation, deep canonically correlated autoencode

    Speech-driven head motion generation from waveforms

    Get PDF
    Head motion generation task for speech-driven virtual agent animation is commonly explored with handcrafted audio features, such as MFCCs as input features, plus additional features, such as energy and F0 in the literature. In this paper, we study the direct use of speech waveform to generate head motion. We claim that creating a task-specific feature from waveform to generate head motion leads to better performance than using standard acoustic features to generate head motion overall. At the same time, we completely abandon the handcrafted feature extraction process, leading to more effectiveness. However, the difficulty of creating a task-specific feature from waveform is their staggering quantity of irrelevant information, implicating potential cumbrance for neural network training. Thus, we apply a canonical-correlation-constrained autoencoder (CCCAE), where we are able to compress the high-dimensional waveform into a low-dimensional embedded feature, with the minimal error in reconstruction, and sustain the relevant information with the maximal cannonical correlation to head motion. We extend our previous research by including more speakers in our dataset and also adapt with a recurrent neural network, to show the feasibility of our proposed feature. Through comparisons between different acoustic features, our proposed feature, WavCCCAE, shows at least a 20% improvement in the correlation from the waveform, and outperforms the popular acoustic feature, MFCC, by at least 5% respectively for all speakers. Through the comparison in the feedforward neural network regression (FNN-regression) system, the WavCCCAE-based system shows comparable performance in objective evaluation. In long short-term memory (LSTM) experiments, LSTM-models improve the overall performance in normalised mean square error (NMSE) and CCA metrics, and adapt the WavCCCAEfeature better, which makes the proposed LSTM-regression system outperform the MFCC-based system. We also re-design the subjective evaluation, and the subjective results show the animations generated by models where WavCCCAEwas chosen to be better than the other models by the participants of MUSHRA test

    Double-DCCCAE: Estimation of body gestures from speech waveform

    Get PDF

    Robust Pitch Detection by Narrow Band Spectrum Analysis

    Get PDF
    This paper proposes a new technique for detecting pitch patterns which is useful for automatic speech recognition, by using a narrow band spectrum analysis. The motivation of this approach is that humans perceive some kind of pitch in whispers where no fundamental frequencies can be observed, while most of the pitch determination algorithm (PDA) fails to detect such perceptual pitch. The narrow band spectrum analysis enable us to find pitch structure distributed locally in frequency domain. Incorporating this technique into PDA's is realized to applying the technique to the lag window based PDA. Experimental results show that pitch detection performance could be improved by 4% for voiced sounds and 8% for voiceless sounds

    Accent Phrase Segmentation by Finding N-Best Sequences of Pitch Pattern Templates

    Get PDF
    This paper describes a prosodic method for segmenting continuous speech into accent phrases. Optimum sequences are obtained on the basis of least squared error criterion by using dynamic time warping between F0 contours of input speech and reference accent patterns called 'pitch pattern templates'. But the optimum sequence does not always give good agreement with phrase boundaries labeled by hand, while the second or the third optimum candidate sequence does well. Therefore, we expand our system to be able to find out multiple candidates by using N-best algorithm. Evaluation tests were carried out using the ATR continuous speech database of 10 speakers. The results showed about 97% of phrase boundaries were correctly detected when we took 30-best candidates, and this accuracy is 7.5% higher than the conventional method without using N-best search algorithm
    corecore