2,058 research outputs found

    Speech-driven Animation with Meaningful Behaviors

    Full text link
    Conversational agents (CAs) play an important role in human-computer interaction. Creating believable movements for CAs is challenging, since the movements have to be meaningful and natural, reflecting the coupling between gestures and speech. Past studies have mainly relied on rule-based or data-driven approaches. Rule-based methods focus on creating meaningful behaviors that convey the underlying message, but their gestures cannot be easily synchronized with speech. Data-driven approaches, especially speech-driven models, can capture the relationship between speech and gestures, but they create behaviors that disregard the meaning of the message. This study proposes to bridge the gap between these two approaches, overcoming their limitations. The approach builds a dynamic Bayesian network (DBN) in which a discrete variable is added to condition the behaviors on an underlying constraint. The study implements and evaluates the approach with two constraints: discourse functions and prototypical behaviors. By constraining on discourse functions (e.g., questions), the model learns the characteristic behaviors associated with a given discourse class, learning the rules directly from the data. By constraining on prototypical behaviors (e.g., head nods), the approach can be embedded in a rule-based system as a behavior realizer, creating trajectories that are synchronized in time with speech. The study proposes a DBN structure and a training approach that (1) models the cause-effect relationship between the constraint and the gestures, (2) initializes the state configuration models, increasing the range of the generated behaviors, and (3) captures the differences in the behaviors across constraints by enforcing sparse transitions between shared and exclusive states per constraint. Objective and subjective evaluations demonstrate the benefits of the proposed approach over an unconstrained model. Comment: 13 pages, 12 figures, 5 tables
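
    As a minimal illustration of the shared/exclusive-state idea with sparse transitions described above, the sketch below builds per-constraint transition matrices in numpy whose support is restricted to a set of shared states plus states exclusive to each constraint. The state counts, constraint labels, and random parameters are illustrative assumptions, not the paper's actual DBN configuration.

```python
import numpy as np

# Illustrative state layout (not the paper's actual DBN configuration):
# states 0-3 are shared across constraints; each constraint additionally
# owns two exclusive states of its own.
N_SHARED, N_EXCL = 4, 2
CONSTRAINTS = ["question", "statement"]          # hypothetical discourse functions
N_STATES = N_SHARED + N_EXCL * len(CONSTRAINTS)

def exclusive_states(c_idx):
    start = N_SHARED + c_idx * N_EXCL
    return list(range(start, start + N_EXCL))

def constrained_transitions(c_idx, rng):
    """Random transition matrix whose support is limited to the shared states
    plus the states exclusive to constraint c_idx (zero, i.e. sparse, elsewhere)."""
    allowed = list(range(N_SHARED)) + exclusive_states(c_idx)
    A = np.zeros((N_STATES, N_STATES))
    for i in allowed:
        A[i, allowed] = rng.random(len(allowed))
        A[i] /= A[i].sum()                       # rows over the allowed block sum to 1
    return A

rng = np.random.default_rng(0)
transitions = {c: constrained_transitions(i, rng) for i, c in enumerate(CONSTRAINTS)}
```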

    Audio-to-Visual Speech Conversion using Deep Neural Networks

    Get PDF
    We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations and perform a thorough analysis of our results
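
    A minimal sketch of the sliding-window idea, assuming PyTorch: a feed-forward network maps a window of acoustic features to a window of visual features, and overlapping window predictions are averaged frame by frame to produce a smooth trajectory. The feature dimensions, window length, and layer sizes are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

AUDIO_DIM, VISUAL_DIM, WIN = 26, 30, 11          # placeholder feature sizes / window length

class SlidingWindowNet(nn.Module):
    """Maps a window of acoustic features to a window of visual features."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(WIN * AUDIO_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, WIN * VISUAL_DIM),
        )

    def forward(self, x):                        # x: (batch, WIN * AUDIO_DIM)
        return self.net(x).view(-1, WIN, VISUAL_DIM)

def predict_sequence(model, audio):              # audio: (T, AUDIO_DIM)
    """Slide over the utterance and average the overlapping window predictions."""
    T = audio.shape[0]
    out = torch.zeros(T, VISUAL_DIM)
    counts = torch.zeros(T, 1)
    for t in range(T - WIN + 1):
        window = audio[t:t + WIN].reshape(1, -1)
        pred = model(window)[0].detach()         # (WIN, VISUAL_DIM)
        out[t:t + WIN] += pred
        counts[t:t + WIN] += 1
    return out / counts.clamp(min=1)             # per-frame average

model = SlidingWindowNet()
animation = predict_sequence(model, torch.randn(100, AUDIO_DIM))   # (100, VISUAL_DIM)
```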

    Prosody-Based Adaptive Metaphoric Head and Arm Gestures Synthesis in Human Robot Interaction

    Get PDF
    In human-human interaction, communication can be established through three modalities: verbal, non-verbal (i.e., gestures), and/or para-verbal (i.e., prosody). The linguistic literature shows that para-verbal and non-verbal cues are naturally aligned and synchronized; however, the natural mechanism of this synchronization is still unexplored. The difficulty in coordinating prosody with metaphoric head-arm gestures concerns the conveyed meaning, the way gestures are performed with respect to prosodic characteristics, their relative temporal arrangement, and their coordinated organization in the phrasal structure of the utterance. In this research, we focus on the mechanism of mapping between head-arm gestures and speech prosodic characteristics in order to generate robot behavior that adapts to the interacting human's emotional state. Prosody patterns and the motion curves of head-arm gestures are aligned separately into parallel Hidden Markov Models (HMMs). The mapping between speech and head-arm gestures is based on Coupled Hidden Markov Models (CHMMs), which can be seen as a multi-stream collection of HMMs characterizing the segmented prosody and head-arm gesture data. An emotional-state-based audio-video database has been created for the validation of this study. The obtained results show the effectiveness of the proposed methodology.
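
    The sketch below illustrates the coupling idea with a toy two-chain coupled HMM in numpy: the joint transition distribution conditions each chain's next state on the previous states of both chains, and forward filtering is run over per-frame log-likelihoods of the prosody and gesture streams. State counts and parameters are random placeholders, not the trained models of this study.

```python
import numpy as np

# Toy two-chain coupled HMM: the next states of the prosody and gesture
# chains jointly depend on the previous states of BOTH chains.
Np, Ng = 3, 3                                    # hidden states per chain (placeholders)
rng = np.random.default_rng(1)

def norm(a, axis):
    return a / a.sum(axis=axis, keepdims=True)

pi = norm(rng.random((Np, Ng)), axis=None)             # joint initial distribution
A = norm(rng.random((Np, Ng, Np, Ng)), axis=(2, 3))    # P(p', g' | p, g)

def forward(loglik_p, loglik_g):
    """Joint forward filtering from per-frame log-likelihoods of each stream.
    loglik_p: (T, Np), loglik_g: (T, Ng); returns log alpha at the last frame."""
    alpha = np.log(pi) + loglik_p[0][:, None] + loglik_g[0][None, :]
    for t in range(1, len(loglik_p)):
        # sum over the previous joint state (p, g), then add current emissions
        trans = np.einsum('pg,pgqh->qh', np.exp(alpha - alpha.max()), A)
        alpha = alpha.max() + np.log(trans) + loglik_p[t][:, None] + loglik_g[t][None, :]
    return alpha

T = 20
alpha = forward(np.log(norm(rng.random((T, Np)), axis=-1)),
                np.log(norm(rng.random((T, Ng)), axis=-1)))
```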

    Footprints of information foragers: Behaviour semantics of visual exploration

    Get PDF
    Social navigation exploits the knowledge and experience of peer users of information resources. A wide variety of visual–spatial approaches become increasingly popular as a means to optimize information access as well as to foster and sustain a virtual community among geographically distributed users. An information landscape is among the most appealing design options of representing and communicating the essence of distributed information resources to users. A fundamental and challenging issue is how an information landscape can be designed such that it will not only preserve the essence of the underlying information structure, but also accommodate the diversity of individual users. The majority of research in social navigation has been focusing on how to extract useful information from what is in common between users' profiles, their interests and preferences. In this article, we explore the role of modelling sequential behaviour patterns of users in augmenting social navigation in thematic landscapes. In particular, we compare and analyse the trails of individual users in thematic spaces along with their cognitive ability measures. We are interested in whether such trails can provide useful guidance for social navigation if they are embedded in a visual–spatial environment. Furthermore, we are interested in whether such information can help users to learn from each other, for example, from the ones who have been successful in retrieving documents. In this article, we first describe how users' trails in sessions of an experimental study of visual information retrieval can be characterized by Hidden Markov Models. Trails of users with the most successful retrieval performance are used to estimate parameters of such models. Optimal virtual trails generated from the models are visualized and animated as if they were actual trails of individual users in order to highlight behavioural patterns that may foster social navigation. The findings of the research will provide direct input to the design of social navigation systems as well as to enrich theories of social navigation in a wider context. These findings will lead to the further development and consolidation of a tightly coupled paradigm of spatial, semantic and social navigation
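
    As an illustration of fitting an HMM to the trails of the most successful users and sampling an optimal virtual trail from it, the sketch below assumes hmmlearn (its CategoricalHMM handles discrete observations) and a hypothetical encoding of trail actions as integers; the action vocabulary and example trails are invented for the example.

```python
import numpy as np
from hmmlearn import hmm                 # assumes hmmlearn >= 0.2.7 (CategoricalHMM)

# Hypothetical encoding of trail actions as integers:
# 0 = issue query, 1 = zoom/pan, 2 = select node, 3 = open document.
successful_trails = [
    [0, 1, 2, 3, 2, 3],
    [0, 2, 2, 3, 1, 2, 3],
    [0, 1, 1, 2, 3],
]

X = np.concatenate(successful_trails).reshape(-1, 1)
lengths = [len(t) for t in successful_trails]

model = hmm.CategoricalHMM(n_components=3, n_iter=100, random_state=0)
model.fit(X, lengths)                    # Baum-Welch on the successful users' trails

# Sample a "virtual trail" from the fitted model, e.g. for animation in the landscape.
virtual_trail, _ = model.sample(8)
print(virtual_trail.ravel())
```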

    Fault diagnosis of rolling element bearing based on wavelet kernel principle component analysis-coupled hidden Markov model

    Get PDF
    Different description results are obtained when a hidden Markov model (HMM) is applied separately to two different channel signals from the same data collection point. Moreover, an incorrect fault diagnosis may be obtained because the fault feature information cannot be described comprehensively using only a single channel signal. In theory, two channel signals collected from the same data collection point contain much more fault information than a single channel signal does, but coupling may occur between the two channel signals. The coupled hidden Markov model (CHMM) is an extension of the HMM that can efficiently fuse the information of two channel signals from the same data collection point, so a more reliable diagnosis result can be obtained with the CHMM than with the HMM. Accordingly, a fault diagnosis method for rolling element bearings based on wavelet kernel principal component analysis (WKPCA) and the CHMM is proposed: first, WKPCA is used to extract the fault feature vectors, increasing the efficiency of the proposed method; then the CHMM is applied to the extracted fault feature vectors, and a satisfactory fault diagnosis result is obtained. The feasibility and advantages of the proposed method are verified through experiments.
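
    A rough sketch of the WKPCA feature-extraction step, assuming scikit-learn and a Morlet-style wavelet kernel supplied as a precomputed Gram matrix; the kernel form, dilation parameter, and stand-in feature data are assumptions and may differ from the kernel used in the paper.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def wavelet_kernel(X, Y, a=1.0):
    """Wavelet kernel K(x, y) = prod_i h((x_i - y_i) / a), with the
    Morlet-style mother wavelet h(u) = cos(1.75 u) * exp(-u^2 / 2)."""
    diff = (X[:, None, :] - Y[None, :, :]) / a           # (n, m, d)
    h = np.cos(1.75 * diff) * np.exp(-0.5 * diff ** 2)
    return h.prod(axis=-1)

rng = np.random.default_rng(0)
features = rng.standard_normal((200, 16))                # stand-in vibration features

K = wavelet_kernel(features, features)                   # precomputed Gram matrix
kpca = KernelPCA(n_components=8, kernel="precomputed")
fault_vectors = kpca.fit_transform(K)                    # inputs for the CHMM stage
print(fault_vectors.shape)                               # (200, 8)
```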

    Making Faces - State-Space Models Applied to Multi-Modal Signal Processing

    Get PDF

    Analyzing Input and Output Representations for Speech-Driven Gesture Generation

    Full text link
    This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences. We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code is available at https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
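
    The sketch below, assuming PyTorch, mirrors the two-step structure with simplified feed-forward stand-ins for MotionE, MotionD, and SpeechE, and shows how the speech encoder and motion decoder are chained at test time. Layer sizes and feature dimensions are placeholders; the published code in the linked repository differs in architecture and training details.

```python
import torch
import torch.nn as nn

POSE_DIM, SPEECH_DIM, LATENT = 45, 26, 32        # placeholder dimensionalities

def mlp(d_in, d_out, hidden=128):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

class MotionE(nn.Module):
    """Motion encoder: pose frame -> low-dimensional representation."""
    def __init__(self):
        super().__init__()
        self.net = mlp(POSE_DIM, LATENT)
    def forward(self, pose):
        return self.net(pose)

class MotionD(nn.Module):
    """Motion decoder: representation -> pose frame."""
    def __init__(self):
        super().__init__()
        self.net = mlp(LATENT, POSE_DIM)
    def forward(self, z):
        return self.net(z)

class SpeechE(nn.Module):
    """Speech encoder: speech features -> motion representation."""
    def __init__(self):
        super().__init__()
        self.net = mlp(SPEECH_DIM, LATENT)
    def forward(self, speech):
        return self.net(speech)

# Step 1 (not shown): train MotionE + MotionD as a denoising autoencoder on poses.
# Step 2 (not shown): train SpeechE to predict MotionE's representations from speech.
# Test time: chain the speech encoder and the motion decoder.
speech_e, motion_d = SpeechE(), MotionD()
mfcc_frames = torch.randn(100, SPEECH_DIM)        # stand-in MFCC frame sequence
with torch.no_grad():
    poses = motion_d(speech_e(mfcc_frames))       # (100, POSE_DIM) gesture sequence
```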

    Multichannel dynamic modeling of non-Gaussian mixtures

    Full text link
    This paper presents a novel method that combines coupled hidden Markov models (HMM) and non-Gaussian mixture models based on independent component analyzer mixture models (ICAMM). The proposed method models the joint behavior of a number of synchronized sequential independent component analyzer mixture models (SICAMM); thus we have named it generalized SICAMM (G-SICAMM). The generalization allows for flexible estimation of complex data densities, subspace classification, blind source separation, and accurate modeling of both local and global dynamic interactions. In this work, the structured result obtained by G-SICAMM was used in two ways: classification and interpretation. Classification performance was tested on an extensive number of simulations and a set of real electroencephalograms (EEG) from epileptic patients performing neuropsychological tests. G-SICAMM outperformed the following competitive methods: Gaussian mixture models, HMM, coupled HMM, ICAMM, SICAMM, and a long short-term memory (LSTM) recurrent neural network. As for interpretation, the structured result returned by G-SICAMM on EEGs was mapped back onto the scalp, providing a set of brain activations. These activations were consistent with the physiological areas activated during the tests, thus proving the ability of the method to deal with different kinds of data densities and changing non-stationary and non-linear brain dynamics. (C) 2019 Elsevier Ltd. All rights reserved. This work was supported by the Spanish Administration (Ministerio de Economia y Competitividad) and the European Union (FEDER) under grants TEC2014-58438-R and TEC2017-84743-P. Safont Armero, G.; Salazar Afanador, A.; Vergara Domínguez, L.; Gomez, E.; Villanueva, V. (2019). Multichannel dynamic modeling of non-Gaussian mixtures. Pattern Recognition. 93:312-323. https://doi.org/10.1016/j.patcog.2019.04.022
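
    As a greatly simplified stand-in for the sequential, multichannel classification side of such a method, the sketch below (assuming scikit-learn and numpy) replaces the ICA mixture models with per-class Gaussian mixtures on each channel, fuses the per-channel log-likelihoods, and smooths the class posterior over time with a sticky transition matrix; all data, sizes, and parameters are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Greatly simplified stand-in: Gaussian mixtures replace the ICA mixture models,
# per-channel log-likelihoods are fused, and the class posterior is smoothed
# over time with a sticky class-transition matrix (forward filtering).
rng = np.random.default_rng(0)
n_classes, n_channels, dim, T = 2, 2, 4, 50

models = [[GaussianMixture(n_components=2, random_state=0)
           .fit(rng.standard_normal((100, dim)) + c)
           for _ in range(n_channels)]
          for c in range(n_classes)]               # one mixture per (class, channel)

A = np.array([[0.95, 0.05],
              [0.05, 0.95]])                       # sticky class transitions
log_post = np.full(n_classes, -np.log(n_classes))  # uniform prior over classes

X = [rng.standard_normal((T, dim)) for _ in range(n_channels)]   # test channels
decoded = []
for t in range(T):
    loglik = np.array([sum(models[c][ch].score_samples(X[ch][t:t + 1])[0]
                           for ch in range(n_channels))
                       for c in range(n_classes)])
    pred = np.logaddexp.reduce(log_post[:, None] + np.log(A), axis=0)
    log_post = pred + loglik
    log_post -= np.logaddexp.reduce(log_post)      # renormalize in log domain
    decoded.append(int(np.argmax(log_post)))
print(decoded[:10])
```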
    • …