
    Analyzing Input and Output Representations for Speech-Driven Gesture Generation

    This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences. We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning. Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code is available at https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
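
    As a rough illustration of the two-step scheme described in this abstract, the sketch below wires up a motion autoencoder (MotionE/MotionD) and a speech-to-representation encoder (SpeechE) in PyTorch. Only the three network names come from the abstract; the layer sizes, pose and speech-feature dimensionalities, and training notes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the pipeline described above (not the authors' code).
# Dimensions and architectures are illustrative assumptions.
import torch
import torch.nn as nn

class MotionE(nn.Module):
    """Motion encoder: one 3D-pose frame -> low-dimensional representation."""
    def __init__(self, pose_dim=45, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pose_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, x):
        return self.net(x)

class MotionD(nn.Module):
    """Motion decoder: representation -> reconstructed 3D-pose frame."""
    def __init__(self, pose_dim=45, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, pose_dim))
    def forward(self, z):
        return self.net(z)

class SpeechE(nn.Module):
    """Speech encoder: one speech-feature frame (e.g. MFCCs) -> motion representation."""
    def __init__(self, speech_dim=26, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(speech_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, s):
        return self.net(s)

# Step 1: train MotionE + MotionD as a denoising autoencoder on (noisy) poses.
# Step 2: train SpeechE to predict the frozen MotionE representations from speech.
# Test time: predicted_gesture = MotionD(SpeechE(speech_features))
```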

    Temporal contextual descriptors and applications to emotion analysis.

    Current trends in technology suggest that the next generation of services and devices will allow smarter customization and automatic context recognition. Computers learn the behavior of their users and can offer them customized services depending on context, location, and preferences. One of the most important challenges in human-machine interaction is the proper understanding of human emotions by machines and automated systems. In recent years, progress in machine learning and pattern recognition has led to the development of algorithms that are able to learn the detection and identification of human emotions from experience. These algorithms use different modalities such as image, speech, and physiological signals to analyze and learn human emotions. In many settings, vocal information might be more available than other modalities due to the widespread presence of voice sensors in phones, cars, and computer systems in general. In emotion analysis from speech, an audio utterance is represented by an ordered (in time) sequence of features, i.e., a multivariate time series. Typically, the sequence is further mapped into a global descriptor representative of the entire utterance/sequence. This descriptor is used for classification and analysis. In classic approaches, statistics are computed over the entire sequence and used as a global descriptor. This often results in the loss of the temporal ordering of the original sequence. Emotion is a succession of acoustic events. By discarding the temporal ordering of these events in the mapping, the classic approaches cannot detect acoustic patterns that lead to a certain emotion. In this dissertation, we propose a novel feature mapping framework. The proposed framework maps a temporally ordered sequence of acoustic features into data-driven global descriptors that integrate the temporal information from the original sequence. The framework contains three mapping algorithms. These algorithms integrate the temporal information implicitly and explicitly in the descriptor's representation. In the first algorithm, the Temporal Averaging Algorithm, we average the data temporally using leaky integrators to produce a global descriptor that implicitly integrates the temporal information from the original sequence. In order to integrate the discrimination between classes in the mapping, we propose the Temporal Response Averaging Algorithm, which combines the temporal averaging step of the previous algorithm and unsupervised learning to produce data-driven temporal contextual descriptors. In the third algorithm, we use the topology-preserving property of Self-Organizing Maps and the continuous nature of speech to map a temporal sequence into an ordered trajectory representing the behavior over time of the input utterance on a 2-D map of emotions. The temporal information is integrated explicitly in the descriptor, which makes it easier to monitor emotions in long speeches. The proposed mapping framework maps speech data of different lengths to the same equivalent representation, which alleviates the problem of dealing with variable-length temporal sequences. This is advantageous in real-time settings where the size of the analysis window can be variable. Using the proposed feature mapping framework, we build a novel data-driven speech emotion detection and recognition system that indexes speech databases to facilitate the classification and retrieval of emotions. We test the proposed system using two datasets. The first corpus is acted.
We show that the proposed mapping framework outperforms the classic approaches while providing descriptors that are suitable for the analysis and visualization of humans' emotions in speech data. The second corpus is an authentic dataset. In this dissertation, we evaluate the performance of our system using a collection of debates. For that purpose, we propose a novel debate collection that is one of the first such initiatives in the literature. We show that the proposed system is able to learn human emotions from debates.
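
    The Temporal Averaging idea outlined above, a leaky integrator run over the ordered sequence of acoustic feature vectors, can be sketched in a few lines of NumPy. The decay constant and the use of the final integrator state as the descriptor are assumptions for illustration, not the dissertation's exact formulation.

```python
# Sketch of temporal averaging with a leaky integrator (illustrative only).
import numpy as np

def leaky_descriptor(frames: np.ndarray, decay: float = 0.9) -> np.ndarray:
    """frames: (T, D) time-ordered acoustic features; returns a (D,) descriptor."""
    state = np.zeros(frames.shape[1])
    for x in frames:
        state = decay * state + (1.0 - decay) * x  # leaky integration step
    return state  # the final state implicitly encodes the temporal ordering

# Utterances of different lengths map to descriptors of the same size.
d_short = leaky_descriptor(np.random.randn(50, 13))
d_long = leaky_descriptor(np.random.randn(500, 13))
assert d_short.shape == d_long.shape == (13,)
```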

    Feature extraction based on bio-inspired model for robust emotion recognition

    Emotional state identification is an important issue in achieving more natural speech-based interactive systems. Ideally, these systems should also be able to work in real environments, in which some level of noise is generally present. Several bio-inspired representations have been applied to artificial systems for speech processing under noise conditions. In this work, an auditory signal representation is used to obtain a novel bio-inspired set of features for emotional speech signals. These characteristics, together with other spectral and prosodic features, are used for emotion recognition under noise conditions. Neural models were trained as classifiers and the results were compared to those obtained with the well-known mel-frequency cepstral coefficients. Results show that, using the proposed representations, it is possible to significantly improve the robustness of an emotion recognition system. The results were also validated in a speaker-independent scheme and with two emotional speech corpora.
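
    The MFCC baseline that the bio-inspired features are compared against could be approximated along the lines below; the SNR level, utterance-level pooling, and classifier settings are assumptions, and the bio-inspired auditory representation itself is defined in the paper rather than here.

```python
# Rough sketch of an MFCC baseline for emotion recognition under additive noise.
# SNR, pooling, and classifier settings are illustrative assumptions.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def noisy_mfcc_descriptor(y, sr, snr_db=10.0, n_mfcc=13):
    noise = np.random.randn(len(y))
    noise *= np.sqrt(np.mean(y ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    mfcc = librosa.feature.mfcc(y=y + noise, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])     # utterance-level pooling

# Hypothetical usage with a list of labelled wav files (wav_paths, emotion_labels):
# X = np.stack([noisy_mfcc_descriptor(*librosa.load(path)) for path in wav_paths])
# clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, emotion_labels)
```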

    Cross validation of bi-modal health-related stress assessment

    This study explores the feasibility of objective and ubiquitous stress assessment. 25 post-traumatic stress disorder patients participated in a controlled storytelling (ST) study and an ecologically valid reliving (RL) study. The two studies were meant to represent an early and a late therapy session, and each consisted of a "happy" and a "stress-triggering" part. Two instruments were chosen to assess the stress level of the patients at various points in time during therapy: (i) speech, used as an objective and ubiquitous stress indicator, and (ii) the subjective unit of distress (SUD), a clinically validated Likert scale. In total, 13 statistical parameters were derived from each of five speech features: amplitude, zero-crossings, power, high-frequency power, and pitch. To model the emotional state of the patients, 28 parameters were selected from this set by means of a linear regression model and, subsequently, compressed into 11 principal components. The SUD and speech model were cross-validated using 3 machine learning algorithms. Between 90% (2 SUD levels) and 39% (10 SUD levels) correct classification was achieved. The two sessions could be discriminated in 89% (for ST) and 77% (for RL) of the cases. This report fills a gap between laboratory and clinical studies, and its results emphasize the usefulness of Computer Aided Diagnostics (CAD) for mental health care.
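
    The modelling chain described above (selected statistical speech parameters, compression to 11 principal components, and cross-validated classification against SUD labels) can be sketched with scikit-learn as below; the synthetic data and the choice of an SVM classifier are illustrative assumptions, not the study's exact setup.

```python
# Sketch: 28 selected statistical speech parameters -> 11 principal components
# -> classifier, evaluated by cross-validation. Synthetic data for illustration.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 28))      # 28 selected statistical parameters per recording
y = rng.integers(0, 2, size=150)    # binarized SUD level (the 2-level case)

model = make_pipeline(StandardScaler(), PCA(n_components=11), SVC())
print(cross_val_score(model, X, y, cv=3).mean())
```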

    Novel Fourier Quadrature Transforms and Analytic Signal Representations for Nonlinear and Non-stationary Time Series Analysis

    The Hilbert transform (HT) and the associated Gabor analytic signal (GAS) representation are well-known and widely used mathematical formulations for the modeling and analysis of signals in various applications. In this study, to obtain the quadrature component of a signal as the HT does, we propose novel discrete Fourier cosine quadrature transforms (FCQTs) and discrete Fourier sine quadrature transforms (FSQTs), designated Fourier quadrature transforms (FQTs). Using these FQTs, we propose sixteen Fourier-Singh analytic signal (FSAS) representations with the following properties: (1) the real part of eight FSAS representations is the original signal and the imaginary part is the FCQT of the real part; (2) the imaginary part of eight FSAS representations is the original signal and the real part is the FSQT of the real part; (3) like the GAS, the Fourier spectrum of all FSAS representations contains only positive frequencies; however, unlike the GAS, the real and imaginary parts of the proposed FSAS representations are not orthogonal to each other. The Fourier decomposition method (FDM) is an adaptive data analysis approach that decomposes a signal into a small set of Fourier intrinsic band functions, which are AM-FM components. This study also proposes a new formulation of the FDM using the discrete cosine transform (DCT) with the GAS and FSAS representations, and demonstrates its efficacy for improved time-frequency-energy representation and analysis of nonlinear and non-stationary time series. Comment: 22 pages, 13 figures
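
    The FQT/FSAS constructions themselves are defined in the paper; the snippet below only illustrates the standard Gabor analytic signal they are contrasted with, computed via the Hilbert transform in SciPy, whose spectrum is confined to non-negative frequencies.

```python
# Standard Gabor analytic signal via the Hilbert transform (the baseline the
# proposed FSAS representations are compared against; FQTs are not reproduced here).
import numpy as np
from scipy.signal import hilbert

t = np.linspace(0, 1, 1000, endpoint=False)
x = (1 + 0.5 * np.sin(2 * np.pi * 3 * t)) * np.cos(2 * np.pi * 50 * t)  # simple AM test signal

z = hilbert(x)                  # analytic signal: real part is x, imaginary part is HT(x)
envelope = np.abs(z)            # instantaneous amplitude
phase = np.unwrap(np.angle(z))  # instantaneous phase
spectrum = np.fft.fft(z)        # energy sits only at non-negative frequencies
```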

    Ubiquitous emotion-aware computing

    Emotions are a crucial element for personal and ubiquitous computing. What to sense and how to sense it, however, remain a challenge. This study explores the rare combination of speech, electrocardiogram, and a revised Self-Assessment Mannequin to assess people's emotions. 40 people watched 30 International Affective Picture System pictures in either an office or a living-room environment. Additionally, their personality traits neuroticism and extroversion and demographic information (i.e., gender, nationality, and level of education) were recorded. The resulting data were analyzed using both basic emotion categories and the valence–arousal model, which enabled a comparison between both representations. The combination of heart rate variability and three speech measures (i.e., variability of the fundamental frequency of pitch (F0), intensity, and energy) explained 90% (p < .001) of the participants' experienced valence–arousal, with 88% for valence and 99% for arousal (ps < .001). The six basic emotions could also be discriminated (p < .001), although the explained variance was much lower: 18–20%. Environment (or context), the personality trait neuroticism, and gender proved to be useful when a nuanced assessment of people's emotions was needed. Taken together, this study provides a significant leap toward robust, generic, and ubiquitous emotion-aware computing.
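
    The three speech measures named above (F0 variability, intensity, and energy) could be extracted along the lines below; the study does not report its exact parameterization, so the librosa-based choices, F0 range, and units here are assumptions, and the path argument is a placeholder.

```python
# Illustrative extraction of the three speech measures mentioned above.
# Window sizes, F0 range, and units are assumptions, not the study's settings.
import numpy as np
import librosa

def speech_measures(path):
    y, sr = librosa.load(path)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C7'), sr=sr)
    f0_variability = float(np.nanstd(f0))                    # variability of F0 over voiced frames
    rms = librosa.feature.rms(y=y)[0]
    intensity = float(np.mean(20 * np.log10(rms + 1e-10)))   # mean frame level in dB
    energy = float(np.sum(y ** 2))                           # total signal energy
    return f0_variability, intensity, energy
```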