940 research outputs found

    Analyzing Input and Output Representations for Speech-Driven Gesture Generation

    This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal, and MotionD then decodes these representations to produce motion sequences. We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform best. The results of a subsequent user study confirm the benefits of representation learning.
    Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code is available at https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
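    The two-step pipeline lends itself to a compact sketch. Below is a minimal PyTorch illustration: the component names MotionE, MotionD and SpeechE come from the abstract, but the plain fully-connected layers and all dimensions (POSE_DIM, LATENT_DIM, MFCC_DIM) are assumptions for illustration, not the paper's actual architecture.

    # Minimal sketch of the two-step pipeline described above (PyTorch).
    # Layer sizes and frame dimensions are assumptions; the paper's actual
    # architectures may differ.
    import torch
    import torch.nn as nn

    POSE_DIM = 192      # assumed: 64 joints x 3D coordinates per frame
    LATENT_DIM = 45     # assumed representation size (the paper evaluates several)
    MFCC_DIM = 26       # assumed speech-feature dimensionality (MFCCs)

    class MotionE(nn.Module):                     # motion encoder
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(POSE_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, LATENT_DIM))
        def forward(self, pose):
            return self.net(pose)

    class MotionD(nn.Module):                     # motion decoder
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, POSE_DIM))
        def forward(self, z):
            return self.net(z)

    class SpeechE(nn.Module):                     # speech-to-representation encoder
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(MFCC_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, LATENT_DIM))
        def forward(self, speech):
            return self.net(speech)

    # Step 1: train MotionE + MotionD as a denoising autoencoder on noisy poses.
    # Step 2: train SpeechE to predict MotionE's representations from speech.
    # At test time, chain SpeechE and the (frozen) MotionD:
    speech_e, motion_d = SpeechE(), MotionD()
    mfcc_frames = torch.randn(100, MFCC_DIM)      # 100 frames of speech features
    poses = motion_d(speech_e(mfcc_frames))       # predicted 3D pose sequence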

    Attention-Based Recurrent Autoencoder for Motion Capture Denoising

    To address the problem of massive data loss in optical motion capture (MoCap), we propose a novel network architecture based on an attention mechanism and a recurrent network. Its encoder-decoder design enables automatic learning of the human-motion manifold, capturing the hidden spatio-temporal relationships in motion sequences. In addition, the multi-head attention mechanism makes it possible to identify the frames most relevant to each corrupted frame, together with their positional information, in order to recover the missing markers, leading to more accurate motion reconstruction. Simulation experiments demonstrate that the proposed network model can effectively handle the large-scale missing-markers problem, with better robustness, smaller errors and more natural recovered motion sequences than the reference method
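    As a rough illustration of this kind of architecture, the PyTorch sketch below combines a recurrent encoder-decoder with multi-head self-attention over the frame sequence. The marker dimensionality, layer sizes and the exact placement of the attention block are assumptions, not the paper's specification.

    # Sketch of an attention-based recurrent autoencoder for filling missing
    # MoCap markers, assuming corrupted entries are zeroed in the input.
    import torch
    import torch.nn as nn

    class AttnRecurrentAE(nn.Module):
        def __init__(self, marker_dim=123, hidden=256, heads=4):
            super().__init__()
            self.encoder = nn.GRU(marker_dim, hidden, batch_first=True)
            # multi-head self-attention lets the decoder draw on the hidden
            # states of the most relevant (possibly distant) clean frames
            self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
            self.decoder = nn.GRU(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, marker_dim)

        def forward(self, seq):                   # seq: (batch, frames, markers)
            enc, _ = self.encoder(seq)            # one hidden state per frame
            ctx, _ = self.attn(enc, enc, enc)     # self-attention over frames
            dec, _ = self.decoder(ctx)
            return self.out(dec)                  # reconstructed marker sequence

    model = AttnRecurrentAE()
    corrupted = torch.randn(8, 120, 123)          # 8 clips, 120 frames each
    recovered = model(corrupted)                  # same shape, gaps filled in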

    CARDIAN: a novel computational approach for real-time end-diastolic frame detection in intravascular ultrasound using bidirectional attention networks

    INTRODUCTION: Changes in coronary artery luminal dimensions during the cardiac cycle can impact the accurate quantification of volumetric analyses in intravascular ultrasound (IVUS) image studies. Accurate end-diastolic (ED) frame detection is pivotal for guiding interventional decisions, optimizing therapeutic interventions, and ensuring standardized volumetric analysis in research studies. Images acquired at different phases of the cardiac cycle may also lead to inaccurate quantification of atheroma volume due to the longitudinal motion of the catheter in relation to the vessel. As IVUS images are acquired throughout the cardiac cycle, ED frames are typically identified retrospectively by human analysts to minimize motion artefacts and enable more accurate and reproducible volumetric analysis. METHODS: In this paper, a novel neural network-based approach for accurate ED-frame detection in IVUS sequences is proposed, trained using electrocardiogram (ECG) signals acquired synchronously during IVUS acquisition. The framework integrates dedicated motion encoders and a bidirectional attention recurrent network (BARNet) with a temporal difference encoder to extract frame-by-frame motion features corresponding to the phases of the cardiac cycle. In addition, a spatiotemporal rotation encoder is included to capture the IVUS catheter's rotational movement with respect to the coronary artery. RESULTS: With a prediction tolerance of 66.7 ms, the proposed approach found 71.9%, 67.8%, and 69.9% of end-diastolic frames in the left anterior descending, left circumflex and right coronary arteries, respectively, when tested against ECG estimations. When compared with the estimations of two expert analysts, the approach achieved superior performance. DISCUSSION: These findings indicate that the developed methodology is accurate and fully reproducible, and should therefore be preferred over expert analysts for ED-frame detection in IVUS sequences
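    A hypothetical sketch of such a pipeline is shown below: frame-to-frame intensity differences feed a small convolutional motion encoder, and a bidirectional recurrent network scores every frame as end-diastolic or not, with the synchronized ECG providing training labels. For brevity the attention mechanism and the spatiotemporal rotation encoder are omitted, and the class name EDFrameDetector and all layer sizes are illustrative rather than taken from BARNet.

    # Illustrative ED-frame detector: temporal-difference motion features
    # plus a bidirectional GRU over the IVUS pullback sequence (PyTorch).
    import torch
    import torch.nn as nn

    class EDFrameDetector(nn.Module):
        def __init__(self, hidden=128):
            super().__init__()
            # motion encoder on frame-to-frame intensity differences
            self.motion = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())    # 32 features per frame
            self.rnn = nn.GRU(32, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, 1)          # per-frame ED score

        def forward(self, frames):                # frames: (batch, T, H, W)
            diffs = frames[:, 1:] - frames[:, :-1]        # temporal differences
            b, t, h, w = diffs.shape
            feats = self.motion(diffs.reshape(b * t, 1, h, w)).reshape(b, t, -1)
            seq, _ = self.rnn(feats)              # context from both directions
            return self.head(seq).squeeze(-1)     # logits; ECG supplies labels

    detector = EDFrameDetector()
    pullback = torch.randn(2, 200, 64, 64)        # 2 sequences, 200 frames each
    ed_logits = detector(pullback)                # shape (2, 199)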

    Ultrasound based Silent Speech Interface using Deep Learning

    Silent Speech Interface (SSI) is a technology able to synthesize speech in the absence of any acoustic signal. It can be useful for laryngectomy patients, in noisy environments, or for silent calls. This thesis explores the particular case of SSI using ultrasound images of the tongue as input signals. A 'direct synthesis' approach based on deep neural networks and mel-generalized cepstral (MGC) coefficients is proposed. This document is an extension of Csapó et al., "DNN-based Ultrasound-to-Speech Conversion for a Silent Speech Interface". Several deep learning models, such as basic feed-forward neural networks, convolutional neural networks (CNNs) and recurrent neural networks, are presented and discussed. A denoising pre-processing step based on a deep convolutional autoencoder has also been studied. A considerable number of experiments with different deep learning architectures, together with an extensive hyperparameter optimization study, have been carried out. The experiments were evaluated using several objective and subjective quality measures. According to the experiments, an architecture based on a CNN and bidirectional LSTM layers showed the best results in both objective and subjective terms
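    The best-performing configuration, a CNN front-end followed by bidirectional LSTM layers, can be sketched roughly as follows in PyTorch. The image resolution, MGC order and layer widths are assumptions; the thesis's actual models may differ.

    # Sketch of a CNN + bidirectional LSTM ultrasound-to-speech model that
    # maps tongue-image sequences to per-frame spectral (MGC) parameters.
    import torch
    import torch.nn as nn

    class UltrasoundToSpeech(nn.Module):
        def __init__(self, mgc_order=25, hidden=256):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten())    # 32*4*4 = 512 features
            self.lstm = nn.LSTM(512, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, mgc_order)   # one MGC frame per image

        def forward(self, frames):                # frames: (batch, T, H, W)
            b, t, h, w = frames.shape
            feats = self.cnn(frames.reshape(b * t, 1, h, w)).reshape(b, t, -1)
            seq, _ = self.lstm(feats)
            return self.out(seq)                  # (batch, T, mgc_order)

    model = UltrasoundToSpeech()
    tongue = torch.randn(4, 50, 64, 128)          # 4 clips, 50 ultrasound frames
    mgc = model(tongue)                           # spectral features for a vocoder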