Search CORE

5 research outputs found

End-to-End Audiovisual Fusion with LSTMs

Author: Li Zuwei
Pantic Maja
Petridis Stavros
Wang Yujiang
Publication date: 12/09/2017

Several end-to-end deep learning approaches have been recently presented which simultaneously extract visual features from the input images and perform visual speech classification. However, research on jointly extracting audio and visual features and performing classification is very limited. In this work, we present an end-to-end audiovisual model based on Bidirectional Long Short-Term Memory (BLSTM) networks. To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the pixels and spectrograms and perform classification of speech and nonlinguistic vocalisations. The model consists of multiple identical streams, one for each modality, which extract features directly from mouth regions and spectrograms. The temporal dynamics in each stream/modality are modeled by a BLSTM and the fusion of multiple streams/modalities takes place via another BLSTM. An absolute improvement of 1.9% in the mean F1 of 4 nonlinguistic vocalisations over audio-only classification is reported on the AVIC database. At the same time, the proposed end-to-end audiovisual fusion system improves the state-of-the-art performance on the AVIC database leading to a 9.7% absolute increase in the mean F1 measure. We also perform audiovisual speech recognition experiments on the OuluVS2 database using different views of the mouth, frontal to profile. The proposed audiovisual system significantly outperforms the audio-only model for all views when the acoustic noise is high.

Comment: Accepted to AVSP 2017. arXiv admin note: substantial text overlap with arXiv:1709.00443 and text overlap with arXiv:1701.0584

arXiv.org e-Print Archive

Crossref

Affective state level recognition in naturalistic facial and vocal expressions

Author: Bianchi-Berthouz N
Meng H
Publication date: 23/04/2013

Naturalistic affective expressions change at a rate much slower than the typical rate at which video or audio is recorded. This increases the probability that consecutive recorded instants of expressions represent the same affective content. In this paper, we exploit such a relationship to improve the recognition performance of continuous naturalistic affective expressions. Using datasets of naturalistic affective expressions (AVEC 2011 audio and video dataset, PAINFUL video dataset) continuously labeled over time and over different dimensions, we analyze the transitions between levels of those dimensions (e.g., transitions in pain intensity level). We use an information theory approach to show that the transitions occur very slowly and hence suggest modeling them as first-order Markov models. The dimension levels are considered to be the hidden states in the Hidden Markov Model (HMM) framework. Their discrete transition and emission matrices are trained by using the labels provided with the training set. The recognition problem is converted into a best path-finding problem to obtain the best hidden states sequence in HMMs. This is a key difference from previous use of HMMs as classifiers. Modeling of the transitions between dimension levels is integrated in a multistage approach, where the first level performs a mapping between the affective expression features and a soft decision value (e.g., an affective dimension level), and further classification stages are modeled as HMMs that refine that mapping by taking into account the temporal relationships between the output decision labels. The experimental results for each of the unimodal datasets show overall performance to be significantly above that of a standard classification system that does not take into account temporal relationships. In particular, the results on the AVEC 2011 audio dataset outperform all other systems presented at the international competition.
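
The refinement stage described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the first-stage outputs have already been quantised to discrete symbols, estimates the matrices by simple counting with add-one smoothing, and uses the standard Viterbi algorithm as the best-path search. The function names (`estimate_hmm`, `viterbi`), the `smoothing` parameter, and the toy sequences are all hypothetical.

```python
import math

def estimate_hmm(level_seqs, obs_seqs, n_levels, n_obs, smoothing=1.0):
    """Count-based start, transition and emission estimates from labelled data.

    level_seqs: ground-truth dimension levels per frame (the hidden states).
    obs_seqs:   quantised first-stage decisions per frame (the observations).
    Add-`smoothing` counts are an assumption made here to avoid zero probabilities.
    """
    start = [smoothing] * n_levels
    trans = [[smoothing] * n_levels for _ in range(n_levels)]
    emit = [[smoothing] * n_obs for _ in range(n_levels)]
    for levels, obs in zip(level_seqs, obs_seqs):
        start[levels[0]] += 1
        for t, (state, symbol) in enumerate(zip(levels, obs)):
            emit[state][symbol] += 1
            if t > 0:
                trans[levels[t - 1]][state] += 1

    def normalise(row):
        total = sum(row)
        return [v / total for v in row]

    return normalise(start), [normalise(r) for r in trans], [normalise(r) for r in emit]

def viterbi(obs, start, trans, emit):
    """Most likely hidden-level sequence for `obs`, computed in the log domain."""
    n = len(start)
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in obs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n):
            best = max(range(n), key=lambda p: prev[p] + math.log(trans[p][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s]) + math.log(emit[s][symbol]))
        back.append(pointers)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy training data (hypothetical): two levels that change slowly over time,
# and a first-stage classifier that is occasionally wrong.
train_levels = [[0] * 8 + [1] * 4]
train_obs = [[0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1]]
start, trans, emit = estimate_hmm(train_levels, train_obs, n_levels=2, n_obs=2)

# Refine a noisy first-stage output containing an isolated spurious decision.
refined = viterbi([0, 0, 1, 0, 0], start, trans, emit)
print(refined)
```

Because the learned transition matrix favours staying at the same level, an isolated first-stage decision that disagrees with its neighbours tends to be overruled by the best path, which is the temporal smoothing effect the abstract attributes to the HMM stages.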

Crossref

UCL Discovery

Brunel University Research Archive

Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks

Author: B. Schuller (17933135)
F. Eyben (17933138)
Georgios Tzimiropoulos (17169037)
M. Pantic (17190181)
S. Petridis (17933141)
S. Zafeiriou (17190184)
Publication date: 08/02/2024

We investigate classification of non-linguistic vocalisations with a novel audiovisual approach and Long Short-Term Memory (LSTM) Recurrent Neural Networks as highly successful dynamic sequence classifiers. As database of evaluation serves this year's Paralinguistic Challenge's Audiovisual Interest Corpus of human-to-human natural conversation. For video-based analysis we compare shape and appearance based features. These are fused in an early manner with typical audio descriptors. The results show significant improvements of LSTM networks over a static approach based on Support Vector Machines. More important, we can show a significant gain in performance when fusing audio and visual shape features. © 2011 IEEE.

FigShare

Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks

Author: Eyben Florian
Pantic Maja
Petridis Stavros
Schuller Björn
Tzimiropoulos Georgios
Zafeiriou Stefanos
Publication venue: IEEE Signal Processing Society
Publication date: 01/01/2011

University of Lincoln Institutional Repository

University of Twente Research Information