3 research outputs found
Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition
Audio-video emotion recognition aims to classify a given video into basic emotions. In this paper, we describe our approaches in EmotiW 2019, which mainly explore emotion features and feature-fusion strategies for the audio and visual modalities. For emotion features, we explore audio features based on both speech spectrograms and log Mel-spectrograms, and we evaluate several facial features with different CNN models and different emotion-pretraining strategies. For fusion strategies, we explore intra-modal and cross-modal fusion methods, such as designing attention mechanisms to highlight important emotion features, and exploring feature concatenation and factorized bilinear pooling (FBP) for cross-modal feature fusion. With careful evaluation, we obtain 65.5% on the AFEW validation set and 62.48% on the test set, ranking third in the challenge.
Comment: Accepted by ACM ICMI'19 (2019 International Conference on Multimodal Interaction).
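The factorized bilinear pooling (FBP) mentioned in the abstract projects each modality's feature into a shared factor space, multiplies the projections elementwise, and sum-pools over factor groups. A minimal sketch of this idea follows; the function name, dimensions, and the signed-square-root/L2 normalization step are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def factorized_bilinear_pooling(x, y, U, V, k):
    """Fuse two modality features x (m,) and y (n,) into a d-dim vector.

    U has shape (m, d*k) and V has shape (n, d*k); k is the number of
    factors summed per output dimension.
    """
    xp = U.T @ x                              # project audio feature -> (d*k,)
    yp = V.T @ y                              # project visual feature -> (d*k,)
    z = (xp * yp).reshape(-1, k).sum(axis=1)  # elementwise product, sum-pool -> (d,)
    z = np.sign(z) * np.sqrt(np.abs(z))       # signed square-root (common with FBP)
    n = np.linalg.norm(z)
    return z / n if n > 0 else z              # L2 normalization
```

In practice U and V would be learned projection matrices inside a neural network; this sketch only shows the pooling arithmetic on fixed arrays.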
Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis
In this paper, we describe and publicly release the audio-visual wake word spotting (WWS) database from the MISP2021 Challenge, which covers a range of scenarios with audio and video data collected by near-, mid-, and far-field microphone arrays and cameras, creating a shared, publicly available database for WWS. The database and code are released, and will be a valuable addition to the community for promoting WWS research using multi-modality information in realistic and complex conditions. Moreover, we investigate different data augmentation methods for single modalities on an end-to-end WWS network. A set of audio-visual fusion experiments and analyses were conducted to observe the assistance that visual information provides to acoustic information under different audio and video field configurations. The results show that the fusion system generally improves over the single-modality (audio- or video-only) system, especially under complex noisy conditions.
The First Multimodal Information Based Speech Processing (MISP) Challenge: Data, Tasks, Baselines and Results
In this paper we discuss the rationale of the Multimodal Information based Speech Processing (MISP) Challenge, and provide a detailed description of the data recorded, the two evaluation tasks, and the corresponding baselines, followed by a summary of submitted systems and evaluation results. The MISP Challenge aims at tackling speech processing tasks in different scenarios by introducing information from an additional modality (e.g., video or text), which will hopefully lead to better environmental and speaker robustness in realistic applications. In the first MISP Challenge, two benchmark datasets recorded in a real home TV room, together with two reproducible open-source baseline systems, have been released to promote research in audio-visual wake word spotting (AVWWS) and audio-visual speech recognition (AVSR). To our knowledge, MISP is the first open evaluation challenge to tackle real-world issues of AVWWS and AVSR in the home TV scenario.