Search CORE

11,960 research outputs found

Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization

Author: Cheng Ming
Li Ming
Publication venue
Publication date: 29/02/2024
Field of study

Audio-visual learning has demonstrated promising results in many classical speech tasks (e.g., speech separation, automatic speech recognition, wake-word spotting). We believe that introducing visual modality will also benefit speaker diarization. To date, Target-Speaker Voice Activity Detection (TS-VAD) plays an important role in highly accurate speaker diarization. However, previous TS-VAD models take audio features and utilize the speaker's acoustic footprint to distinguish his or her personal speech activities, which is easily affected by overlapped speech in multi-speaker scenarios. Although visual information naturally tolerates overlapped speech, it suffers from spatial occlusion, low resolution, etc. The potential modality-missing problem blocks TS-VAD towards an audio-visual approach. This paper proposes a novel Multi-Input Multi-Output Target-Speaker Voice Activity Detection (MIMO-TSVAD) framework for speaker diarization. The proposed method can take audio-visual input and leverage the speaker's acoustic footprint or lip track to flexibly conduct audio-based, video-based, and audio-visual speaker diarization in a unified sequence-to-sequence framework. Experimental results show that the MIMO-TSVAD framework demonstrates state-of-the-art performance on the VoxConverse, DIHARD-III, and MISP 2022 datasets under corresponding evaluation metrics, obtaining the Diarization Error Rates (DERs) of 4.18%, 10.10%, and 8.15%, respectively. In addition, it can perform robustly in heavy lip-missing scenarios.Comment: Under review of IEEE/ACM Transactions on Audio, Speech, and Language Processin

arXiv.org e-Print Archive

End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

Author: Busso Carlos
Tao Fei
Publication venue
Publication date: 12/09/2018
Field of study

Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whisper speech) or background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture in a principled way the temporal relationships between acoustic and visual features. This study explores this idea proposing a \emph{bimodal recurrent neural network} (BRNN) framework for SAD. The approach models the temporal dynamic of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of estimating hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are directly learned from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 hours of recordings, collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements up to 1.2% under practical scenarios over a VAD baseline using only audio implemented with deep neural network (DNN). The proposed approach achieves 92.7% F1-score when it is evaluated using the sensors from a portable tablet under noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech obtained with a high definition camera and a close-talking microphone).Comment: Submitted to Speech Communicatio

arXiv.org e-Print Archive

The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset: Collection, Insights and Improvements

Author: Baird Alice
Schuller Björn
Schumann Lea
Stappen Lukas
Publication venue
Publication date: 20/10/2021
Field of study

Truly real-life data presents a strong, but exciting challenge for sentiment and emotion research. The high variety of possible `in-the-wild' properties makes large datasets such as these indispensable with respect to building robust machine learning models. A sufficient quantity of data covering a deep variety in the challenges of each modality to force the exploratory analysis of the interplay of all modalities has not yet been made available in this context. In this contribution, we present MuSe-CaR, a first of its kind multimodal dataset. The data is publicly available as it recently served as the testing bed for the 1st Multimodal Sentiment Analysis Challenge, and focused on the tasks of emotion, emotion-target engagement, and trustworthiness recognition by means of comprehensively integrating the audio-visual and language modalities. Furthermore, we give a thorough overview of the dataset in terms of collection and annotation, including annotation tiers not used in this year's MuSe 2020. In addition, for one of the sub-challenges - predicting the level of trustworthiness - no participant outperformed the baseline model, and so we propose a simple, but highly efficient Multi-Head-Attention network that exceeds using multimodal fusion the baseline by around 0.2 CCC (almost 50 % improvement).Comment: accepted versio

arXiv.org e-Print Archive

OPUS Augsburg

Short review A "voice patch" system in the primate brain for processing vocal information?

Author: Aglieri Virginia
Belin Pascal,
Publication venue: 'Elsevier BV'
Publication date: 07/05/2018
Field of study

International audienceWe review behavioural and neural evidence for the processing of information contained in conspecific vocalizations (CVs) in three primate species: humans, macaques and marmosets. We focus on abilities that are present and ecologically relevant in all three species: the detection and sensitivity to CVs; and the processing of identity cues in CVs. Current evidence, although fragmentary, supports the notion of a "voice patch system" in the primate brain analogous to the face patch system of visual cortex: a series of discrete, interconnected cortical areas supporting increasingly abstract representations of the vocal input. A central question concerns the degree to which the voice patch system is conserved in evolution. We outline challenges that arise and suggesting potential avenues for comparing the organization of the voice patch system across primate brains

HAL AMU

HAL Descartes

Understanding public speakers’ performance: first contributions to support a computational approach

Author: A Cullen
A Zhen
B Martinez
DR Carney
E Insafutdinov
F Iani
J Holler
J Kreitewolf
L Chen
M Koppensteiner
N Sadoughi
P Boersma
QF Gronau
SJ Vick
T Özseven
TJ Park
Z Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020
Field of study

Communication is part of our everyday life and our ability to communicate can have a significant role in a variety of contexts in our personal, academic, and professional lives. For long, the characterization of what is a good communicator has been subject to research and debate by several areas, particularly in Education, with a focus on improving the performance of teachers. In this context, the literature suggests that the ability to communicate is not only defined by the verbal component, but also by a plethora of non-verbal contributions providing redundant or complementary information, and, sometimes, being the message itself. However, even though we can recognize a good or bad communicator, objectively, little is known about what aspects – and to what extent—define the quality of a presentation. The goal of this work is to create the grounds to support the study of the defining characteristics of a good communicator in a more systematic and objective form. To this end, we conceptualize and provide a first prototype for a computational approach to characterize the different elements that are involved in communication, from audiovisual data, illustrating the outcomes and applicability of the proposed methods on a video database of public speakers.publishe

Crossref

Repositório Institucional da Universidade de Aveiro