17 research outputs found

Challenges and Insights: Exploring 3D Spatial Features and Complex Networks on the MISP Dataset

Author: Shao Yiwen
Publication date: 05/10/2023

Multi-channel multi-talker speech recognition presents formidable challenges in the realm of speech processing, marked by issues such as background noise, reverberation, and overlapping speech. Overcoming these complexities requires leveraging contextual cues to separate target speech from a cacophonous mix, enabling accurate recognition. Among these cues, the 3D spatial feature has emerged as a cutting-edge solution, particularly when equipped with spatial information about the target speaker. Its exceptional ability to discern the target speaker within mixed audio, often rendering intermediate processing redundant, paves the way for the direct training of "All-in-one" ASR models. These models have demonstrated commendable performance on both simulated and real-world data. In this paper, we extend this approach to the MISP dataset to further validate its efficacy. We delve into the challenges encountered and insights gained when applying 3D spatial features to MISP, while also exploring preliminary experiments involving the replacement of these features with more complex input and models.

arXiv.org e-Print Archive

The Speed Submission to DIHARD II: Contributions & Lessons Learned

Authors: Barras Claude, Bredin Hervé, Brutti Alessio, Cornell Samuele, Evans Nicholas, Korshunov Pavel, Marcel Sébastien, Patino Jose, Sahidullah Md, Serizel Romain, Sivasankaran Sunit, Squartini Stefano, Vincent Emmanuel, Yin Ruiqing
Publication venue: HAL CCSD
Publication date: 01/01/2019

This paper describes the speaker diarization systems developed for the Second DIHARD Speech Diarization Challenge (DIHARD II) by the Speed team. Besides describing the system, which considerably outperformed the challenge baselines, we also focus on the lessons learned from numerous approaches that we tried for single and multi-channel systems. We present several components of our diarization system, including categorization of domains, speech enhancement, speech activity detection, speaker embeddings, clustering methods, resegmentation, and system fusion. We analyze and discuss the effect of each such component on the overall diarization performance within the realistic settings of the challenge.

INRIA a CCSD electronic archive server