BUT CHiME-7 system description
This paper describes the joint effort of Brno University of Technology (BUT),
AGH University of Krakow, and the University of Buenos Aires on the development
of Automatic Speech Recognition systems for the CHiME-7 Challenge. We train and
evaluate various end-to-end models with several toolkits. We rely heavily on
Guided Source Separation (GSS) to convert multi-channel audio to single
channel. The ASR leverages speech representations from models pre-trained by
self-supervised learning, and we fuse several ASR systems. In addition, we
modified external data from the LibriSpeech corpus to bring it closer to the
target domain and added it to the training data. Our efforts were focused on
the far-field acoustic robustness sub-track of Task 1 - Distant Automatic
Speech Recognition (DASR); our systems use oracle segmentation.
Comment: 6 pages, CHiME-7 challenge 2023
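As a rough, self-contained illustration of the multi-channel-to-single-channel front end, the sketch below implements plain delay-and-sum beamforming in Python. This is not GSS itself, which additionally exploits diarization-derived time-frequency masks; the function name and parameters are illustrative only.

import numpy as np

def delay_and_sum(multichannel, sample_rate=16000, ref_channel=0, max_delay=400):
    """Crude multi-channel -> single-channel reduction via delay-and-sum.

    Illustrative stand-in for a front end like GSS: each channel is
    aligned to a reference channel by cross-correlation over the first
    second of audio, then all channels are averaged. GSS itself also
    uses diarization-derived time-frequency masks, which this omits.
    multichannel: np.ndarray of shape (channels, samples).
    """
    ref = multichannel[ref_channel]
    aligned = []
    for ch in multichannel:
        # np.correlate index i corresponds to lag i - (N - 1), so the
        # zero-lag bin sits at the center of the 'full' output.
        corr = np.correlate(ch[:sample_rate], ref[:sample_rate], mode="full")
        center = len(corr) // 2
        window = corr[center - max_delay:center + max_delay + 1]
        delay = int(np.argmax(window)) - max_delay
        # Advance the channel by its estimated delay; np.roll wraps
        # around at the edges, which is acceptable for a sketch.
        aligned.append(np.roll(ch, -delay))
    return np.mean(aligned, axis=0)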
MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems
MeetEval is an open-source toolkit for evaluating all kinds of meeting
transcription systems. It provides a unified interface for the computation of
commonly used Word Error Rates (WERs), specifically cpWER, ORC WER, and MIMO
WER, along with other WER definitions. We extend the cpWER computation with a
temporal constraint to ensure that words are counted as correct only when the
temporal alignment is plausible. This yields a matching of the hypothesis
string to the reference string that more closely resembles the actual
transcription quality, and a system is penalized if it provides poor time
annotations. Since word-level timing information is often not available, we
present a way to approximate word-level timings from segment-level timings
(e.g., of a sentence) and show that the approximation leads to a WER similar
to that obtained with exact word-level annotations. At the same time, the
temporal constraint speeds up the matching algorithm, which outweighs the
additional overhead caused by processing the time stamps.
Comment: Accepted for presentation at the CHiME-7 workshop 2023
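The segment-to-word timing approximation lends itself to a short sketch. The version below spreads a segment's duration over its words in proportion to their character lengths, one plausible proxy for spoken duration; this is an assumption for illustration, not necessarily MeetEval's exact interpolation rule.

def approximate_word_timings(words, seg_start, seg_end):
    """Distribute a segment's time interval over its words.

    Each word gets a share of the segment duration proportional to its
    character length (an assumed proxy for spoken duration). Returns a
    list of (word, start, end) tuples.
    """
    total_chars = sum(len(w) for w in words)
    duration = seg_end - seg_start
    timings, cursor = [], seg_start
    for w in words:
        share = duration * len(w) / total_chars
        timings.append((w, cursor, cursor + share))
        cursor += share
    return timings

# Example: a 2-second segment containing three words.
print(approximate_word_timings(["so", "good", "morning"], 10.0, 12.0))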
LibriMix: An Open-Source Dataset for Generalizable Speech Separation
In recent years, wsj0-2mix has become the reference dataset for
single-channel speech separation. Most deep learning-based speech separation
models today are benchmarked on it. However, recent studies have shown
important performance drops when models trained on wsj0-2mix are evaluated on
other, similar datasets. To address this generalization issue, we created
LibriMix, an open-source alternative to wsj0-2mix, and to its noisy extension,
WHAM!. Based on LibriSpeech, LibriMix consists of two- or three-speaker
mixtures combined with ambient noise samples from WHAM!. Using Conv-TasNet, we
achieve competitive performance on all LibriMix versions. In order to fairly
evaluate across datasets, we introduce a third test set based on VCTK for
speech and WHAM! for noise. Our experiments show that the generalization error
is smaller for models trained with LibriMix than with WHAM!, in both clean and
noisy conditions. Aiming towards evaluation in more realistic,
conversation-like scenarios, we also release a sparsely overlapping version of
LibriMix's test set.
Comment: Submitted to INTERSPEECH 2020
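To make the construction concrete, the sketch below mixes two speech signals and a noise sample at chosen signal-to-noise ratios using simple RMS scaling. The file paths are hypothetical, and LibriMix's actual recipe works with loudness targets and clipping handling rather than this bare RMS scheme.

import numpy as np
import soundfile as sf  # assumed available; any WAV reader would do

def mix_at_snr(speech1, speech2, noise, spk_snr_db=0.0, noise_snr_db=5.0):
    """Create a LibriMix-style two-speaker noisy mixture (simplified)."""
    def rms(x):
        return np.sqrt(np.mean(x ** 2))

    # Truncate all sources to the shortest signal ("min" mode).
    n = min(len(speech1), len(speech2), len(noise))
    s1, s2, nz = speech1[:n], speech2[:n], noise[:n]
    # Scale speaker 2 relative to speaker 1 at the desired SNR.
    s2 = s2 * (rms(s1) / rms(s2)) / (10 ** (spk_snr_db / 20))
    # Scale the noise relative to the combined speech.
    nz = nz * (rms(s1 + s2) / rms(nz)) / (10 ** (noise_snr_db / 20))
    return s1 + s2 + nz

# Hypothetical inputs: two LibriSpeech utterances and a WHAM! noise clip.
s1, sr = sf.read("librispeech_utt1.wav")
s2, _ = sf.read("librispeech_utt2.wav")
nz, _ = sf.read("wham_noise.wav")
sf.write("mixture.wav", mix_at_snr(s1, s2, nz), sr)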
NTT speaker diarization system for CHiME-7: multi-domain, multi-microphone end-to-end and vector clustering diarization
This paper details our speaker diarization system designed for multi-domain,
multi-microphone casual conversations. The proposed diarization pipeline uses
weighted prediction error (WPE)-based dereverberation as a front end and then
applies end-to-end neural diarization with vector clustering (EEND-VC) to each
channel separately. It integrates the per-channel diarization results using
diarization output voting error reduction plus overlap (DOVER-Lap). To harness
knowledge from the target domain and the results integrated across all
channels, we apply self-supervised adaptation for each session by retraining
the EEND-VC with pseudo-labels derived from DOVER-Lap. The proposed system was
incorporated into NTT's submission for the distant automatic speech
recognition task in the CHiME-7 challenge. Our system achieved 65% and 62%
relative improvements on the development and evaluation sets, respectively,
compared to the organizer-provided VC-based baseline diarization system,
securing third place in diarization performance.
Comment: 5 pages, 5 figures, Submitted to ICASSP 2024
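To make the channel-integration step concrete, here is a heavily simplified stand-in for DOVER-Lap: a frame-wise majority vote over per-channel speaker labels. It assumes the labels of all channels are already mapped to a common speaker space; real DOVER-Lap solves that label-mapping problem itself and also handles overlapping speech.

import numpy as np
from collections import Counter

def majority_vote_diarization(channel_labels):
    """Frame-wise majority vote over per-channel diarization labels.

    channel_labels: integer array of shape (channels, frames), with the
    speaker ids of all channels assumed to share one label space.
    """
    frames = channel_labels.shape[1]
    fused = np.empty(frames, dtype=channel_labels.dtype)
    for t in range(frames):
        fused[t] = Counter(channel_labels[:, t]).most_common(1)[0][0]
    return fused

# Three channels, mostly agreeing on the speaker sequence 0 0 1 1 2.
labels = np.array([[0, 0, 1, 1, 2],
                   [0, 1, 1, 1, 2],
                   [0, 0, 1, 2, 2]])
print(majority_vote_diarization(labels))  # -> [0 0 1 1 2]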
The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios
The CHiME challenges have played a significant role in the development and
evaluation of robust automatic speech recognition (ASR) systems. We introduce
the CHiME-7 distant ASR (DASR) task within the 7th CHiME challenge. This task
comprises joint ASR and diarization in far-field settings with multiple, and
possibly heterogeneous, recording devices. Unlike previous challenges, we
evaluate systems on three diverse scenarios: CHiME-6, DiPCo, and Mixer 6. The
goal is for participants to devise a single system that can generalize across
different array geometries and use cases with no a priori information. Another
departure from earlier CHiME iterations is that participants are allowed to
use open-source pre-trained models and datasets. In this paper, we describe
the challenge design, motivation, and fundamental research questions in
detail. We also present the baseline system, which is fully array-topology
agnostic and features multi-channel diarization, channel selection, guided
source separation, and a robust ASR model that leverages self-supervised
speech representations (SSLR).
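Of the baseline components, channel selection is simple enough to sketch. The heuristic below ranks channels by the variance of their log-energy envelope, on the assumption that higher envelope variance indicates a cleaner, less reverberant capture; this is an illustrative method, not necessarily the one used in the actual baseline.

import numpy as np

def select_channels(multichannel, frame=400, hop=160, top_k=4):
    """Rank channels by log-energy envelope variance and keep the top k.

    multichannel: np.ndarray of shape (channels, samples). Channels whose
    short-time log energy fluctuates more are assumed to be closer to the
    source and less smeared by reverberation (an illustrative heuristic).
    """
    scores = []
    for ch in multichannel:
        n_frames = (len(ch) - frame) // hop + 1
        energies = np.array([np.sum(ch[i * hop:i * hop + frame] ** 2)
                             for i in range(n_frames)])
        scores.append(np.var(np.log(energies + 1e-10)))
    return np.argsort(scores)[::-1][:top_k]  # best channels first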
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
Following the success of the 1st, 2nd, 3rd, 4th, and 5th CHiME challenges, we
organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).
The new challenge revisits the previous CHiME-5 challenge and further
considers the problem of distant multi-microphone conversational speech
diarization and recognition in everyday home environments. The speech material
is the same as the previous CHiME-5 recordings except for accurate array
synchronization. The material was elicited using a dinner-party scenario, with
efforts taken to capture data that is representative of natural conversational
speech. This paper provides a baseline description of the CHiME-6 challenge
for both segmented multispeaker speech recognition (Track 1) and unsegmented
multispeaker speech recognition (Track 2). Of note, Track 2 is the first
challenge activity in the community to tackle an unsegmented multispeaker
speech recognition scenario with a complete set of reproducible open-source
baselines providing speech enhancement, speaker diarization, and speech
recognition modules.