BUT CHiME-7 system description
This paper describes the joint effort of Brno University of Technology (BUT),
AGH University of Krakow, and the University of Buenos Aires on the development
of Automatic Speech Recognition systems for the CHiME-7 Challenge. We train and
evaluate various end-to-end models with several toolkits. We rely heavily on
Guided Source Separation (GSS) to convert multi-channel audio to single
channel. The ASR leverages speech representations from models pre-trained by
self-supervised learning, and we fuse several ASR systems. In addition, we
modified external data from the LibriSpeech corpus to bring it closer to the
target domain and added it to the training data. Our efforts were focused on
the far-field acoustic robustness sub-track of Task 1 - Distant Automatic
Speech Recognition (DASR); our systems use oracle segmentation.
Comment: 6 pages, CHiME-7 challenge 2023
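As a rough, self-contained illustration of the multi-channel-to-single-channel front end, the sketch below implements plain delay-and-sum beamforming in Python. This is not GSS itself, which additionally exploits diarization-derived time-frequency masks; the function name and parameters are illustrative only.

import numpy as np

def delay_and_sum(multichannel, sample_rate=16000, ref_channel=0, max_delay=400):
    """Crude multi-channel -> single-channel reduction via delay-and-sum.

    Illustrative stand-in for a front end like GSS: each channel is
    aligned to a reference channel by cross-correlation over the first
    second of audio, then all channels are averaged. GSS itself also
    uses diarization-derived time-frequency masks, which this omits.
    multichannel: np.ndarray of shape (channels, samples).
    """
    ref = multichannel[ref_channel]
    aligned = []
    for ch in multichannel:
        # np.correlate index i corresponds to lag i - (N - 1), so the
        # zero-lag bin sits at the center of the 'full' output.
        corr = np.correlate(ch[:sample_rate], ref[:sample_rate], mode="full")
        center = len(corr) // 2
        window = corr[center - max_delay:center + max_delay + 1]
        delay = int(np.argmax(window)) - max_delay
        # Advance the channel by its estimated delay; np.roll wraps
        # around at the edges, which is acceptable for a sketch.
        aligned.append(np.roll(ch, -delay))
    return np.mean(aligned, axis=0)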
MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems
MeetEval is an open-source toolkit for evaluating all kinds of meeting
transcription systems. It provides a unified interface for the computation of
commonly used Word Error Rates (WERs), specifically cpWER, ORC WER, and MIMO
WER, along with other WER definitions. We extend the cpWER computation with a
temporal constraint to ensure that words are counted as correct only when the
temporal alignment is plausible. This yields a matching of the hypothesis
string to the reference string that more closely resembles the actual
transcription quality, and a system is penalized if it provides poor time
annotations. Since word-level timing information is often not available, we
present a way to approximate word-level timings from segment-level timings
(e.g., of a sentence) and show that the approximation leads to a WER similar
to that obtained with exact word-level annotations. At the same time, the
temporal constraint speeds up the matching algorithm, which outweighs the
additional overhead caused by processing the time stamps.
Comment: Accepted for presentation at the CHiME-7 workshop 2023
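The segment-to-word timing approximation lends itself to a short sketch. The version below spreads a segment's duration over its words in proportion to their character lengths, one plausible proxy for spoken duration; this is an assumption for illustration, not necessarily MeetEval's exact interpolation rule.

def approximate_word_timings(words, seg_start, seg_end):
    """Distribute a segment's time interval over its words.

    Each word gets a share of the segment duration proportional to its
    character length (an assumed proxy for spoken duration). Returns a
    list of (word, start, end) tuples.
    """
    total_chars = sum(len(w) for w in words)
    duration = seg_end - seg_start
    timings, cursor = [], seg_start
    for w in words:
        share = duration * len(w) / total_chars
        timings.append((w, cursor, cursor + share))
        cursor += share
    return timings

# Example: a 2-second segment containing three words.
print(approximate_word_timings(["so", "good", "morning"], 10.0, 12.0))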
LibriMix: An Open-Source Dataset for Generalizable Speech Separation
In recent years, wsj0-2mix has become the reference dataset for
single-channel speech separation. Most deep learning-based speech separation
models today are benchmarked on it. However, recent studies have shown
important performance drops when models trained on wsj0-2mix are evaluated on
other, similar datasets. To address this generalization issue, we created
LibriMix, an open-source alternative to wsj0-2mix, and to its noisy extension,
WHAM!. Based on LibriSpeech, LibriMix consists of two- or three-speaker
mixtures combined with ambient noise samples from WHAM!. Using Conv-TasNet, we
achieve competitive performance on all LibriMix versions. In order to fairly
evaluate across datasets, we introduce a third test set based on VCTK for
speech and WHAM! for noise. Our experiments show that the generalization error
is smaller for models trained with LibriMix than with WHAM!, in both clean and
noisy conditions. Aiming towards evaluation in more realistic,
conversation-like scenarios, we also release a sparsely overlapping version of
LibriMix's test set.
Comment: Submitted to INTERSPEECH 2020
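To make the construction concrete, the sketch below mixes two speech signals and a noise sample at chosen signal-to-noise ratios using simple RMS scaling. The file paths are hypothetical, and LibriMix's actual recipe works with loudness targets and clipping handling rather than this bare RMS scheme.

import numpy as np
import soundfile as sf  # assumed available; any WAV reader would do

def mix_at_snr(speech1, speech2, noise, spk_snr_db=0.0, noise_snr_db=5.0):
    """Create a LibriMix-style two-speaker noisy mixture (simplified)."""
    def rms(x):
        return np.sqrt(np.mean(x ** 2))

    # Truncate all sources to the shortest signal ("min" mode).
    n = min(len(speech1), len(speech2), len(noise))
    s1, s2, nz = speech1[:n], speech2[:n], noise[:n]
    # Scale speaker 2 relative to speaker 1 at the desired SNR.
    s2 = s2 * (rms(s1) / rms(s2)) / (10 ** (spk_snr_db / 20))
    # Scale the noise relative to the combined speech.
    nz = nz * (rms(s1 + s2) / rms(nz)) / (10 ** (noise_snr_db / 20))
    return s1 + s2 + nz

# Hypothetical inputs: two LibriSpeech utterances and a WHAM! noise clip.
s1, sr = sf.read("librispeech_utt1.wav")
s2, _ = sf.read("librispeech_utt2.wav")
nz, _ = sf.read("wham_noise.wav")
sf.write("mixture.wav", mix_at_snr(s1, s2, nz), sr)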
NTT speaker diarization system for CHiME-7: multi-domain, multi-microphone end-to-end and vector clustering diarization
This paper details our speaker diarization system designed for multi-domain,
multi-microphone casual conversations. The proposed diarization pipeline uses
weighted prediction error (WPE)-based dereverberation as a front end and then
applies end-to-end neural diarization with vector clustering (EEND-VC) to each
channel separately. It integrates the per-channel diarization results using
diarization output voting error reduction plus overlap (DOVER-Lap). To harness
knowledge from the target domain and the results integrated across all
channels, we apply self-supervised adaptation for each session by retraining
the EEND-VC with pseudo-labels derived from DOVER-Lap. The proposed system was
incorporated into NTT's submission for the distant automatic speech
recognition task in the CHiME-7 challenge. Our system achieved 65% and 62%
relative improvements on the development and evaluation sets, respectively,
compared to the organizer-provided VC-based baseline diarization system,
securing third place in diarization performance.
Comment: 5 pages, 5 figures, Submitted to ICASSP 2024
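To make the channel-integration step concrete, here is a heavily simplified stand-in for DOVER-Lap: a frame-wise majority vote over per-channel speaker labels. It assumes the labels of all channels are already mapped to a common speaker space; real DOVER-Lap solves that label-mapping problem itself and also handles overlapping speech.

import numpy as np
from collections import Counter

def majority_vote_diarization(channel_labels):
    """Frame-wise majority vote over per-channel diarization labels.

    channel_labels: integer array of shape (channels, frames), with the
    speaker ids of all channels assumed to share one label space.
    """
    frames = channel_labels.shape[1]
    fused = np.empty(frames, dtype=channel_labels.dtype)
    for t in range(frames):
        fused[t] = Counter(channel_labels[:, t]).most_common(1)[0][0]
    return fused

# Three channels, mostly agreeing on the speaker sequence 0 0 1 1 2.
labels = np.array([[0, 0, 1, 1, 2],
                   [0, 1, 1, 1, 2],
                   [0, 0, 1, 2, 2]])
print(majority_vote_diarization(labels))  # -> [0 0 1 1 2]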
The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios
The CHiME challenges have played a significant role in the development and
evaluation of robust automatic speech recognition (ASR) systems. We introduce
the CHiME-7 distant ASR (DASR) task within the 7th CHiME challenge. This task
comprises joint ASR and diarization in far-field settings with multiple, and
possibly heterogeneous, recording devices. Unlike previous challenges, we
evaluate systems on three diverse scenarios: CHiME-6, DiPCo, and Mixer 6. The
goal is for participants to devise a single system that can generalize across
different array geometries and use cases with no a priori information. Another
departure from earlier CHiME iterations is that participants are allowed to
use open-source pre-trained models and datasets. In this paper, we describe
the challenge design, motivation, and fundamental research questions in
detail. We also present the baseline system, which is fully array-topology
agnostic and features multi-channel diarization, channel selection, guided
source separation, and a robust ASR model that leverages self-supervised
speech representations (SSLR).
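Of the baseline components, channel selection is simple enough to sketch. The heuristic below ranks channels by the variance of their log-energy envelope, on the assumption that higher envelope variance indicates a cleaner, less reverberant capture; this is an illustrative method, not necessarily the one used in the actual baseline.

import numpy as np

def select_channels(multichannel, frame=400, hop=160, top_k=4):
    """Rank channels by log-energy envelope variance and keep the top k.

    multichannel: np.ndarray of shape (channels, samples). Channels whose
    short-time log energy fluctuates more are assumed to be closer to the
    source and less smeared by reverberation (an illustrative heuristic).
    """
    scores = []
    for ch in multichannel:
        n_frames = (len(ch) - frame) // hop + 1
        energies = np.array([np.sum(ch[i * hop:i * hop + frame] ** 2)
                             for i in range(n_frames)])
        scores.append(np.var(np.log(energies + 1e-10)))
    return np.argsort(scores)[::-1][:top_k]  # best channels first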
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
Following the success of the 1st, 2nd, 3rd, 4th, and 5th CHiME challenges, we
organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).
The new challenge revisits the previous CHiME-5 challenge and further
considers the problem of distant multi-microphone conversational speech
diarization and recognition in everyday home environments. The speech material
is the same as the previous CHiME-5 recordings except for accurate array
synchronization. The material was elicited using a dinner-party scenario, with
efforts taken to capture data that is representative of natural conversational
speech. This paper provides a baseline description of the CHiME-6 challenge
for both segmented multispeaker speech recognition (Track 1) and unsegmented
multispeaker speech recognition (Track 2). Of note, Track 2 is the first
challenge activity in the community to tackle an unsegmented multispeaker
speech recognition scenario with a complete set of reproducible open-source
baselines providing speech enhancement, speaker diarization, and speech
recognition modules.