645 research outputs found

LibriMix: An Open-Source Dataset for Generalizable Speech Separation

Authors: Cornell Samuele; Cosentino Joris; Deleforge Antoine; Pariente Manuel; Vincent Emmanuel
Publication date: 22/05/2020

In recent years, wsj0-2mix has become the reference dataset for single-channel speech separation. Most deep learning-based speech separation models today are benchmarked on it. However, recent studies have shown important performance drops when models trained on wsj0-2mix are evaluated on other, similar datasets. To address this generalization issue, we created LibriMix, an open-source alternative to wsj0-2mix, and to its noisy extension, WHAM!. Based on LibriSpeech, LibriMix consists of two- or three-speaker mixtures combined with ambient noise samples from WHAM!. Using Conv-TasNet, we achieve competitive performance on all LibriMix versions. In order to fairly evaluate across datasets, we introduce a third test set based on VCTK for speech and WHAM! for noise. Our experiments show that the generalization error is smaller for models trained with LibriMix than with WHAM!, in both clean and noisy conditions. Aiming towards evaluation in more realistic, conversation-like scenarios, we also release a sparsely overlapping version of LibriMix's test set. Comment: submitted to INTERSPEECH 202

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

Authors: Barker Jon; Doddipatla Rama; Zhang Jisi; Zorila Catalin
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 11/11/2020

This paper introduces a new method for multi-channel time domain speech separation in reverberant environments. A fully-convolutional neural network structure has been used to directly separate speech from multiple microphone recordings, with no need for conventional spatial feature extraction. To reduce the influence of reverberation on spatial feature extraction, a dereverberation pre-processing method has been applied to further improve the separation performance. A spatialized version of the wsj0-2mix dataset has been simulated to evaluate the proposed system. Both source separation and speech recognition performance of the separated signals have been evaluated objectively. Experiments show that the proposed fully-convolutional network improves the source separation metric and the word error rate (WER) by more than 13% and 50% relative, respectively, over a reference system with conventional features. Applying dereverberation as pre-processing to the proposed system can further reduce the WER by 29% relative using an acoustic model trained on clean and reverberated data. Comment: Presented at IEEE ICASSP 202

arXiv.org e-Print Archive

Crossref

Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR

Authors: Boeddeker Christoph; Delcroix Marc; Drude Lukas; Haeb-Umbach Reinhold; Kinoshita Keisuke; Nakatani Tomohiro; von Neumann Thilo
Publication venue: International Speech Communication Association
Publication date: 21/12/2020

Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker speech recognizer to form the first end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers. Our experiments show very promising performance in counting accuracy, source separation and speech recognition on simulated clean mixtures from WSJ0-2mix and WSJ0-3mix. Among others, we set a new state-of-the-art word error rate on the WSJ0-2mix database. Furthermore, our system generalizes well to a larger number of speakers than it ever saw during training, as shown in experiments with the WSJ0-4mix database. Comment: 5 pages, INTERSPEECH 202

arXiv.org e-Print Archive

Crossref

LibriMix: An open-source dataset for generalizable speech separation

Authors: Cornell Samuele; Cosentino Joris; Deleforge Antoine; Pariente Manuel; Vincent Emmanuel
Publication venue: HAL CCSD
Publication date: 22/05/2020

In recent years, wsj0-2mix has become the reference dataset for single-channel speech separation. Most deep learning-based speech separation models today are benchmarked on it. However, recent studies have shown important performance drops when models trained on wsj0-2mix are evaluated on other, similar datasets. To address this generalization issue, we created LibriMix, an open-source alternative to wsj0-2mix, and to its noisy extension, WHAM!. Based on LibriSpeech, LibriMix consists of two- or three-speaker mixtures combined with ambient noise samples from WHAM!. Using Conv-TasNet, we achieve competitive performance on all LibriMix versions. In order to fairly evaluate across datasets, we introduce a third test set based on VCTK for speech and WHAM! for noise. Our experiments show that the generalization error is smaller for models trained with LibriMix than with WHAM!, in both clean and noisy conditions. Aiming towards evaluation in more realistic, conversation-like scenarios, we also release a sparsely overlapping version of LibriMix's test set.

INRIA a CCSD electronic archive server

Grounding semantics in robots for Visual Question Answering

Author: Wahle Björn
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2019

In this thesis I describe an operational implementation of an object detection and description system that is incorporated into an end-to-end Visual Question Answering system, and I evaluate it on two visual question answering datasets for compositional language and elementary visual reasoning.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC