Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
Contrastive Predictive Coding (CPC), which predicts future segments of
speech from past segments, is emerging as a powerful algorithm for
representation learning of speech signals. However, it still underperforms
other methods on unsupervised evaluation benchmarks. Here, we introduce
WavAugment, a time-domain data augmentation library, and find that applying
augmentation to the past is generally more efficient and yields better
performance than other methods. We find that a combination of pitch
modification, additive noise, and reverberation substantially increases the
performance of CPC (a relative improvement of 18-22%), beating the reference
Libri-light results with 600 times less data. Using an out-of-domain dataset,
time-domain data augmentation can push CPC to be on par with the state of the
art on the Zero Speech Benchmark 2017. We also show that time-domain data
augmentation consistently improves downstream limited-supervision phoneme
classification tasks by 12-15% relative.
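The key trick above is that augmentation is applied to the past (context) window while the future (prediction target) stays clean. A minimal numpy sketch of that idea, using additive noise only (this is an illustration, not WavAugment's actual torch-based API; the function name and parameters are hypothetical):

```python
import numpy as np

def augment_past(waveform, split, noise_std=0.01, seed=0):
    """Apply additive-noise augmentation to the 'past' portion of a
    waveform only, leaving the 'future' prediction targets untouched.

    waveform : 1-D float array of samples
    split    : index separating past (context) from future (targets)
    """
    rng = np.random.default_rng(seed)
    out = np.asarray(waveform, dtype=float).copy()
    out[:split] += rng.normal(0.0, noise_std, size=split)  # perturb context only
    return out

# The first `split` samples are perturbed; the rest stay clean.
wave = np.zeros(8)
aug = augment_past(wave, split=4)
```

In the full setup, pitch modification and reverberation would be chained in the same way, each acting only on the context window.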
Unsupervised Contrastive Learning of Sound Event Representations
Self-supervised representation learning can mitigate the limitations of
recognition tasks with scarce manually labeled data but abundant unlabeled
data, a common scenario in sound event research. In this work, we explore
unsupervised contrastive learning as a way to learn sound event
representations. To this end, we propose to use the pretext task of contrasting
differently augmented views of sound events. The views are computed primarily
via mixing of training examples with unrelated backgrounds, followed by other
data augmentations. We analyze the main components of our method via ablation
experiments. We evaluate the learned representations using linear evaluation,
and in two in-domain downstream sound event classification tasks, namely, using
limited manually labeled data, and using noisy labeled data. Our results
suggest that unsupervised contrastive pre-training can mitigate the impact of
data scarcity and increase robustness against noisy labels, outperforming
supervised baselines.
Comment: A 4-page version is submitted to ICASSP 202
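The view-generation step described above (mixing a sound event with unrelated backgrounds) can be sketched as follows; the function and mixing ratio are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def make_views(example, backgrounds, alpha=0.25, seed=0):
    """Create two contrastive 'views' of a sound event by mixing it
    with two different, unrelated background recordings (a simplified
    sketch of mix-back augmentation; alpha sets the background level).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(backgrounds), size=2, replace=False)  # two distinct backgrounds
    return [(1.0 - alpha) * example + alpha * backgrounds[i] for i in idx]

example = np.ones(4)
bgs = [np.zeros(4), np.full(4, 2.0), np.full(4, -1.0)]
view_a, view_b = make_views(example, bgs)
```

A contrastive loss would then treat `view_a` and `view_b` as a positive pair; further augmentations would be applied on top of the mixing.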
End-to-End Speech Translation with Pre-trained Models and Adapters: UPC at IWSLT 2021
This paper describes the submission to the IWSLT 2021 offline speech
translation task by the UPC Machine Translation group. The task consists of
building a system capable of translating English audio recordings extracted
from TED talks into German text. Submitted systems can be either cascade or
end-to-end and use a custom or given segmentation. Our submission is an
end-to-end speech translation system, which combines pre-trained models
(Wav2Vec 2.0 and mBART) with coupling modules between the encoder and decoder,
and uses an efficient fine-tuning technique that trains only 20% of its total
parameters. We show that adding an Adapter to the system and pre-training it
can increase the convergence speed and improve the final result, with which we
achieve a BLEU score of 27.3 on the MuST-C test set. Our final model is an
ensemble that obtains a BLEU score of 28.22 on the same set. Our submission
also uses a custom segmentation algorithm that employs pre-trained Wav2Vec 2.0
to identify periods of untranscribable text and can bring improvements of 2.5
to 3 BLEU points on the IWSLT 2019 test set, compared to the result with the
given segmentation.
Comment: Submitted to IWSLT 2021; changed the title and added submission result
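Adapters of the kind mentioned above are typically small bottleneck layers inserted into a frozen pre-trained model, so that only a small fraction of parameters is trained. A minimal numpy sketch of the general structure (shapes, init, and class name are illustrative assumptions, not the authors' exact configuration):

```python
import numpy as np

class Adapter:
    """A minimal bottleneck adapter: down-project, ReLU, up-project,
    plus a residual connection. Only these small matrices would be
    trained while the surrounding pre-trained model stays frozen.
    """
    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.w_up = rng.normal(0.0, 0.02, (bottleneck, dim))

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        return x + h @ self.w_up              # residual connection

adapter = Adapter(dim=16, bottleneck=4)
x = np.ones((2, 16))
y = adapter(x)
```

Because the bottleneck is narrow, the adapter adds only `2 * dim * bottleneck` parameters per layer, which is how such systems keep the trainable fraction small.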
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
We introduce VoxPopuli, a large-scale multilingual corpus providing 100K
hours of unlabelled speech data in 23 languages. It is the largest open dataset to
date for unsupervised representation learning as well as semi-supervised
learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16
languages and their aligned oral interpretations into 5 other languages
totaling 5.1K hours. We provide speech recognition baselines and validate the
versatility of VoxPopuli unlabelled data in semi-supervised learning under
challenging out-of-domain settings. We will release the corpus at
https://github.com/facebookresearch/voxpopuli under an open license.
Comment: Accepted to ACL 2021 (long paper)
Speech Separation based on Contrastive Learning and Deep Modularization
Current state-of-the-art tools for monaural speech separation rely on
supervised learning. This means that they must deal with the permutation
problem, and they are impacted by any mismatch between the number of speakers
used in training and in inference. Moreover, their performance heavily relies
on the presence of high-quality labelled data. These problems can be
effectively addressed by employing a fully unsupervised technique for speech
separation. In this paper, we use contrastive learning to establish
representations of frames, then use the learned representations in a
downstream deep modularization task. Concretely, we demonstrate experimentally
that in speech separation, different frames of a speaker can be viewed as
augmentations of a given hidden standard frame of that speaker. The frames of
a speaker contain enough prosodic information overlap, which is key in speech
separation. Based on this, we implement self-supervised learning to minimize
the distance between frames belonging to a given speaker. The learned
representations are used in a downstream deep modularization task to cluster
frames based on speaker identity. Evaluation of the developed technique on
WSJ0-2mix and WSJ0-3mix shows that it attains an SI-SNRi and SDRi of 20.8 and
21.0, respectively, on WSJ0-2mix, and an SI-SNRi and SDRi of 20.7 and 20.7,
respectively, on WSJ0-3mix. Its greatest strength is that its performance does
not degrade significantly as the number of speakers increases.
Comment: arXiv admin note: substantial text overlap with arXiv:2212.0036
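The contrastive objective described above, pulling together frames of the same speaker and pushing apart frames of different speakers, can be sketched with a generic InfoNCE-style loss over cosine similarities (a textbook formulation under stated assumptions, not the paper's exact loss or parameters):

```python
import numpy as np

def infonce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor frame: the anchor
    should score higher against a positive frame (same speaker) than
    against negative frames (other speakers). Cosine similarity scaled
    by a temperature is used as the score.
    """
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    scores = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[0])                     # low when the positive wins

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])                  # same-speaker frame
negatives = [np.array([0.0, 1.0]), np.array([-1.0, 0.2])]
loss = infonce_loss(anchor, positive, negatives)
```

Minimizing such a loss over many frame pairs yields representations in which frames cluster by speaker, which is what the downstream deep modularization step then exploits.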