Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
Contrastive Predictive Coding (CPC), which predicts future segments of
speech from past segments, is emerging as a powerful algorithm for
representation learning of speech signals. However, it still underperforms
other methods on unsupervised evaluation benchmarks. Here, we introduce
WavAugment, a time-domain data augmentation library, and find that applying
augmentation to the past is generally more efficient and yields better
performance than other methods. We find that a combination of pitch
modification, additive noise, and reverberation substantially increases the
performance of CPC (a relative improvement of 18-22%), beating the reference
Libri-light results with 600 times less data. Using an out-of-domain dataset,
time-domain data augmentation can push CPC to be on par with the state of the
art on the Zero Speech Benchmark 2017. We also show that time-domain data
augmentation consistently improves downstream limited-supervision phoneme
classification tasks by 12-15% relative.
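The key trick above is that augmentation is applied to the past (context) window while the future (prediction target) stays clean. A minimal numpy sketch of that idea, using additive noise only (this is an illustration, not WavAugment's actual torch-based API; the function name and parameters are hypothetical):

```python
import numpy as np

def augment_past(waveform, split, noise_std=0.01, seed=0):
    """Apply additive-noise augmentation to the 'past' portion of a
    waveform only, leaving the 'future' prediction targets untouched.

    waveform : 1-D float array of samples
    split    : index separating past (context) from future (targets)
    """
    rng = np.random.default_rng(seed)
    out = np.asarray(waveform, dtype=float).copy()
    out[:split] += rng.normal(0.0, noise_std, size=split)  # perturb context only
    return out

# The first `split` samples are perturbed; the rest stay clean.
wave = np.zeros(8)
aug = augment_past(wave, split=4)
```

In the full setup, pitch modification and reverberation would be chained in the same way, each acting only on the context window.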
Unsupervised Contrastive Learning of Sound Event Representations
Self-supervised representation learning can mitigate the limitations of
recognition tasks with scarce manually labeled data but abundant unlabeled
data, a common scenario in sound event research. In this work, we explore
unsupervised contrastive learning as a way to learn sound event
representations. To this end, we propose to use the pretext task of contrasting
differently augmented views of sound events. The views are computed primarily
via mixing of training examples with unrelated backgrounds, followed by other
data augmentations. We analyze the main components of our method via ablation
experiments. We evaluate the learned representations using linear evaluation,
and in two in-domain downstream sound event classification tasks, namely, using
limited manually labeled data, and using noisy labeled data. Our results
suggest that unsupervised contrastive pre-training can mitigate the impact of
data scarcity and increase robustness against noisy labels, outperforming
supervised baselines.
Comment: A 4-page version is submitted to ICASSP 202
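The view-generation step described above (mixing a sound event with unrelated backgrounds) can be sketched as follows; the function and mixing ratio are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def make_views(example, backgrounds, alpha=0.25, seed=0):
    """Create two contrastive 'views' of a sound event by mixing it
    with two different, unrelated background recordings (a simplified
    sketch of mix-back augmentation; alpha sets the background level).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(backgrounds), size=2, replace=False)  # two distinct backgrounds
    return [(1.0 - alpha) * example + alpha * backgrounds[i] for i in idx]

example = np.ones(4)
bgs = [np.zeros(4), np.full(4, 2.0), np.full(4, -1.0)]
view_a, view_b = make_views(example, bgs)
```

A contrastive loss would then treat `view_a` and `view_b` as a positive pair; further augmentations would be applied on top of the mixing.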
End-to-End Speech Translation with Pre-trained Models and Adapters: UPC at IWSLT 2021
This paper describes the submission to the IWSLT 2021 offline speech
translation task by the UPC Machine Translation group. The task consists of
building a system capable of translating English audio recordings extracted
from TED talks into German text. Submitted systems can be either cascade or
end-to-end and use a custom or given segmentation. Our submission is an
end-to-end speech translation system, which combines pre-trained models
(Wav2Vec 2.0 and mBART) with coupling modules between the encoder and decoder,
and uses an efficient fine-tuning technique that trains only 20% of its total
parameters. We show that adding an Adapter to the system and pre-training it
can increase the convergence speed and improve the final result, with which we
achieve a BLEU score of 27.3 on the MuST-C test set. Our final model is an
ensemble that obtains a BLEU score of 28.22 on the same set. Our submission
also uses a custom segmentation algorithm that employs pre-trained Wav2Vec 2.0
to identify periods of untranscribable text and can bring improvements of 2.5
to 3 BLEU points on the IWSLT 2019 test set, compared to the result with the
given segmentation.
Comment: Submitted to IWSLT 2021; changed the title and added submission result
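Adapters of the kind mentioned above are typically small bottleneck layers inserted into a frozen pre-trained model, so that only a small fraction of parameters is trained. A minimal numpy sketch of the general structure (shapes, init, and class name are illustrative assumptions, not the authors' exact configuration):

```python
import numpy as np

class Adapter:
    """A minimal bottleneck adapter: down-project, ReLU, up-project,
    plus a residual connection. Only these small matrices would be
    trained while the surrounding pre-trained model stays frozen.
    """
    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.w_up = rng.normal(0.0, 0.02, (bottleneck, dim))

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        return x + h @ self.w_up              # residual connection

adapter = Adapter(dim=16, bottleneck=4)
x = np.ones((2, 16))
y = adapter(x)
```

Because the bottleneck is narrow, the adapter adds only `2 * dim * bottleneck` parameters per layer, which is how such systems keep the trainable fraction small.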
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
We introduce VoxPopuli, a large-scale multilingual corpus providing 100K
hours of unlabelled speech data in 23 languages. It is the largest open dataset to
date for unsupervised representation learning as well as semi-supervised
learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16
languages and their aligned oral interpretations into 5 other languages
totaling 5.1K hours. We provide speech recognition baselines and validate the
versatility of VoxPopuli unlabelled data in semi-supervised learning under
challenging out-of-domain settings. We will release the corpus at
https://github.com/facebookresearch/voxpopuli under an open license.
Comment: Accepted to ACL 2021 (long paper)
Speech Separation based on Contrastive Learning and Deep Modularization
Current state-of-the-art tools for monaural speech separation rely on
supervised learning. This means that they must deal with the permutation
problem, and they are impacted by any mismatch between the number of speakers
used in training and in inference. Moreover, their performance heavily relies
on the presence of high-quality labelled data. These problems can be
effectively addressed by employing a fully unsupervised technique for speech
separation. In this paper, we use contrastive learning to establish
representations of frames, then use the learned representations in a
downstream deep modularization task. Concretely, we demonstrate experimentally
that in speech separation, different frames of a speaker can be viewed as
augmentations of a given hidden standard frame of that speaker. The frames of
a speaker contain enough prosodic information overlap, which is key in speech
separation. Based on this, we implement self-supervised learning to minimize
the distance between frames belonging to a given speaker. The learned
representations are used in a downstream deep modularization task to cluster
frames based on speaker identity. Evaluation of the developed technique on
WSJ0-2mix and WSJ0-3mix shows that it attains an SI-SNRi and SDRi of 20.8 and
21.0, respectively, on WSJ0-2mix, and an SI-SNRi and SDRi of 20.7 and 20.7,
respectively, on WSJ0-3mix. Its greatest strength is that its performance does
not degrade significantly as the number of speakers increases.
Comment: arXiv admin note: substantial text overlap with arXiv:2212.0036
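The contrastive objective described above, pulling together frames of the same speaker and pushing apart frames of different speakers, can be sketched with a generic InfoNCE-style loss over cosine similarities (a textbook formulation under stated assumptions, not the paper's exact loss or parameters):

```python
import numpy as np

def infonce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor frame: the anchor
    should score higher against a positive frame (same speaker) than
    against negative frames (other speakers). Cosine similarity scaled
    by a temperature is used as the score.
    """
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    scores = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[0])                     # low when the positive wins

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])                  # same-speaker frame
negatives = [np.array([0.0, 1.0]), np.array([-1.0, 0.2])]
loss = infonce_loss(anchor, positive, negatives)
```

Minimizing such a loss over many frame pairs yields representations in which frames cluster by speaker, which is what the downstream deep modularization step then exploits.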