Reduce, Reuse, Recycle: Is Perturbed Data Better than Other Language Augmentation for Low Resource Self-Supervised Speech Models
Self-supervised representation learning (SSRL) has improved performance on downstream phoneme recognition compared to supervised models. Training SSRL
models requires a large amount of pre-training data, and this poses a challenge
for low resource languages. A common approach is transferring knowledge from
other languages. Instead, we propose to use audio augmentation to pre-train
SSRL models in a low resource condition and evaluate phoneme recognition as the
downstream task. We performed a systematic comparison of augmentation
techniques, namely: pitch variation, noise addition, accented target-language
speech, and other-language speech. We found that combined augmentation (noise/pitch) was the best strategy, outperforming knowledge transfer from accented or other-language speech. We compared performance across various quantities and types of
pre-training data. We examined the scaling factor of augmented data needed to achieve performance equivalent to models pre-trained with target-domain speech. Our
findings suggest that for resource constrained languages, in-domain synthetic
augmentation can outperform knowledge transfer from accented or other language
speech.
Comment: 5 pages, 4 figures, ICASSP2
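As an illustration of the combined noise/pitch augmentation strategy this abstract describes, a minimal sketch (not the authors' code) could expand a small in-domain corpus with torchaudio before SSRL pre-training. The sample rate, pitch range, SNR, and file path below are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch: combined pitch variation and noise addition on raw audio,
# used to enlarge a low-resource corpus before self-supervised pre-training.
import torch
import torchaudio

SAMPLE_RATE = 16000  # assumed sample rate

def augment(waveform: torch.Tensor, snr_db: float = 10.0, semitones: int = 2) -> torch.Tensor:
    """Apply pitch variation followed by additive noise at a target SNR."""
    # Pitch variation: shift by a fixed number of semitones.
    pitch_shift = torchaudio.transforms.PitchShift(SAMPLE_RATE, n_steps=semitones)
    shifted = pitch_shift(waveform)

    # Noise addition: scale white noise to the requested signal-to-noise ratio.
    noise = torch.randn_like(shifted)
    signal_power = shifted.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return shifted + scale * noise

# Usage: augment one utterance (hypothetical file path).
wav, sr = torchaudio.load("utt.wav")
wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
augmented = augment(wav)
```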
Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora
Self-supervised speech models have grown fast during the past few years and
have proven feasible for use in various downstream tasks. Some recent work has
started to look at the characteristics of these models, yet many concerns have
not been fully addressed. In this work, we conduct a study on emotional corpora
to explore a popular self-supervised model -- wav2vec 2.0. Via a set of quantitative analyses, we mainly demonstrate that: 1) wav2vec 2.0 appears to
discard paralinguistic information that is less useful for word recognition
purposes; 2) for emotion recognition, representations from the middle layer
alone perform as well as those derived from layer averaging, while the final
layer results in the worst performance in some cases; 3) current
self-supervised models may not be the optimal solution for downstream tasks
that make use of non-lexical features. Our work provides novel findings that will aid future research in this area and a theoretical basis for the use of existing models.
Comment: Accepted to SLT 202
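To make the layer-wise comparison concrete, a minimal probing sketch (assuming the Hugging Face Wav2Vec2Model checkpoint "facebook/wav2vec2-base", which is not necessarily the paper's exact setup) contrasts a middle-layer representation with an average over all layers as utterance-level emotion features.

```python
# Hedged sketch: extract hidden states from wav2vec 2.0 and build two candidate
# feature sets -- a single middle layer vs. an average over all layers.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(1, 16000)  # placeholder: one second of 16 kHz audio
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

hidden = torch.stack(out.hidden_states)   # (num_layers + 1, batch, time, dim)
middle_layer = hidden[len(hidden) // 2]   # representation from a single middle layer
layer_average = hidden.mean(dim=0)        # average over all layers

# Pool over time to get utterance-level features for a downstream emotion classifier.
feat_middle = middle_layer.mean(dim=1)
feat_average = layer_average.mean(dim=1)
```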
MelHuBERT: A simplified HuBERT on Mel spectrograms
Self-supervised models have had great success in learning speech
representations that can generalize to various downstream tasks. However, most
self-supervised models require a large amount of compute and multiple GPUs to
train, significantly hampering the development of self-supervised learning. In
an attempt to reduce the computation of training, we revisit the training of
HuBERT, a highly successful self-supervised model. We improve and simplify
several key components, including the loss function, input representation, and
training in multiple stages. Our model, MelHuBERT, is able to achieve favorable
performance on phone recognition, speaker identification, and automatic speech
recognition against HuBERT, while saving 31.2% of the pre-training time, or equivalently 33.5% of the MACs per one second of speech. The code and pre-trained models are available at https://github.com/nervjack2/MelHuBERT.
Comment: ASRU 202
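A minimal sketch of the simplified input representation (an assumption based on the abstract, not the repository's code) replaces the raw-waveform front end with a log-Mel spectrogram computed by torchaudio; the exact Mel configuration below is illustrative.

```python
# Hedged sketch: log-Mel spectrogram features as the Transformer input,
# in place of a learned CNN feature extractor over raw waveform.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # 10 ms frame shift
    n_mels=80,
)

waveform, sr = torchaudio.load("utt.wav")    # hypothetical file path
features = torch.log(mel(waveform) + 1e-6)   # (channels, n_mels, frames)
# `features` would be fed to the Transformer encoder instead of raw audio.
```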
DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
In this paper, we introduce self-distillation and online clustering for
self-supervised speech representation learning (DinoSR), which combines masked
language modeling, self-distillation, and online clustering. We show that these
concepts complement each other and result in a strong representation learning
model for speech. DinoSR first extracts contextualized embeddings from the
input audio with a teacher network, then runs an online clustering system on
the embeddings to yield a machine-discovered phone inventory, and finally uses
the discretized tokens to guide a student network. We show that DinoSR
surpasses previous state-of-the-art performance in several downstream tasks,
and provide a detailed analysis of the model and the learned discrete units.
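A toy sketch of the teacher/online-clustering/student loop described above (dimensions, update rate, and the linear stand-in networks are illustrative assumptions; in the paper the teacher is an EMA copy of the student and the inputs are masked speech frames):

```python
# Hedged sketch: teacher embeddings are assigned to online-clustering centroids,
# and the student is trained to predict the resulting discrete unit ids.
import torch
import torch.nn.functional as F

dim, n_units = 256, 100
teacher = torch.nn.Linear(dim, dim)       # stand-in for the teacher encoder
student = torch.nn.Linear(dim, n_units)   # stand-in student head over discrete units
codebook = torch.randn(n_units, dim)      # online-clustering centroids

def step(frames: torch.Tensor, ema: float = 0.9) -> torch.Tensor:
    with torch.no_grad():
        emb = teacher(frames)                                # contextualized embeddings
        assign = torch.cdist(emb, codebook).argmin(dim=-1)   # nearest centroid = unit id
        for k in assign.unique():                            # online centroid update
            codebook[k] = ema * codebook[k] + (1 - ema) * emb[assign == k].mean(dim=0)
    logits = student(frames)                                  # student sees (masked) frames
    return F.cross_entropy(logits, assign)                    # predict the discovered units

loss = step(torch.randn(32, dim))
loss.backward()
```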
SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
How to boost speech pre-training with textual data is an unsolved problem because speech and text are very different modalities with distinct
characteristics. In this paper, we propose a cross-modal Speech and Language
Model (SpeechLM) to explicitly align speech and text pre-training with a
pre-defined unified discrete representation. Specifically, we introduce two
alternative discrete tokenizers to bridge the speech and text modalities,
including phoneme-unit and hidden-unit tokenizers, which can be trained using a
small amount of paired speech-text data. Based on the trained tokenizers, we
convert the unlabeled speech and text data into tokens of phoneme units or
hidden units. The pre-training objective is designed to unify the speech and
the text into the same discrete semantic space with a unified Transformer
network. Leveraging only 10K text sentences, our SpeechLM achieves a 16% relative
WER reduction over the best base model performance (from 6.8 to 5.7) on the
public LibriSpeech ASR benchmark. Moreover, SpeechLM with fewer parameters even
outperforms previous SOTA models on CoVoST-2 speech translation tasks. We also
evaluate our SpeechLM on various spoken language processing tasks under the
universal representation evaluation framework SUPERB, demonstrating significant
improvements on content-related tasks. Our code and models are available at
https://aka.ms/SpeechLM.
Comment: 14 page
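A minimal sketch of the cross-modal idea (not the paper's tokenizers): speech and text are both mapped into one shared discrete vocabulary so a single Transformer can consume either modality. The phoneme inventory, frame rate, and the random speech tokenizer below are illustrative stubs standing in for the paper's phoneme-unit and hidden-unit tokenizers.

```python
# Hedged sketch: a shared discrete unit space bridging speech and text,
# with one embedding table and one Transformer serving both modalities.
import torch
import torch.nn as nn

PHONEMES = ["<pad>", "AH", "B", "K", "S", "T"]        # toy shared unit inventory
unit_to_id = {u: i for i, u in enumerate(PHONEMES)}

def tokenize_text(phoneme_seq):
    """Text side: a lexicon/G2P step would produce phoneme units (assumed given here)."""
    return torch.tensor([unit_to_id[p] for p in phoneme_seq])

def tokenize_speech(waveform: torch.Tensor):
    """Speech side: stand-in for a learned tokenizer mapping frames to the same units."""
    n_frames = waveform.shape[-1] // 320               # roughly 20 ms frames at 16 kHz
    return torch.randint(1, len(PHONEMES), (n_frames,))  # placeholder unit ids

embed = nn.Embedding(len(PHONEMES), 64)
shared_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)

speech_tokens = tokenize_speech(torch.randn(16000))
text_tokens = tokenize_text(["K", "AH", "T"])
# Both modalities pass through the same embedding table and Transformer.
speech_repr = shared_model(embed(speech_tokens).unsqueeze(0))
text_repr = shared_model(embed(text_tokens).unsqueeze(0))
```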
Speech Separation based on Contrastive Learning and Deep Modularization
The current state-of-the-art monaural tools for speech separation rely on supervised learning. This means they must deal with the permutation problem and are affected by mismatches between the number of speakers used in training and inference. Moreover, their performance relies heavily on the availability of high-quality labelled data. These problems can be effectively addressed by
employing a fully unsupervised technique for speech separation. In this paper,
we use contrastive learning to establish frame representations and then use the learned representations in the downstream deep modularization task.
Concretely, we demonstrate experimentally that in speech separation, different
frames of a speaker can be viewed as augmentations of a given hidden standard
frame of that speaker. The frames of a speaker contain enough overlapping prosodic information, which is key in speech separation. Based on this, we implement self-supervised learning to minimize the distance between frames belonging to a given speaker. The learned representations are used in a
downstream deep modularization task to cluster frames based on speaker
identity. Evaluation of the developed technique shows that it attains SI-SNRi and SDRi of 20.8 and 21.0, respectively, on WSJ0-2mix, and SI-SNRi and SDRi of 20.7 and 20.7, respectively, on WSJ0-3mix. Its greatest strength is that as the number of speakers increases, its performance does not degrade significantly.
Comment: arXiv admin note: substantial text overlap with arXiv:2212.0036
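A minimal sketch of the contrastive objective described above (encoder, feature sizes, and batch construction are illustrative assumptions, not the paper's implementation): two frames from the same speaker form a positive pair, frames from other speakers act as negatives, and an InfoNCE-style loss pulls same-speaker frames together before the downstream clustering of frames by speaker identity.

```python
# Hedged sketch: InfoNCE-style contrastive learning over speech frames,
# where anchor_frames[i] and positive_frames[i] come from the same speaker.
import torch
import torch.nn.functional as F

encoder = torch.nn.Linear(128, 64)  # stand-in for the frame encoder

def info_nce(anchor_frames: torch.Tensor, positive_frames: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    a = F.normalize(encoder(anchor_frames), dim=-1)
    p = F.normalize(encoder(positive_frames), dim=-1)
    logits = a @ p.t() / temperature        # similarity of every anchor to every positive
    targets = torch.arange(a.shape[0])      # the matching index is the true positive
    return F.cross_entropy(logits, targets)

# The learned frame embeddings would then be clustered by speaker in the
# downstream deep modularization step.
loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
loss.backward()
```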