16 research outputs found
A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition
Building a good speech recognition system usually requires large amounts of
transcribed data, which is expensive to collect. To tackle this problem, many
unsupervised pre-training methods have been proposed. Among these methods,
Masked Predictive Coding achieved significant improvements on various speech
recognition datasets with a BERT-like masked reconstruction loss and a Transformer backbone. However, many aspects of MPC have not been fully investigated. In
this paper, we conduct a further study on MPC and focus on three important
aspects: the effect of pre-training data speaking style, its extension to streaming models, and how to better transfer learned knowledge from the pre-training stage to downstream tasks. Experiments revealed that pre-training data with a matching speaking style is more useful for downstream recognition tasks. A
unified training objective with APC and MPC provided 8.46% relative error
reduction on a streaming model trained on HKUST. Also, the combination of target data adaptation and layer-wise discriminative training helped the knowledge transfer of MPC, achieving a 3.99% relative error reduction on AISHELL over a strong baseline.
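As a rough, hypothetical sketch of the Masked Predictive Coding idea summarized above (not the authors' implementation), the snippet below masks random FBANK frames and trains a Transformer encoder to reconstruct them with an L1 loss; the feature dimension, model sizes and masking ratio are all assumptions.

    import torch
    import torch.nn as nn

    class MPCModel(nn.Module):
        """Transformer encoder that reconstructs masked input frames (schematic)."""
        def __init__(self, feat_dim=80, d_model=256, n_layers=4, n_heads=4):
            super().__init__()
            self.proj_in = nn.Linear(feat_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.proj_out = nn.Linear(d_model, feat_dim)

        def forward(self, x):
            return self.proj_out(self.encoder(self.proj_in(x)))

    def mpc_loss(model, feats, mask_ratio=0.15):
        """BERT-like masked reconstruction: zero out frames, predict the originals."""
        mask = torch.rand(feats.shape[:2]) < mask_ratio          # (batch, time)
        corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)   # masked frames set to zero
        pred = model(corrupted)
        return nn.functional.l1_loss(pred[mask], feats[mask])    # loss only on masked positions

    model = MPCModel()
    feats = torch.randn(8, 200, 80)   # random stand-in for a batch of FBANK features
    mpc_loss(model, feats).backward()
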
Improved Speech Representations with Multi-Target Autoregressive Predictive Coding
Training objectives based on predictive coding have recently been shown to be
very effective at learning meaningful representations from unlabeled speech.
One example is Autoregressive Predictive Coding (Chung et al., 2019), which
trains an autoregressive RNN to generate an unseen future frame given a context
such as recent past frames. The basic hypothesis of these approaches is that
hidden states that can accurately predict future frames are a useful
representation for many downstream tasks. In this paper we extend this
hypothesis and aim to enrich the information encoded in the hidden states by
training the model to make more accurate future predictions. We propose an
auxiliary objective that serves as a regularization to improve generalization
of the future frame prediction task. Experimental results on phonetic
classification, speech recognition, and speech translation not only support the
hypothesis, but also demonstrate the effectiveness of our approach in learning
representations that contain richer phonetic content.
Comment: Accepted to ACL 2020.
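A minimal sketch of the basic Autoregressive Predictive Coding objective that this paper extends: an RNN reads past frames and predicts the frame n steps ahead with an L1 loss. The auxiliary multi-target objective proposed here is omitted, and the layer sizes and prediction shift are assumptions.

    import torch
    import torch.nn as nn

    class APC(nn.Module):
        """Unidirectional RNN that predicts a future frame from past context (schematic)."""
        def __init__(self, feat_dim=80, hidden=512, layers=3):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden, num_layers=layers, batch_first=True)
            self.head = nn.Linear(hidden, feat_dim)

        def forward(self, x):
            h, _ = self.rnn(x)
            return self.head(h), h            # prediction and hidden states (the representation)

    def apc_loss(model, feats, n=3):
        """Hidden state at time t predicts the frame at time t + n."""
        pred, _ = model(feats[:, :-n])        # context up to t
        target = feats[:, n:]                 # frames n steps ahead
        return nn.functional.l1_loss(pred, target)

    model = APC()
    feats = torch.randn(4, 300, 80)           # random stand-in for log-Mel frames
    apc_loss(model, feats).backward()
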
Bi-APC: Bidirectional Autoregressive Predictive Coding for Unsupervised Pre-training and Its Application to Children's ASR
We present a bidirectional unsupervised model pre-training (UPT) method and
apply it to children's automatic speech recognition (ASR). An obstacle to
improving child ASR is the scarcity of child speech databases. A common
approach to alleviate this problem is model pre-training using data from adult
speech. Pre-training can be done using supervised (SPT) or unsupervised
methods, depending on the availability of annotations. Typically, SPT performs
better. In this paper, we focus on UPT to address the situations when
pre-training data are unlabeled. Autoregressive predictive coding (APC), a UPT
method, predicts frames from only one direction, limiting its use to
uni-directional pre-training. Conventional bidirectional UPT methods, however,
predict only a small portion of frames. To extend the benefits of APC to
bi-directional pre-training, Bi-APC is proposed. We then use adaptation
techniques to transfer knowledge learned from adult speech (using the
Librispeech corpus) to child speech (OGI Kids corpus). LSTM-based hybrid
systems are investigated. For the uni-LSTM structure, APC obtains similar WER
improvements to SPT over the baseline. When applied to BLSTM, however, APC is
not as competitive as SPT, but our proposed Bi-APC has comparable improvements
to SPT.
Comment: Accepted to ICASSP 2021.
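A sketch of the bidirectional extension in spirit only (the paper's actual Bi-APC architecture may differ): one LSTM runs forward to predict future frames and another runs over the time-reversed signal to predict past frames, and the two APC-style losses are summed. All dimensions are assumptions.

    import torch
    import torch.nn as nn

    class BiAPC(nn.Module):
        """Schematic bidirectional predictive coding with two unidirectional LSTMs."""
        def __init__(self, feat_dim=80, hidden=512):
            super().__init__()
            self.fwd = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.bwd = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.fwd_head = nn.Linear(hidden, feat_dim)
            self.bwd_head = nn.Linear(hidden, feat_dim)

        def loss(self, feats, n=3):
            # forward direction: state at t predicts frame t + n
            hf, _ = self.fwd(feats[:, :-n])
            loss_f = nn.functional.l1_loss(self.fwd_head(hf), feats[:, n:])
            # backward direction: run on the reversed sequence so state at t predicts frame t - n
            rev = torch.flip(feats, dims=[1])
            hb, _ = self.bwd(rev[:, :-n])
            loss_b = nn.functional.l1_loss(self.bwd_head(hb), rev[:, n:])
            return loss_f + loss_b

    model = BiAPC()
    model.loss(torch.randn(2, 200, 80)).backward()
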
Incremental Learning for End-to-End Automatic Speech Recognition
We propose a new incremental learning method for end-to-end Automatic Speech Recognition (ASR) to extend the model's capacity on a new task while retaining the performance on previous ones. The proposed method is effective without access to the old dataset, addressing the issues of high retraining cost and unavailability of the old data. To achieve this, both attention distillation and
knowledge distillation are applied to preserve the ability of the old model
during the progressive learning. With an ASR model pre-trained on 12,000h
Mandarin speech, we test our proposed method on a 300h new-scenario task and a 1h new named-entities task. Experiments show that our method yields 3.25% and 0.88% absolute Character Error Rate (CER) reduction on the new scenario when compared with the pre-trained model and the full-data retraining baseline, respectively. It even yields a surprising 0.37% absolute CER reduction on the new scenario compared with fine-tuning. For the new named-entities task, our method significantly improves the accuracy compared with the pre-trained model, i.e. a 16.95% absolute CER reduction. For both new task adaptations, the new models still maintain the same accuracy as the retraining baseline on the old tasks.
Comment: 5 pages, 3 figures.
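The distillation part of this recipe can be illustrated with a generic knowledge-distillation loss: a task loss on the new data plus a temperature-scaled KL term that keeps the new model close to the frozen old model's output distribution. The weighting and temperature below are assumptions, and the attention-distillation term is omitted.

    import torch
    import torch.nn.functional as F

    def incremental_loss(new_logits, old_logits, targets, alpha=0.5, T=2.0):
        """Task loss on new data plus distillation toward the frozen old model (schematic)."""
        task = F.cross_entropy(new_logits, targets)
        # soften both distributions with temperature T and match them with KL divergence
        distill = F.kl_div(
            F.log_softmax(new_logits / T, dim=-1),
            F.softmax(old_logits.detach() / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        return (1 - alpha) * task + alpha * distill

    # toy example: 10-class outputs for a batch of 4
    new_logits = torch.randn(4, 10, requires_grad=True)
    old_logits = torch.randn(4, 10)       # produced by the frozen pre-trained model
    targets = torch.randint(0, 10, (4,))
    incremental_loss(new_logits, old_logits, targets).backward()
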
Exploring wav2vec 2.0 on speaker verification and language identification
Wav2vec 2.0 is a recently proposed self-supervised framework for speech
representation learning. It follows a two-stage training process of
pre-training and fine-tuning, and performs well in speech recognition tasks, especially in ultra-low-resource cases. In this work, we attempt to extend the self-supervised framework to speaker verification and language identification. First, we use preliminary experiments to show that wav2vec 2.0 can capture information about the speaker and language. Then we demonstrate the
effectiveness of wav2vec 2.0 on the two tasks respectively. For speaker
verification, we obtain a new state-of-the-art result, Equal Error Rate (EER)
of 3.61% on the VoxCeleb1 dataset. For language identification, we obtain an
EER of 12.02% on 1 second condition and an EER of 3.47% on full-length
condition of the AP17-OLR dataset. Finally, we utilize one model to achieve unified modeling of the two tasks via multi-task learning.
Comment: Self-supervised, speaker verification, language identification, multi-task learning, wav2vec 2.0.
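A hedged sketch of how pre-trained wav2vec 2.0 features might be reused for an utterance-level task such as speaker or language classification (not the authors' exact recipe): the frozen encoder outputs are mean-pooled and fed to a linear head. It assumes the HuggingFace transformers package; the checkpoint name and classifier head are illustrative choices.

    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class Wav2VecClassifier(nn.Module):
        """Frozen wav2vec 2.0 encoder + mean pooling + linear head (schematic)."""
        def __init__(self, n_classes, ckpt="facebook/wav2vec2-base"):
            super().__init__()
            self.encoder = Wav2Vec2Model.from_pretrained(ckpt)   # downloads the checkpoint
            self.encoder.requires_grad_(False)                   # keep the encoder frozen
            self.head = nn.Linear(self.encoder.config.hidden_size, n_classes)

        def forward(self, waveform):                             # (batch, samples) at 16 kHz
            hidden = self.encoder(waveform).last_hidden_state    # (batch, frames, hidden)
            return self.head(hidden.mean(dim=1))                 # utterance-level logits

    model = Wav2VecClassifier(n_classes=10)                      # e.g. 10 languages or speakers
    logits = model(torch.randn(2, 16000))                        # two 1-second dummy waveforms
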
Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition
End-to-end models have achieved impressive results on the task of automatic
speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demands of end-to-end models. Self-supervised acoustic pre-training has already shown impressive ASR performance, but the available transcriptions are still inadequate for language modeling in end-to-end models.
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a
pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused
model only needs to learn the transfer from speech to language during
fine-tuning on limited labeled data. The lengths of the two modalities are matched by a monotonic attention mechanism without additional parameters.
In addition, a fully connected layer is introduced for the hidden mapping between
modalities. We further propose a scheduled fine-tuning strategy to preserve and
utilize the text context modeling ability of the pre-trained linguistic
encoder. Experiments show that our approach effectively utilizes the pre-trained modules. Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
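A very rough illustration of the length-matching idea, not the paper's monotonic attention: parameter-free dot-product attention pools acoustic frames to the text length, and a fully connected layer maps between the two hidden sizes. All dimensions are assumptions.

    import torch
    import torch.nn as nn

    def pool_acoustic_to_text(acoustic, text, proj):
        """Each text position pools acoustic frames via dot-product attention (schematic).
        acoustic: (batch, frames, d_a), text: (batch, tokens, d_t), proj: Linear(d_a -> d_t)."""
        mapped = proj(acoustic)                               # hidden mapping between modalities
        scores = torch.matmul(text, mapped.transpose(1, 2))   # (batch, tokens, frames)
        weights = scores.softmax(dim=-1)
        return torch.matmul(weights, mapped)                  # (batch, tokens, d_t)

    proj = nn.Linear(512, 768)           # acoustic hidden size -> linguistic hidden size (assumed)
    acoustic = torch.randn(2, 300, 512)  # e.g. wav2vec 2.0 frame outputs
    text = torch.randn(2, 20, 768)       # e.g. BERT token embeddings
    fused = pool_acoustic_to_text(acoustic, text, proj)
    print(fused.shape)                   # torch.Size([2, 20, 768])
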
Similarity Analysis of Self-Supervised Speech Representations
Self-supervised speech representation learning has recently become a flourishing
research topic. Many algorithms have been proposed for learning useful
representations from large-scale unlabeled data, and their applications to a
wide range of speech tasks have also been investigated. However, there has been
little research focusing on understanding the properties of existing
approaches. In this work, we aim to provide a comparative study of some of the
most representative self-supervised algorithms. Specifically, we quantify the
similarities between different self-supervised representations using existing
similarity measures. We also design probing tasks to study the correlation
between the models' pre-training loss and the amount of specific speech
information contained in their learned representations. In addition to showing
how various self-supervised models behave differently given the same input, our
study also finds that the training objective has a higher impact on
representation similarity than architectural choices such as building blocks
(RNN/Transformer/CNN) and directionality (uni/bidirectional). Our results also
suggest that there exists a strong correlation between pre-training loss and
downstream performance for some self-supervised algorithms.
Comment: Accepted to ICASSP 2021. Supplementary materials available at https://github.com/iamyuanchung/ICASSP21-Similarity-Supplementar
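One representation-similarity measure commonly used in this kind of comparative study is linear CKA; the sketch below (an illustrative choice, not necessarily the exact measure used in the paper) computes it for two sets of frame-level features.

    import numpy as np

    def linear_cka(X, Y):
        """Linear Centered Kernel Alignment between representation matrices
        X: (n_samples, d1) and Y: (n_samples, d2). Returns a value in [0, 1]."""
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        hsic = np.linalg.norm(X.T @ Y, "fro") ** 2        # ||X^T Y||_F^2
        return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

    # toy example: features from two hypothetical models on the same 1000 frames
    rng = np.random.default_rng(0)
    feats_a = rng.standard_normal((1000, 256))
    feats_b = feats_a @ rng.standard_normal((256, 128))   # a linear transform of feats_a
    print(round(linear_cka(feats_a, feats_b), 3))         # close to 1: highly similar
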
DiDiSpeech: A Large Scale Mandarin Speech Corpus
This paper introduces a new open-sourced Mandarin speech corpus, called
DiDiSpeech. It consists of about 800 hours of speech data at a 48 kHz sampling rate from 6000 speakers, along with the corresponding texts. All speech data in the corpus was recorded in a quiet environment and is suitable for various speech
processing tasks, such as voice conversion, multi-speaker text-to-speech and
automatic speech recognition. We conduct experiments with multiple speech tasks
and evaluate the performance, showing that the corpus is promising for both academic research and practical applications. The corpus is available at https://outreach.didichuxing.com/research/opendata/.
Comment: 5 pages, 2 figures, 11 tables.
Embodied Self-supervised Learning by Coordinated Sampling and Training
Self-supervised learning can significantly improve the performance of downstream tasks; however, the dimensions of the learned representations normally lack explicit physical meaning. In this work, we propose a novel
self-supervised approach to solve inverse problems by employing the
corresponding physical forward process so that the learned representations can
have explicit physical meanings. The proposed approach works in an
analysis-by-synthesis manner to learn an inference network by iteratively
sampling and training. At the sampling step, given observed data, the inference
network is used to approximate the intractable posterior, from which we sample
input parameters and feed them to a physical process to generate data in the
observational space. At the training step, the same network is optimized with
the sampled paired data. We prove the feasibility of the proposed method by
tackling the acoustic-to-articulatory inversion problem to infer articulatory
information from speech. Given an articulatory synthesizer, an inference model
can be trained completely from scratch with random initialization. Our
experiments demonstrate that the proposed method can converge steadily and the
network learns to control the articulatory synthesizer to speak like a human.
We also demonstrate that trained models can generalize well to unseen speakers
or even new languages, and performance can be further improved through
self-adaptation.
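A schematic of the coordinated sampling-and-training loop described above, with a toy differentiable function standing in for the articulatory synthesizer; the Gaussian posterior parameterization, network sizes and training schedule are all assumptions.

    import torch
    import torch.nn as nn

    def synthesizer(params):
        """Toy stand-in for the physical forward process (articulatory synthesizer)."""
        return torch.sin(params) + 0.5 * params

    # inference network predicts the mean and log-std of the approximate posterior
    inference = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 2 * 16))
    opt = torch.optim.Adam(inference.parameters(), lr=1e-3)
    observed = synthesizer(torch.randn(32, 16))            # observations to invert

    for step in range(100):
        # sampling step: approximate the posterior, sample parameters, synthesize data
        with torch.no_grad():
            mean, log_std = inference(observed).chunk(2, dim=-1)
            sampled = mean + log_std.exp() * torch.randn_like(mean)
            synthetic = synthesizer(sampled)
        # training step: fit the same network on the sampled (data, parameter) pairs
        pred_mean, _ = inference(synthetic).chunk(2, dim=-1)
        loss = nn.functional.mse_loss(pred_mean, sampled)
        opt.zero_grad()
        loss.backward()
        opt.step()
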
Unsupervised Cross-lingual Representation Learning for Speech Recognition
This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0, which is trained by solving a contrastive
task over masked latent speech representations and jointly learns a
quantization of the latents shared across languages. The resulting model is
fine-tuned on labeled data and experiments show that cross-lingual pretraining
significantly outperforms monolingual pretraining. On the CommonVoice
benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared
to the best known results. On BABEL, our approach improves word error rate by
16% relative compared to a comparable system. Our approach enables a single
multilingual speech recognition model which is competitive with strong individual
models. Analysis shows that the latent discrete speech representations are
shared across languages with increased sharing for related languages. We hope
to catalyze research in low-resource speech understanding by releasing XLSR-53,
a large model pretrained in 53 languages.
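To make the contrastive task over masked latents concrete, here is a simplified InfoNCE-style sketch of the kind of objective wav2vec 2.0 and XLSR build on: each masked position's context vector must identify its true quantized latent among distractors. The quantizer, masking and distractor sampling are heavily simplified assumptions.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, targets, temperature=0.1):
        """InfoNCE-style loss over masked positions (schematic).
        context: (n_masked, d) encoder outputs at masked positions;
        targets: (n_masked, d) matching quantized latents; other rows act as distractors."""
        context = F.normalize(context, dim=-1)
        targets = F.normalize(targets, dim=-1)
        logits = context @ targets.t() / temperature   # cosine similarity to every candidate
        labels = torch.arange(context.size(0))         # the matching latent is the positive
        return F.cross_entropy(logits, labels)

    # toy example: 64 masked positions with 256-dim vectors
    context = torch.randn(64, 256, requires_grad=True)
    targets = torch.randn(64, 256)
    contrastive_loss(context, targets).backward()
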