Transformer based unsupervised pre-training for acoustic representation learning
Recently, a variety of acoustic tasks and related applications have arisen. For many acoustic tasks, the amount of labeled data may be limited. To handle this problem, we propose an unsupervised pre-training method using a Transformer-based encoder to learn a general and robust high-level representation for all acoustic tasks. Experiments have been conducted on three kinds of acoustic tasks: speech emotion recognition, sound event detection, and speech translation. All the experiments show that pre-training on each task's own training data can significantly improve performance. With a larger pre-training corpus combining the MuST-C, Librispeech, and ESC-US datasets, the unweighted average recall (UAR) for speech emotion recognition further improves by 4.3% absolute on the IEMOCAP dataset. For sound event detection, the F1 score further improves by 1.5% absolute on the DCASE2018 Task 5 development set and 2.1% on the evaluation set. For speech translation, the BLEU score further improves by 12.2% relative on the En-De dataset and 8.4% on the En-Fr dataset.
Comment: Accepted by ICASSP 202
Effectiveness of self-supervised pre-training for speech recognition
We compare self-supervised representation learning algorithms which either
explicitly quantize the audio data or learn representations without
quantization. We find the former to be more accurate since it builds a good
vocabulary of the data through vq-wav2vec [1] to enable learning of effective
representations in subsequent BERT training. Different to previous work, we
directly fine-tune the pre-trained BERT models on transcribed speech using a
Connectionist Temporal Classification (CTC) loss instead of feeding the
representations into a task-specific model. We also propose a BERT-style model
learning directly from the continuous audio data and compare pre-training on
raw audio to spectral features. Fine-tuning a BERT model on 10 hour of labeled
Librispeech data with a vq-wav2vec vocabulary is almost as good as the best
known reported system trained on 100 hours of labeled data on testclean, while
achieving a 25% WER reduction on test-other. When using only 10 minutes of
labeled data, WER is 25.2 on test-other and 16.3 on test-clean. This
demonstrates that self-supervision can enable speech recognition systems
trained on a near-zero amount of transcribed data
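The key departure here is fine-tuning the pre-trained model directly with a CTC loss rather than feeding its representations into a separate task-specific model. A minimal sketch of that fine-tuning step; the encoder below is a stand-in for a pre-trained model, and the vocabulary size and lengths are placeholders.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained encoder (e.g., a BERT-style model over
# quantized audio); here just a frame-level feature extractor.
encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
head = nn.Linear(256, 32)          # project to characters + CTC blank (id 0)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 200, 80)    # (batch, time, feature) acoustic input
targets = torch.randint(1, 32, (4, 20))            # character transcripts
input_lens = torch.full((4,), 200, dtype=torch.long)
target_lens = torch.full((4,), 20, dtype=torch.long)

hidden, _ = encoder(feats)
log_probs = head(hidden).log_softmax(-1)           # (batch, time, vocab)
# nn.CTCLoss expects (time, batch, vocab)
loss = ctc(log_probs.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()                    # fine-tunes encoder and head jointly
```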
A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition
Building a good speech recognition system usually requires large amounts of
transcribed data, which is expensive to collect. To tackle this problem, many
unsupervised pre-training methods have been proposed. Among these methods, Masked Predictive Coding (MPC) achieved significant improvements on various speech recognition datasets with a BERT-like masked-reconstruction loss and a Transformer backbone. However, many aspects of MPC have not been fully investigated. In this paper, we conduct a further study of MPC and focus on three important aspects: the effect of the speaking style of the pre-training data, its extension to streaming models, and how to better transfer learned knowledge from the pre-training stage to downstream tasks. Experiments revealed that pre-training data whose speaking style matches the downstream task is more useful for recognition. A unified training objective combining autoregressive predictive coding (APC) and MPC provided an 8.46% relative error reduction for a streaming model trained on HKUST. Also, the combination of target-data adaptation and layer-wise discriminative training helped the knowledge transfer of MPC, achieving a 3.99% relative error reduction on AISHELL over a strong baseline.
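The abstract mentions a unified objective combining APC (predicting a frame a few steps ahead) with MPC (reconstructing masked frames) but does not give its exact form. The sketch below simply interpolates the two losses over a shared encoder; the shift, mask ratio, weighting, and the unidirectional LSTM are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=80, hidden_size=256, batch_first=True)
apc_head = nn.Linear(256, 80)   # predicts a future frame (APC)
mpc_head = nn.Linear(256, 80)   # reconstructs masked frames (MPC)

def unified_loss(frames, shift=3, mask_ratio=0.15, alpha=0.5):
    # frames: (batch, time, n_mels)
    # APC: encode the clean input and predict `shift` frames ahead.
    h_clean, _ = encoder(frames)
    apc = (apc_head(h_clean[:, :-shift]) - frames[:, shift:]).abs().mean()
    # MPC: encode a masked copy and reconstruct the masked positions.
    mask = torch.rand(frames.shape[:2]) < mask_ratio
    h_masked, _ = encoder(frames.masked_fill(mask.unsqueeze(-1), 0.0))
    mpc = (mpc_head(h_masked) - frames).abs()[mask].mean()
    return alpha * apc + (1 - alpha) * mpc

loss = unified_loss(torch.randn(8, 200, 80))
loss.backward()
```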
Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation
For self-supervised speech processing, it is crucial to use pretrained models as speech representation extractors. Recent works have increased model size during acoustic model training in order to achieve better performance. In this paper, we propose Audio ALBERT, a lite version of the self-supervised speech representation model. We use the representations for two downstream tasks: speaker identification and phoneme classification. We show that Audio ALBERT achieves performance competitive with those huge models on the downstream tasks while using 91% fewer parameters. Moreover, we use simple probing models to measure how much speaker and phoneme information is encoded in the latent representations. In these probing experiments, we find that the intermediate latent representations encode richer phoneme and speaker information than the last layer does.
Comment: Accepted by IEEE Spoken Language Technology Workshop 202
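ALBERT's main parameter saving comes from sharing one set of Transformer-layer weights across all depths, and the same idea carries over to a speech encoder. A minimal sketch of that cross-layer sharing; the sizes are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """ALBERT-style encoder: one Transformer layer applied N times."""
    def __init__(self, d_model=768, n_heads=12, depth=12):
        super().__init__()
        # A single layer's parameters are reused at every depth,
        # cutting the parameter count roughly by a factor of `depth`.
        self.shared = nn.TransformerEncoderLayer(d_model, n_heads,
                                                 batch_first=True)
        self.depth = depth

    def forward(self, x, return_all_layers=False):
        hiddens = []
        for _ in range(self.depth):
            x = self.shared(x)
            hiddens.append(x)
        # Each application yields a distinct hidden state, so intermediate
        # layers can still be probed for speaker/phoneme information.
        return hiddens if return_all_layers else x

model = SharedLayerEncoder()
out = model(torch.randn(2, 100, 768))
```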
wav2vec: Unsupervised Pre-training for Speech Recognition
We explore unsupervised pre-training for speech recognition by learning
representations of raw audio. wav2vec is trained on large amounts of unlabeled
audio data and the resulting representations are then used to improve acoustic
model training. We pre-train a simple multi-layer convolutional neural network
optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce the WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data are available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
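The noise contrastive objective amounts to a binary classification: given a context vector, score the true future latent against distractors sampled from elsewhere in the sequence, using a logistic loss. A simplified single-step sketch (the real model predicts multiple steps ahead with deeper convolutional networks; the layer shapes and negative sampling here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_enc = nn.Conv1d(1, 512, kernel_size=10, stride=5)    # raw audio -> z
ctx_enc = nn.Conv1d(512, 512, kernel_size=3, padding=2)   # z -> context c
step_proj = nn.Linear(512, 512)                           # step-specific head

wav = torch.randn(4, 1, 16000)             # a batch of raw waveforms
z = feat_enc(wav)                          # (batch, 512, T) latents
c = ctx_enc(z)[..., :z.shape[-1]]          # context, trimmed to length T
z, c = z.transpose(1, 2), c.transpose(1, 2)

pred = step_proj(c[:, :-1])                # predict one step ahead
pos = (pred * z[:, 1:]).sum(-1)            # score of the true future latent
# Negatives: latents drawn from random other time steps
neg_idx = torch.randint(0, z.shape[1] - 1, (z.shape[1] - 1,))
neg = (pred * z[:, neg_idx]).sum(-1)       # score of distractor latents
loss = -(F.logsigmoid(pos).mean() + F.logsigmoid(-neg).mean())
loss.backward()
```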
Improving Unsupervised Sparsespeech Acoustic Models with Categorical Reparameterization
The Sparsespeech model is an unsupervised acoustic model that can generate
discrete pseudo-labels for untranscribed speech. We extend the Sparsespeech
model to allow for sampling over a random discrete variable, yielding
pseudo-posteriorgrams. The degree of sparsity in this posteriorgram can be
fully controlled after the model has been trained. We use the Gumbel-Softmax trick to approximately sample from a discrete distribution within the neural network, which allows us to train the network efficiently with standard backpropagation. The new and improved model is trained and evaluated on the Libri-Light corpus, a benchmark for ASR with limited or no supervision. The model is trained on 600h and 6000h of English read speech. We evaluate the improved model using the ABX error measure and in a semi-supervised setting with 10h of transcribed speech. We observe a relative improvement of up to 31.4% in ABX error rates across speakers on the test set with the improved Sparsespeech model trained on 600h of speech data, and further improvements when we scale the model to 6000h.
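The Gumbel-Softmax trick draws approximately one-hot samples from a categorical distribution while remaining differentiable, so a discrete unit-selection step can be trained with standard backpropagation. PyTorch ships this as F.gumbel_softmax; a minimal sketch, where the inventory size, projection, and temperature are illustrative rather than the paper's settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_units = 42                       # size of the pseudo-label inventory
to_logits = nn.Linear(256, n_units)

hidden = torch.randn(4, 200, 256)  # frame-level features from some encoder
logits = to_logits(hidden)

# tau controls how sharp (sparse) the sampled posteriorgram is:
# tau -> 0 approaches hard one-hot samples, large tau approaches uniform.
soft_posteriorgram = F.gumbel_softmax(logits, tau=0.5, hard=False, dim=-1)

# hard=True uses a straight-through estimator: a discrete one-hot sample in
# the forward pass, with gradients flowing through the soft sample.
pseudo_labels = F.gumbel_softmax(logits, tau=0.5, hard=True, dim=-1)
print(pseudo_labels.argmax(-1)[0, :10])   # discrete pseudo-labels per frame
```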
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Self-supervised approaches for speech representation learning are challenged
by three unique problems: (1) there are multiple sound units in each input
utterance, (2) there is no lexicon of input sound units during the pre-training
phase, and (3) sound units have variable lengths with no explicit segmentation.
To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT)
approach for self-supervised speech representation learning, which utilizes an
offline clustering step to provide aligned target labels for a BERT-like
prediction loss. A key ingredient of our approach is applying the prediction
loss over the masked regions only, which forces the model to learn a combined
acoustic and language model over the continuous inputs. HuBERT relies primarily
on the consistency of the unsupervised clustering step rather than the
intrinsic quality of the assigned cluster labels. Starting with a simple
k-means teacher of 100 clusters, and using two iterations of clustering, the
HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0
performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with
10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model,
HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
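The core mechanism sketches cleanly: cluster features offline with k-means to obtain per-frame unit ids, mask part of the input, and apply a cross-entropy prediction loss only where the input was masked. A minimal PyTorch rendering under those assumptions (the real model masks contiguous spans over a convolutional feature extractor; the encoder and shapes here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_clusters = 100                       # k-means teacher with 100 clusters
encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
unit_head = nn.Linear(256, n_clusters)
mask_embed = nn.Parameter(torch.zeros(80))   # learned "masked frame" vector

frames = torch.randn(8, 200, 80)             # unlabeled acoustic frames
# Offline step (precomputed in practice): k-means cluster id per frame.
kmeans_targets = torch.randint(0, n_clusters, (8, 200))

mask = torch.rand(8, 200) < 0.15             # positions to mask
corrupted = torch.where(mask.unsqueeze(-1), mask_embed, frames)
hidden, _ = encoder(corrupted)
logits = unit_head(hidden)                   # (batch, time, n_clusters)

# Prediction loss over the masked regions only: the model must infer the
# hidden units from surrounding context, acting as a combined acoustic
# and language model over continuous input.
loss = F.cross_entropy(logits[mask], kmeans_targets[mask])
loss.backward()
```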
Masked Pre-trained Encoder Based on Joint CTC-Transformer
This study (the work was accomplished during an internship at Tencent AI Lab) addresses semi-supervised acoustic modeling, i.e., attaining high-level representations from unsupervised audio data and fine-tuning the parameters of the pre-trained model with supervised data. The proposed approach adopts a two-stage training framework, consisting of a masked pre-trained encoder (MPE) and a Joint CTC-Transformer (JCT). In the MPE stage, part of the input frames are masked and reconstructed at the encoder output using massive amounts of unsupervised data. In the JCT stage, compared with the original Transformer, acoustic features are used as input instead of plain text. A CTC loss serves as the prediction target on top of the encoder, while the decoder blocks remain unchanged. This paper presents a comparison between the two-stage training method and fully supervised JCT. In addition, it investigates our approach's robustness against different volumes of training data. Experiments show that the two-stage training method delivers much better performance than the fully supervised model. Two-stage training that exploits only 30% of the WSJ labeled data achieves a 17% lower word error rate (WER) than a fully supervised model trained on 50% of WSJ. Moreover, increasing the unlabeled data for MPE from WSJ (81h) to Librispeech (960h) attains about a 22% WER reduction.
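The JCT stage pairs a CTC loss on the encoder with the usual cross-entropy on the Transformer decoder; a common way to combine them is a weighted sum of the two. A minimal sketch assuming that formulation (the interpolation weight, model sizes, and the omission of shifted teacher-forcing targets are simplifications, not the paper's exact recipe):

```python
import torch
import torch.nn as nn

d_model, vocab = 256, 100
in_proj = nn.Linear(80, d_model)            # acoustic features, not text
model = nn.Transformer(d_model, nhead=4, num_encoder_layers=4,
                       num_decoder_layers=2, batch_first=True)
ctc_head = nn.Linear(d_model, vocab)        # CTC target on the encoder
dec_head = nn.Linear(d_model, vocab)
tok_embed = nn.Embedding(vocab, d_model)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 200, 80)             # (batch, time, feature)
tokens = torch.randint(1, vocab, (4, 20))   # transcripts

memory = model.encoder(in_proj(feats))      # shared encoder
# (teacher forcing with shifted targets and a causal mask omitted for brevity)
dec_out = model.decoder(tok_embed(tokens), memory)

ctc_logp = ctc_head(memory).log_softmax(-1).transpose(0, 1)
loss_ctc = ctc(ctc_logp, tokens,
               torch.full((4,), 200, dtype=torch.long),
               torch.full((4,), 20, dtype=torch.long))
loss_att = nn.functional.cross_entropy(dec_head(dec_out).reshape(-1, vocab),
                                       tokens.reshape(-1))
lam = 0.3                                   # interpolation weight (assumed)
loss = lam * loss_ctc + (1 - lam) * loss_att
loss.backward()
```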
NAUTILUS: a Versatile Voice Cloning System
We introduce a novel speech synthesis system, called NAUTILUS, that can
generate speech with a target voice either from a text input or a reference
utterance of an arbitrary source speaker. By using a multi-speaker speech
corpus to train all requisite encoders and decoders in the initial training
stage, our system can clone unseen voices using untranscribed speech of target
speakers on the basis of the backpropagation algorithm. Moreover, depending on the data available for the target speaker, the cloning strategy can be adjusted to take advantage of additional data, modifying the behavior of the text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the situation. We test the performance of the proposed framework using deep convolution layers to model the encoders, decoders, and a WaveNet vocoder. Evaluations show that it achieves quality comparable to that of state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech. Moreover, we demonstrate that the proposed framework can switch between TTS and VC with high speaker consistency, which will be useful for many applications.
Comment: Submitted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing
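The abstract implies a design in which text and speech encoders map into a shared linguistic latent space consumed by a single speech decoder, so the system can run as TTS (text in) or VC (speech in) and adapt to an unseen voice by backpropagating on untranscribed speech. A toy sketch of that wiring; every module, size, and the speaker-embedding adaptation below is an illustrative assumption, not the paper's architecture:

```python
import torch
import torch.nn as nn

d_latent, n_mels = 128, 80
text_encoder = nn.Embedding(50, d_latent)            # phonemes -> latent
speech_encoder = nn.Conv1d(n_mels, d_latent, 3, padding=1)
speaker_embed = nn.Parameter(torch.randn(1, 1, 32))  # adapted per speaker
decoder = nn.GRU(d_latent + 32, n_mels, batch_first=True)

def synthesize(latent):
    # One decoder serves both modes, conditioned on the speaker embedding.
    spk = speaker_embed.expand(latent.shape[0], latent.shape[1], -1)
    mel, _ = decoder(torch.cat([latent, spk], dim=-1))
    return mel

tts_out = synthesize(text_encoder(torch.randint(0, 50, (1, 30))))   # TTS
src_mel = torch.randn(1, n_mels, 200)
vc_out = synthesize(speech_encoder(src_mel).transpose(1, 2))        # VC

# Voice cloning from untranscribed speech: fit the speaker embedding (and
# optionally the decoder) by reconstructing the target speaker's audio.
target_mel = torch.randn(1, 200, n_mels)
loss = (synthesize(speech_encoder(target_mel.transpose(1, 2))
                   .transpose(1, 2)) - target_mel).abs().mean()
loss.backward()
```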
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
We introduce a self-supervised speech pre-training method called TERA, which
stands for Transformer Encoder Representations from Alteration. Recent
approaches often learn through the formulation of a single auxiliary task like
contrastive prediction, autoregressive prediction, or masked reconstruction.
Unlike previous approaches, we use a multi-target auxiliary task to pre-train
Transformer Encoders on a large amount of unlabeled speech. The model learns
through the reconstruction of acoustic frames from its altered counterpart,
where we use a stochastic policy to alter along three dimensions: temporal,
channel, and magnitude. TERA can be used to extract speech representations or be fine-tuned together with downstream models. We evaluate TERA on several downstream tasks,
including phoneme classification, speaker recognition, and speech recognition.
TERA achieved strong performance on these tasks by improving upon surface
features and outperforming previous methods. In our experiments, we show that
through alteration along different dimensions, the model learns to encode
distinct aspects of speech. We explore different knowledge transfer methods to
incorporate the pre-trained model with downstream models. Furthermore, we show
that the proposed method can be easily transferred to another dataset not used in pre-training.
Comment: Submitted to IEEE/ACM TASLP, under review
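The multi-target objective corrupts the input along three axes (time, channel, magnitude) and trains the encoder to reconstruct the clean frames. A minimal sketch of such a stochastic alteration policy; the span widths, probabilities, and noise scale are illustrative assumptions, not TERA's published settings:

```python
import torch

def alter(frames, time_p=0.15, span=7, chan_p=0.1, noise_p=0.1):
    """Stochastically alter a (batch, time, channel) batch of features
    along the temporal, channel, and magnitude axes."""
    x = frames.clone()
    B, T, C = x.shape
    # Temporal: zero out random contiguous spans of frames.
    for b in range(B):
        for start in torch.nonzero(torch.rand(T) < time_p / span)[:, 0]:
            x[b, start:start + span] = 0.0
    # Channel: zero out a random block of frequency channels.
    if torch.rand(()) < chan_p:
        c0 = torch.randint(0, C - 8, ())
        x[:, :, c0:c0 + 8] = 0.0
    # Magnitude: add Gaussian noise to a random subset of utterances.
    noisy = torch.rand(B) < noise_p
    x[noisy] += 0.1 * torch.randn_like(x[noisy])
    return x

clean = torch.randn(4, 200, 80)
corrupted = alter(clean)
# Pre-training target: reconstruct `clean` from `corrupted` with an
# encoder + prediction head, e.g. an L1 loss on the altered positions.
```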