Cycle-consistency training for end-to-end speech recognition
This paper presents a method to train end-to-end automatic speech recognition
(ASR) models using unpaired data. Although the end-to-end approach can
eliminate the need for expert knowledge such as pronunciation dictionaries to
build ASR systems, it still requires a large amount of paired data, i.e.,
speech utterances and their transcriptions. Cycle-consistency losses have been
recently proposed as a way to mitigate the problem of limited paired data.
These approaches compose a reverse operation with a given transformation, e.g.,
text-to-speech (TTS) with ASR, to build a loss that only requires unsupervised
data, speech in this example. Applying cycle consistency to ASR models is not
trivial since fundamental information, such as speaker traits, is lost in the
intermediate text bottleneck. To solve this problem, this work presents a loss
that is based on the speech encoder state sequence instead of the raw speech
signal. This is achieved by training a Text-To-Encoder model and defining a
loss based on the encoder reconstruction error. Experimental results on the
LibriSpeech corpus show that the proposed cycle-consistency training reduced
the word error rate by 14.7% from an initial model trained with 100-hour paired
data, using an additional 360 hours of audio data without transcriptions. We
also investigate the use of text-only data mainly for language modeling to
further improve the performance in the unpaired data training scenario.
Comment: Submitted to ICASSP'1
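To make the encoder-level cycle concrete, here is a minimal PyTorch-style sketch of the idea; the GRU encoder, the toy frame-wise decoder, the Text-To-Encoder stand-in, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins: a real system would use attention-based
# seq2seq ASR and a Tacotron-like Text-To-Encoder (TTE) model.
feat_dim, enc_dim, vocab = 80, 256, 1000
asr_encoder = nn.GRU(feat_dim, enc_dim, batch_first=True)
asr_decoder = nn.Linear(enc_dim, vocab)                  # toy frame-wise decoder
tte_model   = nn.GRU(vocab, enc_dim, batch_first=True)   # text -> encoder states

speech = torch.randn(4, 120, feat_dim)                   # unpaired speech only

# 1) ASR: speech -> encoder states -> soft token posteriors (kept differentiable).
enc_states, _ = asr_encoder(speech)
token_post = F.softmax(asr_decoder(enc_states), dim=-1)

# 2) TTE: hypothesized text -> reconstructed encoder state sequence.
recon_states, _ = tte_model(token_post)

# 3) Cycle loss on encoder states instead of the raw waveform, so speaker
#    traits lost in the text bottleneck are never penalized.
cycle_loss = F.mse_loss(recon_states, enc_states.detach())
cycle_loss.backward()
```

The key point of the sketch is that the reconstruction target is the encoder state sequence rather than the speech signal itself.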
Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
Sequence-to-sequence automatic speech recognition (ASR) models require large
quantities of data to attain high performance. For this reason, there has been
a recent surge in interest for unsupervised and semi-supervised training in
such models. This work builds upon recent results showing notable improvements
in semi-supervised training using cycle-consistency and related techniques.
Such techniques derive training procedures and losses able to leverage unpaired
speech and/or text data by combining ASR with Text-to-Speech (TTS) models. In
particular, this work proposes a new semi-supervised loss combining an
end-to-end differentiable ASR→TTS loss with a TTS→ASR
loss. The method is able to leverage both unpaired speech and text data to
outperform recently proposed related techniques in terms of \%WER. We provide
extensive results analyzing the impact of data quantity and speech and text
modalities and show consistent gains across WSJ and Librispeech corpora. Our
code is provided in ESPnet to reproduce the experiments.
Comment: INTERSPEECH 201
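Sketched below is one way the two cycle directions could be combined into a single semi-supervised objective; the callables and the weighting factor are placeholders for illustration, not the ESPnet recipe referenced above.

```python
# Combined objective over unpaired speech X and unpaired text Y (illustrative).
# asr(), tts() and the two per-cycle losses are placeholders for real models.
def semi_supervised_loss(speech_batch, text_batch, asr, tts,
                         speech_cycle_loss, text_cycle_loss, alpha=0.5):
    # ASR -> TTS cycle: speech is recognized, re-synthesized, and compared to
    # the original speech (end-to-end differentiable in the paper).
    l_speech = speech_cycle_loss(tts(asr(speech_batch)), speech_batch)
    # TTS -> ASR cycle: text is synthesized, re-recognized, and compared to
    # the original text.
    l_text = text_cycle_loss(asr(tts(text_batch)), text_batch)
    return alpha * l_speech + (1.0 - alpha) * l_text
```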
From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings
Producing a large amount of annotated speech data for training ASR systems
remains difficult for the more than 95% of the world's languages that are
low-resourced. However, we note that human babies start to learn a language from
the sounds (or phonetic structures) of a small number of exemplar words and
"generalize" such knowledge to other words without hearing a large amount of
data. We initiate some preliminary work in this direction. Audio Word2Vec is
used to learn the phonetic structures from spoken words (signal segments),
while another autoencoder is used to learn the phonetic structures from text
words. The relationship between the two can be learned jointly, or separately
after both models are well trained. This relationship can then be used for
speech recognition with very low resources. In initial experiments on the TIMIT
dataset, only 2.1 hours of speech data (in which 2,500 spoken words were
annotated and the rest left unlabeled) gave a word error rate of 44.6%, which
was reduced to 34.2% when 4.1 hours of speech data (in which 20,000 spoken words
were annotated) were given. These results are not yet satisfactory, but they are
a good starting point.
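A rough sketch of the joint-learning variant described above, assuming a toy audio autoencoder and text autoencoder whose latent codes are tied together on a few annotated (spoken word, text word) pairs; every module, feature, and dimension here is an illustrative stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lat = 64
audio_enc, audio_dec = nn.Linear(80, lat), nn.Linear(lat, 80)   # Audio Word2Vec stand-in
text_enc,  text_dec  = nn.Linear(30, lat), nn.Linear(lat, 30)   # text autoencoder stand-in

audio = torch.randn(16, 80)        # pooled features of spoken words (toy)
text  = torch.randn(16, 30)        # character-bag encodings of words (toy)

za, zt = audio_enc(audio), text_enc(text)
recon = F.mse_loss(audio_dec(za), audio) + F.mse_loss(text_dec(zt), text)

# A few annotated pairs pull the two phonetic embedding spaces together;
# unannotated examples contribute only the reconstruction terms.
annotated = torch.arange(4)        # pretend the first 4 items are labeled pairs
align = F.mse_loss(za[annotated], zt[annotated])

loss = recon + align
loss.backward()
```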
Augmented Cyclic Adversarial Learning for Low Resource Domain Adaptation
Training a model to perform a task typically requires a large amount of data
from the domains in which the task will be applied. However, it is often the
case that data are abundant in some domains but scarce in others. Domain
adaptation deals with the challenge of adapting a model trained from a
data-rich source domain to perform well in a data-poor target domain. In
general, this requires learning plausible mappings between domains. CycleGAN is
a powerful framework that efficiently learns to map inputs from one domain to
another using adversarial training and a cycle-consistency constraint. However,
the conventional approach of enforcing cycle-consistency via reconstruction may
be overly restrictive in cases where one or more domains have limited training
data. In this paper, we propose an augmented cyclic adversarial learning model
that enforces the cycle-consistency constraint via an external task-specific
model, which encourages the preservation of task-relevant content as opposed to
exact reconstruction. We explore digit classification in low-resource
supervised, semi-supervised, and unsupervised settings, as well as in a
high-resource unsupervised setting. In the low-resource supervised setting, the
results show that our approach improves absolute performance by 14% and 4% when
adapting SVHN to MNIST and vice versa, respectively, outperforming unsupervised
domain adaptation methods that require a high-resource unlabeled target domain.
Moreover, using only a few unlabeled target-domain samples, our approach still
outperforms many high-resource unsupervised models. In the speech domain, we
similarly adopt a speech recognition model from each domain as the
task-specific model. Our approach improves the absolute performance of speech
recognition by 2% for female speakers in the TIMIT dataset, where the majority
of training samples are from male voices.
Comment: 14 pages, 5 figures, 8 tables; Accepted as a conference paper at ICLR 201
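The core substitution can be sketched in a few lines: the cycle is scored by a task-specific classifier on the cycled sample rather than by exact reconstruction. The toy generators, classifier, and shapes below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_classes = 32, 10
g_src2tgt = nn.Linear(dim, dim)          # source -> target generator (toy)
g_tgt2src = nn.Linear(dim, dim)          # target -> source generator (toy)
task_model = nn.Linear(dim, n_classes)   # task-specific model on the source domain

x_src = torch.randn(8, dim)
y_src = torch.randint(0, n_classes, (8,))

cycled = g_tgt2src(g_src2tgt(x_src))

# Conventional CycleGAN constraint (exact reconstruction), shown for contrast.
recon_cycle = F.l1_loss(cycled, x_src)

# Augmented constraint: only task-relevant content must survive the cycle.
task_cycle = F.cross_entropy(task_model(cycled), y_src)

loss = task_cycle
loss.backward()
```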
Unsupervised Feature Learning for Environmental Sound Classification Using Weighted Cycle-Consistent Generative Adversarial Network
In this paper, we propose a novel environmental sound classification approach
incorporating unsupervised feature learning from a codebook built via the
spherical K-Means++ algorithm, together with a new architecture for high-level
data augmentation.
The audio signal is transformed into a 2D representation using a discrete
wavelet transform (DWT). The DWT spectrograms are then augmented by a novel
architecture for cycle-consistent generative adversarial network. This
high-level augmentation bootstraps generated spectrograms in both intra- and
inter-class manners by translating structural features from sample to sample. A
codebook is built by coding the DWT spectrograms with the speeded-up robust
feature detector (SURF) and the K-Means++ algorithm. The Random Forest is our
final learning algorithm which learns the environmental sound classification
task from the clustered codewords in the codebook. Experimental results on four
benchmark environmental sound datasets (ESC-10, ESC-50, UrbanSound8k, and
DCASE-2017) show that the proposed classification approach outperforms
state-of-the-art classifiers, including advanced and dense convolutional neural
networks such as AlexNet and GoogLeNet, improving the classification rate by
between 3.51% and 14.34%, depending on the dataset.
Comment: Paper Accepted for Publication in Elsevier Applied Soft Computin
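A simplified sketch of the codebook and classifier stage: spherical K-Means is approximated here by L2-normalizing descriptors before ordinary k-means++ clustering, and random vectors stand in for SURF descriptors of DWT spectrograms; all parameters are illustrative, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
K = 64                                     # codebook size (illustrative)

# Stand-in for SURF descriptors of DWT spectrograms: one array per clip.
clips = [rng.normal(size=(rng.integers(50, 200), 64)) for _ in range(40)]
labels = rng.integers(0, 10, size=len(clips))

# Spherical k-means approximation: unit-normalize descriptors, then k-means++.
all_desc = normalize(np.vstack(clips))
codebook = KMeans(n_clusters=K, init="k-means++", n_init=10, random_state=0).fit(all_desc)

# Encode each clip as a normalized histogram of codeword assignments.
def encode(desc):
    words = codebook.predict(normalize(desc))
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / hist.sum()

X = np.stack([encode(c) for c in clips])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
print(clf.score(X, labels))
```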
Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition
It is important to transcribe and archive speech data of endangered languages
to preserve heritages of verbal culture, and automatic speech recognition (ASR)
is a powerful tool to facilitate this process. However, since endangered
languages do not generally have large corpora with many speakers, the
performance of ASR models trained on them is generally poor. Nevertheless, we
are often left with many recordings of spontaneous speech that have to be
transcribed. In this work, to mitigate this speaker sparsity problem, we propose
converting the whole training speech dataset to sound like the test speaker, in
order to develop a highly accurate ASR system for this speaker. For this
purpose, we utilize CycleGAN-based non-parallel voice conversion technology to
forge labeled training data that is close to
the test speaker's speech. We evaluated this speaker adaptation approach on two
low-resource corpora, namely, Ainu and Mboshi. We obtained 35-60% relative
improvement in phone error rate on the Ainu corpus, and a 40% relative
improvement was attained on the Mboshi corpus. This approach outperformed two
conventional methods, namely unsupervised adaptation and multilingual training,
on these two corpora.
Comment: Accepted for Interspeech 202
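The adaptation recipe itself is simple enough to outline in pseudocode; the helper functions and data structures below (train_voice_conversion, train_asr, utterances with audio/transcript fields) are hypothetical placeholders, not the authors' toolkit.

```python
# Illustrative pipeline: make every training utterance sound like the test
# speaker before training a speaker-dependent ASR model.
def adapt_and_train(train_set, test_speaker_recordings,
                    train_voice_conversion, train_asr):
    # 1) Train a non-parallel (CycleGAN-style) voice conversion model between
    #    the pooled training speakers and the single test speaker.
    vc = train_voice_conversion(source=[u.audio for u in train_set],
                                target=test_speaker_recordings)
    # 2) Forge a labeled corpus in the test speaker's voice: convert the audio,
    #    keep the original transcripts.
    forged = [(vc(u.audio), u.transcript) for u in train_set]
    # 3) Train the ASR system on the forged corpus.
    return train_asr(forged)
```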
WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks
We propose a learning-based filter that allows us to directly modify a
synthetic speech waveform into a natural speech waveform. Speech-processing
systems using a vocoder framework such as statistical parametric speech
synthesis and voice conversion are convenient, especially when only a limited
amount of data is available, because they make it possible to represent and
process interpretable acoustic features in a compact space, such as the
fundamental frequency (F0) and
mel-cepstrum. However, a well-known problem that leads to the quality
degradation of generated speech is an over-smoothing effect that eliminates
some detailed structure of generated/converted acoustic features. To address
this issue, we propose a synthetic-to-natural speech waveform conversion
technique that uses cycle-consistent adversarial networks and which does not
require any explicit assumption about the speech waveform in adversarial learning.
In contrast to current techniques, since our modification is performed at the
waveform level, we expect that the proposed method will also make it possible
to generate 'vocoder-less'-sounding speech even if the input speech is
synthesized using a vocoder framework. The experimental results demonstrate
that our proposed method can 1) alleviate the over-smoothing effect of the
acoustic features, despite modification being applied directly to the waveform,
and 2) greatly improve the naturalness of the generated speech sounds.
Comment: SLT201
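The underlying objective is the usual CycleGAN combination of an adversarial term and a cycle-consistency term, here sketched at the waveform level with toy modules; the least-squares GAN form and the cycle weight of 10 are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1024
g_syn2nat = nn.Conv1d(1, 1, 9, padding=4)   # synthetic -> natural (toy generator)
g_nat2syn = nn.Conv1d(1, 1, 9, padding=4)   # natural -> synthetic (toy generator)
d_natural = nn.Conv1d(1, 1, 9, padding=4)   # toy discriminator on waveforms

synth = torch.randn(2, 1, T)                # vocoded (synthetic) waveforms

fake_nat = g_syn2nat(synth)

# Least-squares adversarial loss for the generator (assumed variant).
score = d_natural(fake_nat)
adv = F.mse_loss(score, torch.ones_like(score))

# Cycle-consistency: synthetic -> natural -> synthetic should return the input.
cycle = F.l1_loss(g_nat2syn(fake_nat), synth)

g_loss = adv + 10.0 * cycle                 # 10.0 is an illustrative weight
g_loss.backward()
```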
Cycle-Consistent Speech Enhancement
Feature mapping using deep neural networks is an effective approach for
single-channel speech enhancement. Noisy features are transformed to the
enhanced ones through a mapping network and the mean square errors between the
enhanced and clean features are minimized. In this paper, we propose a
cycle-consistent speech enhancement (CSE) in which an additional inverse
mapping network is introduced to reconstruct the noisy features from the
enhanced ones. A cycle-consistent constraint is enforced to minimize the
reconstruction loss. Similarly, a backward cycle of mappings is performed in
the opposite direction with the same networks and losses. With
cycle-consistency, the speech structure is well preserved in the enhanced
features while noise is effectively reduced such that the feature-mapping
network generalizes better to unseen data. In cases where only unpaired noisy
and clean data are available for training, two discriminator networks are used
to distinguish the enhanced and noised features from the clean and noisy ones.
The discrimination losses are jointly optimized with the reconstruction losses
through adversarial multi-task learning. Evaluated on the CHiME-3 dataset, the
proposed CSE achieves 19.60% and 6.69% relative word error rate improvements
when using and when not using parallel clean and noisy speech data,
respectively.
Comment: 5 pages, 2 figures. Interspeech 2018. arXiv admin note: text overlap
with arXiv:1809.0225
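The forward and backward cycles can be summarized in a short sketch; the mapping networks and feature shapes are toy stand-ins, and the adversarial terms used in the unpaired case are indicated only in comments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 40
enhance = nn.Linear(dim, dim)   # noisy -> enhanced mapping (toy)
corrupt = nn.Linear(dim, dim)   # inverse mapping, enhanced -> noisy (toy)

noisy = torch.randn(32, dim)
clean = torch.randn(32, dim)    # paired clean targets when available

# Forward cycle: noisy -> enhanced -> reconstructed noisy.
enhanced = enhance(noisy)
forward_cycle = F.mse_loss(corrupt(enhanced), noisy)

# Backward cycle with the same networks in the opposite direction.
backward_cycle = F.mse_loss(enhance(corrupt(clean)), clean)

# With parallel data, add the usual feature-mapping loss; without it,
# discriminators on enhanced-vs-clean and corrupted-vs-noisy features
# would replace this term (adversarial multi-task learning in the paper).
mapping = F.mse_loss(enhanced, clean)

loss = mapping + forward_cycle + backward_cycle
loss.backward()
```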
A Style Transfer Approach to Source Separation
Training neural networks for source separation involves presenting a mixture
recording at the input of the network and updating network parameters in order
to produce an output that resembles the clean source. Consequently, supervised
source separation depends on the availability of paired mixture-clean training
examples. In this paper, we interpret source separation as a style transfer
problem. We present a variational auto-encoder network that exploits the
commonality across the domain of mixtures and the domain of clean sounds and
learns a shared latent representation across the two domains. Using these
cycle-consistent variational auto-encoders, we learn a mapping from the mixture
domain to the domain of clean sounds and perform source separation without
explicitly supervising with paired training examples.
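A deterministic simplification of the shared-latent idea follows (VAE sampling, KL terms, and the cycle constraints are omitted); all modules and dimensions are illustrative stand-ins.

```python
import torch
import torch.nn as nn

dim, lat = 257, 64                       # toy spectral frame and latent sizes
enc_mix,   dec_mix   = nn.Linear(dim, lat), nn.Linear(lat, dim)
enc_clean, dec_clean = nn.Linear(dim, lat), nn.Linear(lat, dim)

# Shared latent space: a mixture frame is encoded in the mixture domain and
# decoded in the clean domain, giving separation without paired examples.
mixture = torch.randn(8, dim)
estimated_clean = dec_clean(enc_mix(mixture))
```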
WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation
WaveCycleGAN has recently been proposed to bridge the gap between natural and
synthesized speech waveforms in statistical parametric speech synthesis. It
provides fast inference by using a moving-average model rather than an
autoregressive one, and high-quality speech synthesis through adversarial
training. However, the human ear can still distinguish the processed speech
waveforms from natural ones. One possible cause of this distinguishability is
the aliasing observed in the processed speech waveform via down/up-sampling
modules. To solve the aliasing and provide higher quality speech synthesis, we
propose WaveCycleGAN2, which 1) uses generators without down/up-sampling
modules and 2) combines discriminators of the waveform domain and acoustic
parameter domain. The results show that the proposed method 1) alleviates the
aliasing well, 2) is useful for both speech waveforms generated by
analysis-and-synthesis and statistical parametric speech synthesis, and 3)
achieves a mean opinion score comparable to those of natural speech and speech
synthesized by WaveNet (open WaveNet) and WaveGlow while processing speech
samples at a rate of more than 150 kHz on an NVIDIA Tesla P100.
Comment: Submitted to INTERSPEECH201
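The combined-discriminator idea can be sketched as two adversarial terms on the same generator output, one in the waveform domain and one on a spectral representation (a log-magnitude STFT is used below as a stand-in for the paper's acoustic parameters); modules and weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 4096
generator = nn.Conv1d(1, 1, 15, padding=7)    # toy post-filter without down/up-sampling
d_wave = nn.Conv1d(1, 1, 15, padding=7)       # waveform-domain discriminator (toy)
d_acoustic = nn.Linear(513, 1)                # acoustic-parameter-domain discriminator (toy)

synth = torch.randn(2, 1, T)
fake = generator(synth)

# Waveform-domain adversarial term (least-squares form, assumed).
score_w = d_wave(fake)
loss_wave = F.mse_loss(score_w, torch.ones_like(score_w))

# Acoustic-parameter-domain term: score a spectral representation of the
# same output.
spec = torch.stft(fake.squeeze(1), n_fft=1024, hop_length=256,
                  window=torch.hann_window(1024),
                  return_complex=True).abs().transpose(1, 2)   # (batch, frames, 513)
score_a = d_acoustic(torch.log1p(spec))
loss_acoustic = F.mse_loss(score_a, torch.ones_like(score_a))

g_loss = loss_wave + loss_acoustic
g_loss.backward()
```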