3 research outputs found
Batch-normalized joint training for DNN-based distant speech recognition
Improving distant speech recognition is a crucial step towards flexible
human-machine interfaces. Current technology, however, still lacks
robustness, especially in adverse acoustic conditions. Despite the
significant progress made in recent years on both speech enhancement and
speech recognition, one potential limitation of state-of-the-art technology
lies in composing modules that are not well matched because they are not
trained jointly. To address this concern, a promising approach consists of
concatenating a speech enhancement and a speech recognition deep neural network
and jointly updating their parameters as if they formed a single larger
network. Unfortunately, joint training can be difficult because the output
distribution of the speech enhancement system may change substantially during
the optimization procedure. The speech recognition module would have to deal
with an input distribution that is non-stationary and unnormalized. To mitigate
this issue, we propose a joint training approach based on a fully
batch-normalized architecture. Experiments, conducted using different datasets,
tasks and acoustic conditions, revealed that the proposed framework
significantly outperforms competitive alternatives, especially in challenging
environments.
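The core idea of the abstract, normalizing the enhancement network's drifting output distribution before it reaches the recognizer, can be illustrated with a minimal batch-normalization step. This is a sketch, not the paper's code; the module names and shapes are illustrative assumptions:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch dimension, then scale and shift.
    During joint training this keeps the recognizer's input distribution
    stationary even as the enhancement front-end's outputs drift."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Hypothetical enhancement output whose statistics shift during optimization:
rng = np.random.default_rng(0)
enhanced = 5.0 + 3.0 * rng.standard_normal((32, 40))  # 32 frames, 40 features

# Batch norm hands the ASR module a zero-mean, unit-variance input:
normalized = batch_norm(enhanced)
```

In a fully batch-normalized architecture, a step like this would sit after every layer of both networks, so gradients flowing through the concatenated model see stable activation statistics throughout.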
Adversarial Joint Training with Self-Attention Mechanism for Robust End-to-End Speech Recognition
Lately, the self-attention mechanism has marked a new milestone in the field
of automatic speech recognition (ASR). Nevertheless, its performance is
vulnerable to environmental noise, since the system predicts each output
symbol conditioned on the full input sequence and the previous predictions.
Inspired by the extensive applications of the generative adversarial networks
(GANs) in speech enhancement and ASR tasks, we propose an adversarial joint
training framework with the self-attention mechanism to boost the noise
robustness of the ASR system. Generally, it consists of a self-attention speech
enhancement GAN and a self-attention end-to-end ASR model. Two highlights
of this framework are worth noting. One is that it benefits from the
advances of both the self-attention mechanism and GANs, while
the other is that the discriminator of GAN plays the role of the global
discriminant network in the stage of the adversarial joint training, which
guides the enhancement front-end to capture more compatible structures for the
subsequent ASR module and thereby offsets the limitation of the separate
training and handcrafted loss functions. With the adversarial joint
optimization, the proposed framework is expected to learn more robust
representations suitable for the ASR task. We conduct systematic experiments on
the AISHELL-1 corpus, and the results show that on the artificial
noisy test set, the proposed framework achieves relative improvements of
66% over the ASR model trained solely on clean data, 35.1% over the
speech enhancement & ASR scheme without joint training, and 5.3% over
multi-condition training.
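The adversarial joint objective described above can be sketched as an ASR loss plus a weighted generator-style adversarial term from the discriminator. The function name, the non-saturating `-log D(x)` form, and the weight `lam` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def joint_loss(asr_loss, disc_score_enhanced, lam=0.01):
    """Combined objective for adversarial joint training.

    asr_loss: loss from the back-end ASR model (e.g. CE/CTC) on enhanced input.
    disc_score_enhanced: discriminator's probability that the enhanced
        features look clean; pushing this up steers the front-end toward
        structures compatible with the ASR module.
    lam: weight balancing the adversarial term against the ASR loss.
    """
    adv_loss = -np.log(disc_score_enhanced + 1e-12)  # non-saturating GAN loss
    return asr_loss + lam * adv_loss

# Toy values: an ASR loss of 2.5 and a discriminator score of 0.8
total = joint_loss(asr_loss=2.5, disc_score_enhanced=0.8)
```

Because the ASR gradient dominates, the discriminator acts as the "global discriminant network" the abstract mentions: it nudges the enhancement front-end rather than replacing a handcrafted enhancement loss outright.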
Joint Discriminative Front End and Back End Training for Improved Speech Recognition Accuracy
This paper presents a general discriminative training method for both the front end feature extractor and back end acoustic model of an automatic speech recognition system. The front end and back end parameters are jointly trained using the Rprop algorithm against a maximum mutual information (MMI) objective function. Results are presented on the Aurora 2 noisy English digit recognition task. It is shown that discriminative training of the front end or back end alone can improve accuracy, but joint training is considerably better.
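Rprop, the optimizer this paper uses for joint training, adapts a per-parameter step size from the sign of the gradient alone. A minimal single-parameter sketch (without the weight-backtracking variant, and with illustrative hyperparameter values) looks like this:

```python
def rprop_step(grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One Rprop update for a single parameter.

    The step size grows when the gradient keeps its sign across iterations
    and shrinks when it flips; only the sign of the gradient is used, never
    its magnitude, which makes the method robust to badly scaled objectives
    such as an MMI criterion over heterogeneous front/back-end parameters.
    """
    if grad * prev_grad > 0:
        step = min(step * eta_plus, step_max)    # same sign: accelerate
    elif grad * prev_grad < 0:
        step = max(step * eta_minus, step_min)   # sign flip: back off
    sign = 1 if grad > 0 else (-1 if grad < 0 else 0)
    return -step * sign, step

# Toy use: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, step, prev_grad = 0.0, 0.1, 0.0
for _ in range(100):
    g = 2 * (w - 3)
    delta, step = rprop_step(g, prev_grad, step)
    w += delta
    prev_grad = g
```

In the paper's setting, updates like this would be applied jointly to every front-end and back-end parameter, with the gradient taken from the MMI objective rather than a toy quadratic.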