A practical two-stage training strategy for multi-stream end-to-end speech recognition
The multi-stream paradigm of audio processing, in which several sources are
simultaneously considered, has been an active research area for information
fusion. Our previous study offered a promising direction within end-to-end
automatic speech recognition, where parallel encoders aim to capture diverse
information, followed by stream-level fusion based on attention mechanisms to
combine the different views. However, as the number of streams (and thus
encoders) grows, the previous approach could require substantial memory and
massive amounts of parallel data for joint training. In this work, we propose a
practical two-stage training scheme.
Stage-1 trains a Universal Feature Extractor (UFE): a single-stream model
trained on all data whose encoder outputs serve as features. Stage-2 formulates
the multi-stream scheme and trains only the attention fusion module, using the
UFE features and pretrained components from Stage-1.
Experiments have been conducted in a multi-stream scenario on two datasets,
DIRHA and AMI. Compared with our previous method, this strategy achieves
relative word error rate reductions of 8.2--32.4%, while consistently
outperforming several conventional combination methods.
Comment: submitted to ICASSP 201
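Below is a minimal PyTorch sketch of the Stage-2 idea described in this abstract: frozen per-stream UFE encoder outputs are combined by a trainable stream-level attention fusion module. All module names, dimensions, and the single-head attention choice are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class StreamAttentionFusion(nn.Module):
    """Combines frozen per-stream UFE encoder outputs via stream-level attention."""
    def __init__(self, d_model, n_streams):
        super().__init__()
        # per-stream frame-level attention (single head for brevity)
        self.frame_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
            for _ in range(n_streams))
        # stream-level scorer: one scalar logit per stream context vector
        self.stream_scorer = nn.Linear(d_model, 1)

    def forward(self, query, streams):
        # query:   (B, 1, d_model) decoder state used as the attention query
        # streams: list of n_streams tensors, each (B, T_i, d_model) of UFE features
        contexts = []
        for attn, feats in zip(self.frame_attn, streams):
            ctx, _ = attn(query, feats, feats)          # (B, 1, d_model) per stream
            contexts.append(ctx)
        ctx_stack = torch.cat(contexts, dim=1)          # (B, n_streams, d_model)
        weights = torch.softmax(self.stream_scorer(ctx_stack), dim=1)
        return (weights * ctx_stack).sum(dim=1)         # fused context, (B, d_model)

In Stage-2, only a fusion module of this kind would be optimized, with the UFE encoder and the Stage-1 pretrained components kept frozen.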
A Comprehensive Survey of Automated Audio Captioning
Automated audio captioning, a task that mimics human perception and
innovatively links audio processing with natural language processing, has seen
much progress over the last few years. Audio captioning requires
recognizing the acoustic scene, primary audio events and sometimes the spatial
and temporal relationship between events in an audio clip. It also requires
describing these elements by a fluent and vivid sentence. Deep learning-based
approaches are widely adopted to tackle this problem. This paper serves as a
comprehensive review covering the benchmark datasets, existing deep learning
techniques, and the evaluation metrics used in automated audio captioning.
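For illustration, a minimal PyTorch sketch of the encoder-decoder formulation commonly adopted in this line of work is given below: an audio encoder summarizes log-mel frames and an autoregressive decoder emits caption tokens. Model names, sizes, and layer choices are assumptions for the sketch, not a specific system from the survey.

import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    """Toy audio-captioning model: Transformer encoder over log-mel frames,
    Transformer decoder over caption tokens (positional encodings omitted)."""
    def __init__(self, n_mels=64, d_model=256, vocab_size=5000):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel:    (B, T_audio, n_mels) log-mel spectrogram frames
        # tokens: (B, T_text) caption token ids (teacher forcing)
        memory = self.encoder(self.proj(mel))                    # (B, T_audio, d_model)
        t = tokens.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=tokens.device),
                            diagonal=1)                          # block attention to future tokens
        dec = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.out(dec)                                     # (B, T_text, vocab_size) logits

Training typically minimizes token-level cross-entropy against reference captions, and the generated captions are then scored with the evaluation metrics reviewed in the paper.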
PM-MMUT: Boosted Phone-Mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition
Consonant and vowel reduction are often encountered in speech, which might
cause performance degradation in automatic speech recognition (ASR). Our
recently proposed learning strategy based on masking, Phone Masking Training
(PMT), alleviates the impact of this phenomenon in Uyghur ASR. Although PMT
achieves remarkable improvements, there is still room for further gains due to
the granularity mismatch between the masking unit of PMT (phoneme) and the
modeling unit (word-piece). To boost the performance of PMT, we propose a
multi-modeling unit training (MMUT) architecture fused with PMT (PM-MMUT). The
idea of the MMUT framework is to split the encoder into two parts: acoustic
feature sequences to phoneme-level representations (AF-to-PLR) and
phoneme-level representations to word-piece-level representations (PLR-to-WPLR).
It allows AF-to-PLR to be optimized by an intermediate phoneme-based CTC loss
to learn the rich phoneme-level context information brought by PMT.
Experimental results on Uyghur ASR show that the proposed approaches clearly
outperform pure PMT. We also conduct experiments on the 960-hour Librispeech
benchmark using ESPnet1, achieving about 10% relative WER reduction on all the
test sets without LM fusion, compared with the latest official ESPnet1
pre-trained model.
Comment: Accepted to INTERSPEECH 202
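A minimal PyTorch sketch of the split-encoder idea is given below: a lower block (AF-to-PLR) is supervised with an intermediate phoneme-level CTC loss, and an upper block (PLR-to-WPLR) is supervised with a word-piece-level loss. Layer counts, the use of plain Transformer encoder layers, CTC losses at both levels, and the 0.3 interpolation weight are assumptions, not the paper's exact recipe.

import torch
import torch.nn as nn

class SplitEncoderWithInterCTC(nn.Module):
    """Encoder split into AF-to-PLR and PLR-to-WPLR, with an intermediate
    phoneme-level CTC loss on the lower part and a word-piece loss on top."""
    def __init__(self, d_in=80, d_model=256, n_phones=60, n_wordpieces=5000):
        super().__init__()
        def block(n_layers):
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=n_layers)
        self.front = nn.Linear(d_in, d_model)
        self.af_to_plr = block(6)       # acoustic features -> phoneme-level representation
        self.plr_to_wplr = block(6)     # phoneme-level -> word-piece-level representation
        self.phone_head = nn.Linear(d_model, n_phones + 1)   # +1 for the CTC blank (index 0)
        self.wp_head = nn.Linear(d_model, n_wordpieces + 1)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, phones, phone_lens, wps, wp_lens):
        # feats: (B, T, d_in) acoustic features, possibly phone-masked as in PMT
        plr = self.af_to_plr(self.front(feats))
        wplr = self.plr_to_wplr(plr)
        phone_logp = self.phone_head(plr).log_softmax(-1).transpose(0, 1)   # (T, B, C) for CTC
        wp_logp = self.wp_head(wplr).log_softmax(-1).transpose(0, 1)
        inter_loss = self.ctc(phone_logp, phones, feat_lens, phone_lens)    # intermediate phoneme CTC
        main_loss = self.ctc(wp_logp, wps, feat_lens, wp_lens)              # word-piece-level loss
        return main_loss + 0.3 * inter_loss     # 0.3 is an assumed interpolation weight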