Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview
The end-to-end (E2E) approach is gradually replacing hybrid models for automatic
speech recognition (ASR) tasks. However, the optimization of E2E models lacks
an intuitive method for handling decoding shifts, especially in scenarios with
a large number of domain-specific rare words that carry specific, important
meanings. Furthermore, the absence of knowledge-intensive speech datasets in
academia has been a significant limiting factor, and commonly used speech
corpora differ significantly from realistic conversation. To address
these challenges, we present Medical Interview (MED-IT), a multi-turn
consultation speech dataset that contains a substantial number of
knowledge-intensive named entities. We also explore methods to enhance the
recognition performance of rare words for E2E models. We propose a novel
approach, post-decoder biasing, which constructs a transform probability matrix
based on the distribution of training transcriptions. This guides the model to
prioritize recognizing words in the biasing list. In our experiments, on
subsets of rare words that appear in the training speech between 10 and 20 times
and between 1 and 5 times, the proposed method achieves relative improvements
of 9.3% and 5.1%, respectively.
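
To make the idea concrete, below is a minimal, hypothetical sketch of how such post-decoder biasing could be realized: a token-to-token transform probability matrix is estimated from the distribution of training transcriptions and used to reweight decoder posteriors for tokens on the biasing list. The names, the bigram-style matrix construction, and the interpolation weight are illustrative assumptions, not the paper's exact method.

    # Illustrative sketch only; the paper's exact construction may differ.
    import numpy as np

    def build_transform_matrix(transcripts, vocab, smoothing=1e-3):
        """Estimate token-to-token transition probabilities from the
        distribution of training transcriptions."""
        idx = {w: i for i, w in enumerate(vocab)}
        counts = np.full((len(vocab), len(vocab)), smoothing)
        for sent in transcripts:
            toks = [idx[w] for w in sent if w in idx]
            for a, b in zip(toks, toks[1:]):
                counts[a, b] += 1.0
        return counts / counts.sum(axis=1, keepdims=True)

    def bias_posteriors(posteriors, prev_token, T, bias_ids, weight=0.3):
        """Reweight decoder posteriors toward tokens on the biasing list
        using the transform matrix, then renormalize."""
        biased = posteriors.copy()
        biased[bias_ids] = ((1 - weight) * posteriors[bias_ids]
                            + weight * T[prev_token, bias_ids])
        return biased / biased.sum()
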
CUE Vectors: Modular Training of Language Models Conditioned on Diverse Contextual Signals
We propose a framework to modularize the training of neural language models
that use diverse forms of sentence-external context (including metadata) by
eliminating the need to jointly train sentence-external and within-sentence
encoders. Our approach, contextual universal embeddings (CUE), trains LMs on
one set of context variables, such as date and author, and adapts to novel
metadata types, such as article title or previous sentence. The model consists of a
pretrained neural sentence LM, a BERT-based context encoder, and a masked
transformer decoder that estimates LM probabilities using sentence-internal and
sentence-external information. When context or metadata are unavailable, our
model learns to combine contextual and sentence-internal information using
noisy oracle unigram embeddings as a proxy. Real contextual information can be
introduced later and used to adapt a small number of parameters that map
contextual data into the decoder's embedding space. We validate the CUE
framework on a NYTimes text corpus with multiple metadata types, for which the
LM perplexity can be lowered from 36.6 to 27.4 by conditioning on context.
Bootstrapping a contextual LM with only a subset of the context/metadata during
training retains 85% of the achievable gain. Training the model initially with
proxy context retains 67% of the perplexity gain after adapting to real
context. Furthermore, we can swap one type of pretrained sentence LM for
another without retraining the context encoders, by only adapting the decoder
model. Overall, we obtain a modular framework that allows incremental, scalable
training of context-enhanced LMs.
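
As a rough illustration of the adaptation step described above, the sketch below maps an external context embedding (e.g., from a BERT-based context encoder) into a sentence LM's embedding space through a small linear adapter. Module names and dimensions are assumptions for illustration, not the paper's implementation.

    # Hedged sketch: a frozen sentence LM supplies hidden states, and a
    # small linear adapter maps an external context embedding into the
    # LM's embedding space. Names and dimensions are illustrative.
    import torch
    import torch.nn as nn

    class CueAdapter(nn.Module):
        def __init__(self, ctx_dim=768, lm_dim=512, vocab_size=32000):
            super().__init__()
            self.map_ctx = nn.Linear(ctx_dim, lm_dim)  # the few adapted params
            self.out = nn.Linear(lm_dim, vocab_size)

        def forward(self, lm_hidden, ctx_emb):
            # lm_hidden: (batch, seq, lm_dim) sentence-internal states
            # ctx_emb:   (batch, ctx_dim) sentence-external context
            fused = lm_hidden + self.map_ctx(ctx_emb).unsqueeze(1)
            return torch.log_softmax(self.out(fused), dim=-1)

In this reading, adapting to a new metadata type amounts to updating only the small mapping layer while the pretrained sentence LM stays frozen, which is what makes the training modular.
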
An Efficient and Robust Multi-Stream Framework for End-to-End Speech Recognition
In voice-enabled domestic or meeting environments, distributed microphone arrays aim to convert distant-speech interaction into text with high accuracy.
However, when speech is dynamically corrupted by noise, reverberation, or human movement, there is no guarantee that any microphone array (stream) is constantly informative. In these cases, an appropriate strategy to dynamically fuse streams is necessary.
The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge. Such streams could be microphone arrays, frequency bands, various modalities, etc. Hence, robust stream fusion is crucial to emphasize the more informative streams over corrupted ones, especially under unseen conditions. This thesis focuses on improving the performance and robustness of speech recognition in multi-stream scenarios.
With the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received greater attention. In this thesis, a multi-stream framework is presented based on the joint Connectionist Temporal Classification/Attention (CTC/ATT) E2E model, where parallel streams are represented by separate encoders. On top of the regular attention networks, a secondary stream-fusion network is introduced to steer the decoder toward the most informative streams.
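
The stream-fusion idea can be sketched as a small attention layer over per-stream context vectors, where higher weights emphasize the currently more informative stream. The following is a hedged illustration under assumed shapes and scoring function, not the thesis's actual network.

    # Hedged sketch of a secondary stream-fusion layer: per-stream context
    # vectors (from separate encoder attention) are combined with learned
    # softmax weights. Shapes and scoring are illustrative assumptions.
    import torch
    import torch.nn as nn

    class StreamFusion(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.score = nn.Linear(dim, 1)  # scores each stream's context

        def forward(self, stream_ctx):
            # stream_ctx: (batch, n_streams, dim)
            w = torch.softmax(self.score(stream_ctx).squeeze(-1), dim=-1)
            fused = (w.unsqueeze(-1) * stream_ctx).sum(dim=1)
            return fused, w  # fused context plus per-stream weights
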
The MEM-Array model aims at improving far-field ASR robustness using microphone arrays, each handled by a separate encoder. Since an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training strategy is designed to address these issues. Furthermore, a two-stage augmentation scheme is presented to improve the robustness of the multi-stream model. In MEM-Res, two heterogeneous encoders with different architectures, temporal resolutions, and separate CTC networks work in parallel to extract complementary information from the same acoustics. Compared with the best single-stream performance, both models achieve substantial improvements, outperforming alternative fusion strategies.
While the proposed framework optimizes information in multi-stream scenarios, this thesis also studies Performance Monitoring (PM) measures to predict whether the recognition results of an E2E model are reliable without ground-truth knowledge. Four PM techniques are investigated, suggesting that PM measures based on attention distributions and decoder posteriors are well correlated with true performance.
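
As an illustration of such PM measures, the sketch below computes two simple confidence proxies over a decoded hypothesis: the average entropy of the attention distribution and the mean maximum decoder posterior. The exact measures in the thesis may differ; these formulas are assumptions for illustration.

    # Hedged sketch of two PM proxies; illustrative only.
    import numpy as np

    def attention_entropy(att):
        # att: (dec_steps, enc_frames), each row a distribution
        return float(-(att * np.log(att + 1e-12)).sum(axis=1).mean())

    def mean_max_posterior(post):
        # post: (dec_steps, vocab), each row a distribution
        return float(post.max(axis=1).mean())

Under this reading, lower attention entropy and a higher mean posterior would indicate a more confident, and likely more accurate, hypothesis.
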