Audio Caption: Listen and Tell
A growing amount of research has shed light on machine perception of audio
events, most of which concerns detection and classification tasks. However,
human-like perception of audio scenes involves not only detecting and
classifying sounds, but also summarizing the relationships between
different audio events. Comparable research, such as image captioning, has been
conducted, yet the audio field remains largely unexplored. This paper introduces a
manually-annotated dataset for audio captioning. The purpose is to automatically
generate natural sentences for audio scene description and to bridge the gap
between machine perception of audio and of images. The whole dataset is labelled in
Mandarin, and translated English annotations are also included. A baseline
encoder-decoder model is provided for both English and Mandarin. Similar BLEU
scores are obtained for both languages: the model can generate understandable
and data-related captions based on the dataset.
Comment: accepted by ICASSP 2019
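
To make the baseline concrete, here is a minimal PyTorch sketch of an encoder-decoder captioner of the kind described: a recurrent encoder summarizes log-mel spectrogram frames and a recurrent decoder emits caption tokens. All module names, hyperparameters, and the GRU choice are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class AudioCaptioner(nn.Module):
        # Hypothetical baseline: a GRU encodes log-mel frames into a fixed
        # state, and a GRU decoder generates caption tokens from that state.
        def __init__(self, n_mels=64, hidden=256, vocab_size=5000, embed_dim=128):
            super().__init__()
            self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.decoder = nn.GRU(embed_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, mel, tokens):
            # mel: (B, T, n_mels) spectrogram; tokens: (B, L) teacher-forced prefix
            _, h = self.encoder(mel)                 # (1, B, hidden) clip summary
            dec_out, _ = self.decoder(self.embed(tokens), h)
            return self.out(dec_out)                 # (B, L, vocab) next-token logits

    model = AudioCaptioner()
    logits = model(torch.randn(2, 500, 64), torch.randint(0, 5000, (2, 12)))
    print(logits.shape)                              # torch.Size([2, 12, 5000])

At training time the logits would be scored against the reference caption with cross-entropy; BLEU, as in the abstract, is computed on decoded sentences at evaluation time.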
Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network
Contextual information plays a crucial role in speech recognition
technologies, and incorporating it into end-to-end speech recognition models
has drawn considerable interest recently. However, previous deep biasing methods lacked
explicit supervision for the bias task. In this study, we introduce a contextual
phrase prediction network for an attention-based deep biasing method. This network
predicts the context phrases occurring in an utterance using contextual embeddings and
calculates a bias loss to assist the training of the contextualized model. Our
method achieves a significant word error rate (WER) reduction across various
end-to-end speech recognition models. Experiments on the LibriSpeech corpus
show that the proposed model obtains a 12.1% relative WER improvement over the
baseline, and the WER on context phrases decreases by a relative
40.5%. Moreover, by applying a context phrase filtering strategy, we also
effectively eliminate the WER degradation observed when using a larger biasing list.
Comment: accepted by Interspeech 2023
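
The explicit supervision idea reads naturally as an auxiliary classification loss. Below is a hypothetical PyTorch sketch, assuming each bias phrase gets an embedding that attends over the encoder states and a binary label marking whether the phrase is spoken; the paper's actual architecture and loss weighting may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContextPhrasePredictor(nn.Module):
        # Auxiliary bias task (sketch): phrase embeddings attend to the
        # acoustic encoder states, and a classifier predicts which phrases occur.
        def __init__(self, dim=256):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.score = nn.Linear(dim, 1)

        def forward(self, enc_states, phrase_emb, phrase_labels):
            # enc_states: (B, T, D) encoder outputs
            # phrase_emb: (B, N, D) one embedding per phrase in the bias list
            # phrase_labels: (B, N) 1.0 if the phrase is spoken in the utterance
            ctx, _ = self.attn(phrase_emb, enc_states, enc_states)
            logits = self.score(ctx).squeeze(-1)     # (B, N) occurrence logits
            return F.binary_cross_entropy_with_logits(logits, phrase_labels)

    predictor = ContextPhrasePredictor()
    bias_loss = predictor(torch.randn(2, 100, 256), torch.randn(2, 8, 256),
                          torch.randint(0, 2, (2, 8)).float())
    # total_loss = asr_loss + lambda_bias * bias_loss  (lambda_bias: tuning weight)

The bias loss gives the biasing branch a training signal of its own, rather than relying solely on the ASR objective to learn useful phrase attention.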
Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition
By incorporating additional contextual information, deep biasing methods have
emerged as a promising solution for speech recognition of personalized words.
However, for real-world voice assistants, always biasing towards such personalized
words with high prediction scores can significantly degrade the recognition of
common words. To address this issue, we propose an adaptive
contextual biasing method based on the Context-Aware Transformer Transducer (CATT)
that uses the biased encoder and predictor embeddings to perform streaming
prediction of contextual phrase occurrences. This prediction is then used to
dynamically switch the bias list on and off, enabling the model to adapt to
both personalized and common scenarios. Experiments on LibriSpeech and internal
voice assistant datasets show that our approach achieves up to 6.7% and
20.7% relative reductions in WER and CER, respectively, compared to the baseline,
and mitigates up to 96.7% and 84.9% of the relative WER and CER increase for
common cases. Furthermore, our approach has minimal performance impact in
personalized scenarios while maintaining a streaming inference pipeline with a
negligible increase in real-time factor (RTF).
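
The on/off switching can be pictured as a small gate over the bias contribution. The sketch below, again in PyTorch, is a loose reading: the CATT internals, the exact inputs to the occurrence predictor, and the thresholding scheme are assumptions rather than the paper's design.

    import torch
    import torch.nn as nn

    class AdaptiveBiasGate(nn.Module):
        # Sketch: from the biased encoder and predictor embeddings at the
        # current streaming step, predict whether any contextual phrase is
        # being spoken, and zero out the bias contribution otherwise.
        def __init__(self, dim=256, threshold=0.5):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 1))
            self.threshold = threshold

        def forward(self, enc_bias, pred_bias, bias_context):
            # enc_bias, pred_bias: (B, D) biased embeddings for the current step
            # bias_context: (B, D) attention summary over the bias list
            p = torch.sigmoid(self.gate(torch.cat([enc_bias, pred_bias], dim=-1)))
            on = (p > self.threshold).float()        # hard on/off switch
            return on * bias_context                 # bias suppressed for common words

    gate = AdaptiveBiasGate()
    out = gate(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 256))

A hard threshold like this only makes sense at inference; training the gate would need either a soft relaxation or an explicit occurrence loss such as the one sketched for the previous paper.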