Audio Caption: Listen and Tell
A growing amount of research has shed light on machine perception of audio
events, most of which concerns detection and classification tasks. However,
human-like perception of audio scenes involves not only detecting and
classifying sounds, but also summarizing the relationships between
different audio events. Comparable research, such as image captioning, has been
conducted, yet the audio field remains largely unexplored. This paper introduces a
manually-annotated dataset for audio captioning. The purpose is to automatically
generate natural sentences for audio scene description and to bridge the gap
between machine perception of audio and of images. The whole dataset is labelled in
Mandarin, and translated English annotations are also included. A baseline
encoder-decoder model is provided for both English and Mandarin. Similar BLEU
scores are obtained for both languages: the model can generate understandable
and data-related captions based on the dataset.
Comment: accepted by ICASSP 2019
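
To make the baseline concrete, here is a minimal PyTorch sketch of an encoder-decoder captioner of the kind described: a recurrent encoder summarizes log-mel spectrogram frames and a recurrent decoder emits caption tokens. All module names, hyperparameters, and the GRU choice are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class AudioCaptioner(nn.Module):
        # Hypothetical baseline: a GRU encodes log-mel frames into a fixed
        # state, and a GRU decoder generates caption tokens from that state.
        def __init__(self, n_mels=64, hidden=256, vocab_size=5000, embed_dim=128):
            super().__init__()
            self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.decoder = nn.GRU(embed_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, mel, tokens):
            # mel: (B, T, n_mels) spectrogram; tokens: (B, L) teacher-forced prefix
            _, h = self.encoder(mel)                 # (1, B, hidden) clip summary
            dec_out, _ = self.decoder(self.embed(tokens), h)
            return self.out(dec_out)                 # (B, L, vocab) next-token logits

    model = AudioCaptioner()
    logits = model(torch.randn(2, 500, 64), torch.randint(0, 5000, (2, 12)))
    print(logits.shape)                              # torch.Size([2, 12, 5000])

At training time the logits would be scored against the reference caption with cross-entropy; BLEU, as in the abstract, is computed on decoded sentences at evaluation time.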
Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network
Contextual information plays a crucial role in speech recognition
technologies, and incorporating it into end-to-end speech recognition models
has drawn considerable interest recently. However, previous deep biasing methods lacked
explicit supervision for the bias task. In this study, we introduce a contextual
phrase prediction network for an attention-based deep biasing method. This network
predicts the context phrases occurring in an utterance using contextual embeddings and
calculates a bias loss to assist the training of the contextualized model. Our
method achieves a significant word error rate (WER) reduction across various
end-to-end speech recognition models. Experiments on the LibriSpeech corpus
show that the proposed model obtains a 12.1% relative WER improvement over the
baseline, and the WER on context phrases decreases by a relative
40.5%. Moreover, by applying a context phrase filtering strategy, we also
effectively eliminate the WER degradation observed when using a larger biasing list.
Comment: accepted by Interspeech 2023
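
The explicit supervision idea reads naturally as an auxiliary classification loss. Below is a hypothetical PyTorch sketch, assuming each bias phrase gets an embedding that attends over the encoder states and a binary label marking whether the phrase is spoken; the paper's actual architecture and loss weighting may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContextPhrasePredictor(nn.Module):
        # Auxiliary bias task (sketch): phrase embeddings attend to the
        # acoustic encoder states, and a classifier predicts which phrases occur.
        def __init__(self, dim=256):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.score = nn.Linear(dim, 1)

        def forward(self, enc_states, phrase_emb, phrase_labels):
            # enc_states: (B, T, D) encoder outputs
            # phrase_emb: (B, N, D) one embedding per phrase in the bias list
            # phrase_labels: (B, N) 1.0 if the phrase is spoken in the utterance
            ctx, _ = self.attn(phrase_emb, enc_states, enc_states)
            logits = self.score(ctx).squeeze(-1)     # (B, N) occurrence logits
            return F.binary_cross_entropy_with_logits(logits, phrase_labels)

    predictor = ContextPhrasePredictor()
    bias_loss = predictor(torch.randn(2, 100, 256), torch.randn(2, 8, 256),
                          torch.randint(0, 2, (2, 8)).float())
    # total_loss = asr_loss + lambda_bias * bias_loss  (lambda_bias: tuning weight)

The bias loss gives the biasing branch a training signal of its own, rather than relying solely on the ASR objective to learn useful phrase attention.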
Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition
By incorporating additional contextual information, deep biasing methods have
emerged as a promising solution for speech recognition of personalized words.
However, for real-world voice assistants, always biasing towards such personalized
words with high prediction scores can significantly degrade the recognition of
common words. To address this issue, we propose an adaptive
contextual biasing method based on the Context-Aware Transformer Transducer (CATT)
that uses the biased encoder and predictor embeddings to perform streaming
prediction of contextual phrase occurrences. This prediction is then used to
dynamically switch the bias list on and off, enabling the model to adapt to
both personalized and common scenarios. Experiments on LibriSpeech and internal
voice assistant datasets show that our approach achieves up to 6.7% and
20.7% relative reductions in WER and CER, respectively, compared to the baseline,
and mitigates up to 96.7% and 84.9% of the relative WER and CER increase for
common cases. Furthermore, our approach has minimal performance impact in
personalized scenarios while maintaining a streaming inference pipeline with a
negligible increase in real-time factor (RTF).
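
The on/off switching can be pictured as a small gate over the bias contribution. The sketch below, again in PyTorch, is a loose reading: the CATT internals, the exact inputs to the occurrence predictor, and the thresholding scheme are assumptions rather than the paper's design.

    import torch
    import torch.nn as nn

    class AdaptiveBiasGate(nn.Module):
        # Sketch: from the biased encoder and predictor embeddings at the
        # current streaming step, predict whether any contextual phrase is
        # being spoken, and zero out the bias contribution otherwise.
        def __init__(self, dim=256, threshold=0.5):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 1))
            self.threshold = threshold

        def forward(self, enc_bias, pred_bias, bias_context):
            # enc_bias, pred_bias: (B, D) biased embeddings for the current step
            # bias_context: (B, D) attention summary over the bias list
            p = torch.sigmoid(self.gate(torch.cat([enc_bias, pred_bias], dim=-1)))
            on = (p > self.threshold).float()        # hard on/off switch
            return on * bias_context                 # bias suppressed for common words

    gate = AdaptiveBiasGate()
    out = gate(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 256))

A hard threshold like this only makes sense at inference; training the gate would need either a soft relaxation or an explicit occurrence loss such as the one sketched for the previous paper.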