11 research outputs found

    Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

    Because of its streaming nature, the recurrent neural network transducer (RNN-T) is a promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition. In this paper, we describe our recent development of RNN-T models with reduced GPU memory consumption during training, a better initialization strategy, and advanced encoder modeling with future lookahead. When trained on Microsoft's 65 thousand hours of anonymized training data, the developed RNN-T model surpasses a very well trained hybrid model with both better recognition accuracy and lower latency. We further study how to customize RNN-T models to a new domain, which is important for deploying E2E models in practical scenarios. Comparing several methods that leverage text-only data in the new domain, we find that updating RNN-T's prediction and joint networks using text-to-speech audio generated from domain-specific text is the most effective. Comment: Accepted by Interspeech 202
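    The domain-customization recipe above updates only part of the model. A minimal sketch of that idea, updating the prediction and joint networks while keeping the encoder frozen; the dictionary-of-scalars "model" and SGD rule here are hypothetical toy stand-ins, not the paper's implementation:

```python
def adaptation_step(params, grads, lr=0.1,
                    trainable=("prediction", "joint")):
    """Apply an SGD update only to the named sub-networks.

    Sub-networks not listed in `trainable` (here, the encoder)
    keep their original weights, mimicking selective fine-tuning
    on TTS audio generated from domain-specific text.
    """
    updated = {}
    for name, weight in params.items():
        if name in trainable:
            updated[name] = weight - lr * grads[name]  # fine-tuned
        else:
            updated[name] = weight                     # frozen
    return updated

# Toy parameters and gradients (hypothetical values).
params = {"encoder": 1.0, "prediction": 1.0, "joint": 1.0}
grads = {"encoder": 0.5, "prediction": 0.5, "joint": 0.5}
new_params = adaptation_step(params, grads)
```

    In a real framework the same effect is achieved by marking the encoder's parameters as non-trainable before running the optimizer.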

    External Language Model Integration for Factorized Neural Transducers

    We propose an adaptation method for factorized neural transducers (FNT) with external language models. We demonstrate that both neural and n-gram external LMs add significantly more value when linearly interpolated with the predictor output than with shallow fusion, confirming that FNT forces the predictor to act like a regular language model. Further, we propose a method to integrate class-based n-gram language models into the FNT framework, resulting in accuracy gains similar to a hybrid setup. We show average gains of 18% WERR with lexical adaptation across various scenarios, and additive gains of up to 60% WERR in one entity-rich scenario through a combination of class-based n-gram and neural LMs.
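    The two LM-integration schemes contrasted above can be sketched in a few lines of plain Python; the function names, interpolation weight, and toy probabilities below are illustrative assumptions, not taken from the paper:

```python
import math

def shallow_fusion(model_logprob, lm_logprob, lam=0.3):
    # Shallow fusion: add a scaled external-LM log-probability
    # to the model's score in log space.
    return model_logprob + lam * lm_logprob

def linear_interpolation(pred_prob, lm_prob, lam=0.3):
    # Linear interpolation: mix the predictor's probability with
    # the external LM's probability in probability space, then
    # return the log of the mixture.
    return math.log((1.0 - lam) * pred_prob + lam * lm_prob)

# Toy scores for one candidate token (hypothetical values).
fused = shallow_fusion(math.log(0.6), math.log(0.2))
interp = linear_interpolation(0.6, 0.2)
```

    Interpolation operates on normalized probabilities, which is only meaningful when the predictor output itself behaves like a language model, as the abstract argues is the case for FNT.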

    Self-supervised learning for automatic speech recognition in low-resource environments

    Supervised deep neural networks trained on substantial amounts of annotated speech data have demonstrated impressive performance across a spectrum of spoken language processing applications, frequently establishing themselves as the leading models in their respective competitions. Nonetheless, a significant challenge arises from the heavy reliance on extensive annotated data for training these systems. This reliance poses a scalability limitation, hindering the continual enhancement of state-of-the-art performance. It also presents a more fundamental obstacle to deploying deep neural networks in speech-related domains where acquiring labeled data is inherently arduous, expensive, or time-intensive; we treat these as low-resource ASR problems in this thesis. Unlike annotated speech data, untranscribed audio is typically far more cost-effective to collect. In this thesis, we investigate the application of self-supervised learning to low-resource tasks, a learning approach in which the training objective is derived directly from the input data itself. We employ this method to harness the scalability and affordability of untranscribed audio in problems where we lack sufficient training data, with the goal of enhancing the performance of spoken language technology. In particular, we propose three self-supervised methodologies: one based on a two-step fine-tuning procedure, and two that revolve around identifying improved hidden units. These approaches are designed to learn contextualized speech representations from speech data lacking annotations. We demonstrate that our self-supervised techniques learn representations that convey the higher-level characteristics of speech signals more effectively than conventional acoustic features, and we show how these representations enhance the performance of deep neural networks on ASR tasks with limited resources.
    Beyond introducing novel learning algorithms, we conduct in-depth analyses to understand the properties of the learned self-supervised representations and to elucidate the design elements that distinguish one self-supervised model from another.