Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
With the rapid development of speech assistants, adapting server-side
automatic speech recognition (ASR) solutions for on-device use has become
crucial. Researchers and industry prefer to use end-to-end ASR systems for
on-device speech recognition tasks. This is because end-to-end systems can be
made resource-efficient while maintaining a higher quality compared to hybrid
systems. However, building end-to-end models requires a significant amount of
speech data. Another challenging task associated with speech assistants is
personalization, which mainly lies in handling out-of-vocabulary (OOV) words.
In this work, we consider building an effective end-to-end ASR system in
low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel
Georgian tasks. To address the aforementioned problems, we propose a method of
dynamic acoustic unit augmentation based on the BPE-dropout technique. It
non-deterministically tokenizes utterances to extend the tokens' contexts and
to regularize their distribution, improving the model's recognition of unseen words.
It also reduces the need for optimal subword vocabulary size search. The
technique provides a steady improvement in regular and personalized
(OOV-oriented) speech recognition tasks (at least 6% relative WER and 25%
relative F-score) at no additional computational cost. Owing to the use of
BPE-dropout, our monolingual Turkish Conformer established a competitive result
with 22.2% character error rate (CER) and 38.9% word error rate (WER), which is
close to the best published multilingual system. Comment: 16 pages, 7 figures
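The abstract describes BPE-dropout as non-deterministic tokenization: during training, each learned BPE merge is skipped with some probability, so the same word is segmented differently across epochs. A minimal sketch of that idea, with a hypothetical toy merge table (the real vocabularies in the paper are learned from Babel data):

```python
import random

# Hypothetical toy merge table, ordered by priority as learned by BPE.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

def bpe_tokenize(word, merges, dropout=0.0, rng=random):
    """Segment `word` with BPE merges applied in priority order.
    With dropout > 0, each applicable merge is randomly skipped with
    probability `dropout`, yielding varied segmentations (BPE-dropout)."""
    tokens = list(word)
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right and rng.random() >= dropout:
                tokens[i:i + 2] = [left + right]  # merge the adjacent pair
            else:
                i += 1
    return tokens

# Deterministic BPE (dropout=0) gives the canonical segmentation:
print(bpe_tokenize("lower", MERGES))  # ['low', 'er']

# With dropout, training sees alternative segmentations of the same word,
# which regularizes the subword distribution:
random.seed(0)
print(bpe_tokenize("lower", MERGES, dropout=0.5))
```

Because dropout only changes how training utterances are segmented, inference can still use plain deterministic BPE, which is consistent with the paper's claim of no additional computational cost.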
A comparison of cloud-based speech recognition engines
Human-machine interaction is present in our routines and has become increasingly natural these days. Devices can record a person's speech, transcribe it into text, and execute tasks accordingly. This kind of interaction improves productivity for several operations, since it allows users to keep their hands free through a more natural interface. Moreover, speech recognition engines need to ensure reliability and speed. However, the maturity of speech recognition systems varies across providers and, most importantly, across languages. For instance, Brazilian Portuguese has the particularity of using several foreign terms, especially in corporate environments. In this paper, an experiment was conducted to evaluate three speech recognition engines regarding accuracy and performance: Bing Speech API, Google Cloud Speech, and IBM Watson Speech to Text. To obtain the accuracy value, we used a well-known string similarity algorithm. The results showed a high level of accuracy for Google Cloud Speech and Bing Speech API. However, the best accuracy, provided by Google's service, came at a cost in performance, requiring additional time to produce the speech-to-text transcription.
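The abstract does not name the string similarity algorithm used to score transcriptions against the reference text. A common choice for this task is a normalized Levenshtein (edit-distance) similarity; the sketch below is one such measure, not necessarily the paper's exact metric:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(reference, hypothesis):
    """Normalized similarity in [0, 1]; 1.0 means an exact transcription."""
    dist = levenshtein(reference, hypothesis)
    longest = max(len(reference), len(hypothesis), 1)
    return 1.0 - dist / longest

# Hypothetical example: an engine drops the trailing "s".
print(round(similarity("turn on the lights", "turn on the light"), 2))
```

Averaging this score over a test set gives a single accuracy figure per engine, which is how the three services can be ranked against each other.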