7 research outputs found
Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition
Optimization of modern ASR architectures is a high-priority task, since it
saves substantial computational resources during model training and inference.
This work proposes a new Uconv-Conformer architecture based on the standard
Conformer model. It consecutively reduces the input sequence length by a
factor of 16, which speeds up the intermediate layers. To solve the
convergence issue caused by such a significant reduction of the time
dimension, we use upsampling blocks, as in the U-Net architecture, to ensure
correct CTC loss calculation and to stabilize network training. The
Uconv-Conformer architecture is not only faster in terms of training and
inference speed but also shows a better WER than the baseline Conformer. Our
best Uconv-Conformer model achieves 47.8% and 23.5% inference acceleration on
the CPU and GPU, respectively. The relative WER reduction is 7.3% and 9.2% on
LibriSpeech test_clean and test_other, respectively.
Comment: 5 pages, 1 figure
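The need for the U-Net-style upsampling can be illustrated with a toy length calculation (not from the paper; the frame rate, reduction factors, and transcript are assumptions): CTC requires at least one encoder frame per output label, plus a blank frame between repeated labels, so a 16x time reduction can leave too few frames for the loss to be computed.

```python
import math

def ctc_min_frames(labels):
    # CTC needs at least one frame per label, plus one extra frame
    # between each pair of identical consecutive labels (for the blank).
    extra = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + extra

def encoder_frames(input_frames, reduction):
    # Frames remaining after time reduction by the given factor.
    return math.ceil(input_frames / reduction)

# A 3-second utterance at 100 frames/s with a 25-character transcript:
T, labels = 300, list("hello world, how are you?")
assert encoder_frames(T, 4) >= ctc_min_frames(labels)   # standard 4x: fine
assert encoder_frames(T, 16) < ctc_min_frames(labels)   # 16x: too few frames
# Upsampling the deepest representations (as U-Net-style blocks do)
# restores enough frames for a valid CTC loss:
assert encoder_frames(T, 16) * 4 >= ctc_min_frames(labels)
```

The arithmetic shows why the intermediate layers can run at the reduced rate while the CTC output must be computed at a higher one.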
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Data augmentation is one of the most effective ways to make end-to-end
automatic speech recognition (ASR) perform close to the conventional hybrid
approach, especially when dealing with low-resource tasks. Using recent
advances in speech synthesis (text-to-speech, or TTS), we build our TTS system
on an ASR training database and then extend the data with synthesized speech to
train a recognition model. We argue that, when the training data amount is
relatively low, this approach can allow an end-to-end model to reach hybrid
systems' quality. For an artificial low-to-medium-resource setup, we compare
the proposed augmentation with the semi-supervised learning technique. We also
investigate the influence of vocoder usage on final ASR performance by
comparing the Griffin-Lim algorithm with our modified LPCNet. When combined
with an external language model, our approach outperforms a semi-supervised
setup on LibriSpeech test-clean and is only 33% worse than a comparable
supervised setup.
Our system establishes a competitive result for end-to-end ASR trained on the
LibriSpeech train-clean-100 set, with a WER of 4.3% for test-clean and 13.5%
for test-other.
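The augmentation recipe can be sketched schematically (a minimal illustration, not the paper's pipeline; `synthesize` is a hypothetical stand-in for the TTS-plus-vocoder system, and `synth_ratio` is an assumed knob):

```python
import random

def synthesize(text):
    """Stand-in for a TTS system trained on the ASR corpus itself
    (e.g. with an LPCNet-style vocoder); here it returns a dummy
    waveform of roughly 80 samples per character."""
    return [0.0] * (len(text) * 80)

def augment(real_pairs, extra_texts, synth_ratio=1.0):
    """Extend (audio, transcript) training pairs with synthesized speech.
    synth_ratio controls how many synthetic utterances are added
    relative to the real set (an assumed parameter, not from the paper)."""
    n_synth = int(len(real_pairs) * synth_ratio)
    texts = random.sample(extra_texts, min(n_synth, len(extra_texts)))
    return real_pairs + [(synthesize(t), t) for t in texts]

real = [([0.1, 0.2], "yes"), ([0.3], "no")]
mixed = augment(real, ["maybe", "hello there", "goodbye"], synth_ratio=1.0)
assert len(mixed) == 4  # two real + two synthetic utterances
```

The ASR model is then trained on the mixed set exactly as on real data, which is what lets the approach substitute for collecting more recordings.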
Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
With the rapid development of speech assistants, adapting server-oriented
automatic speech recognition (ASR) solutions to run directly on the device has
become crucial. Researchers and industry prefer to use end-to-end ASR systems for
on-device speech recognition tasks. This is because end-to-end systems can be
made resource-efficient while maintaining a higher quality compared to hybrid
systems. However, building end-to-end models requires a significant amount of
speech data. Another challenging task associated with speech assistants is
personalization, which mainly lies in handling out-of-vocabulary (OOV) words.
In this work, we consider building an effective end-to-end ASR system in
low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel
Georgian tasks. To address the aforementioned problems, we propose a method of
dynamic acoustic unit augmentation based on the BPE-dropout technique. It
non-deterministically tokenizes utterances to extend the tokens' contexts and
to regularize their distribution, improving the model's recognition of unseen words.
It also reduces the need for optimal subword vocabulary size search. The
technique provides a steady improvement in regular and personalized
(OOV-oriented) speech recognition tasks (at least a 6% relative WER reduction
and a 25% relative F-score improvement) at no additional computational cost.
Owing to the use of BPE-dropout, our monolingual Turkish Conformer establishes
a competitive result with a 22.2% character error rate (CER) and a 38.9% word
error rate (WER), close to the best published multilingual system.
Comment: 16 pages, 7 figures
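The core idea of BPE-dropout can be shown with a toy byte-pair encoder (a simplified sketch, not the paper's tokenizer; the merge table and dropout value are made up): each applicable merge is skipped with some probability, so the same word receives different subword segmentations across training epochs.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=random):
    """Toy BPE with merge dropout: each applicable merge is skipped
    with probability `dropout`, yielding stochastic segmentations."""
    tokens = list(word)
    for pair in merges:  # merges in priority order, e.g. ("l", "o")
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair and rng.random() >= dropout:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
            else:
                i += 1
    return tokens

merges = [("l", "o"), ("lo", "w")]
assert bpe_encode("low", merges, dropout=0.0) == ["low"]  # deterministic BPE
random.seed(0)
samples = {tuple(bpe_encode("low", merges, dropout=0.5)) for _ in range(50)}
assert len(samples) > 1  # dropout produces multiple segmentations
```

Because rare and unseen words are then seen under many segmentations, the model's subword distribution is regularized without changing the vocabulary, which is why no extra computational cost is incurred at inference.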
Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
Speaker diarization for real-life scenarios is an extremely challenging
problem. Widely used clustering-based diarization approaches perform rather
poorly in such conditions, mainly due to the limited ability to handle
overlapping speech. We propose a novel Target-Speaker Voice Activity Detection
(TS-VAD) approach, which directly predicts the activity of each speaker in each
time frame. The TS-VAD model takes conventional speech features (e.g., MFCC)
along with i-vectors for each speaker as inputs. A set of binary classification
output layers produces the activities of the individual speakers. I-vectors can
be estimated
iteratively, starting with a strong clustering-based diarization. We also
extend the TS-VAD approach to the multi-microphone case using a simple
attention mechanism on top of hidden representations extracted from the
single-channel TS-VAD model. Moreover, post-processing strategies for the
predicted speaker activity probabilities are investigated. Experiments on the
CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results,
outperforming the baseline x-vector-based system by more than 30% absolute
Diarization Error Rate (DER).
Comment: Accepted to Interspeech 2020
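The decoding step after the per-speaker sigmoid outputs can be sketched as follows (a schematic illustration only; thresholding with a minimum-duration filter is one simple post-processing strategy, and the threshold and probabilities here are invented):

```python
def decode_activities(probs, threshold=0.5, min_frames=2):
    """Turn per-speaker, per-frame activity probabilities (as the
    binary output layers of a TS-VAD-style model would emit) into
    (speaker, start_frame, end_frame) segments."""
    segments = []
    for spk, frame_probs in enumerate(probs):
        start = None
        for t, p in enumerate(frame_probs + [0.0]):  # sentinel to flush
            if p >= threshold and start is None:
                start = t  # speaker becomes active
            elif p < threshold and start is not None:
                if t - start >= min_frames:  # drop too-short segments
                    segments.append((spk, start, t))
                start = None
    return segments

probs = [
    [0.9, 0.8, 0.2, 0.1, 0.7, 0.9],  # speaker 0
    [0.1, 0.6, 0.7, 0.9, 0.2, 0.1],  # speaker 1 (overlaps speaker 0)
]
assert decode_activities(probs) == [(0, 0, 2), (0, 4, 6), (1, 1, 4)]
```

Because each speaker is decoded independently, overlapping speech (as at frame 1 above) is handled naturally, which is exactly what clustering-based diarization struggles with.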
Search for Gravitational-Neutrino Correlations on Ground-Based Detectors
The problem of joint data processing from ground-based gravitational and neutrino detectors is considered in order to increase the detection efficiency for collapsing objects in the Galaxy. The “neutrino-gravitational correlation” algorithm is developed within the framework of optimal filtering theory as applied to the well-known OGRAN and BUST facilities located at the BNO INR RAS. The experience of analyzing neutrino and gravitational data obtained during the outburst of supernova SN1987A is used. The sequential steps of the algorithm are presented, and formulas for estimating the statistical efficiency of a two-channel recorder are obtained.
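The basic step of a two-channel coincidence search can be sketched as follows (a schematic of the idea only; the actual OGRAN/BUST algorithm applies optimal filtering, and the event times and window are invented):

```python
def coincidences(grav_times, nu_times, window):
    """Return gravitational/neutrino event pairs whose trigger times
    fall within +/- window seconds of each other."""
    pairs = []
    for g in sorted(grav_times):
        for n in sorted(nu_times):
            if abs(g - n) <= window:
                pairs.append((g, n))
    return pairs

grav = [10.0, 55.2, 120.7]   # candidate triggers from the gravitational channel
nu = [10.4, 80.0, 121.0]     # candidate triggers from the neutrino channel
assert coincidences(grav, nu, window=0.5) == [(10.0, 10.4), (120.7, 121.0)]
```

For independent Poisson backgrounds with rates n1 and n2, the expected accidental coincidence rate is 2 * window * n1 * n2, which is what makes combining two channels suppress false alarms relative to either channel alone.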