7 research outputs found
Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition
Optimization of modern ASR architectures is a high-priority task, since it
saves substantial computational resources during model training and inference.
This work proposes a new Uconv-Conformer architecture based on the standard
Conformer model. It consecutively reduces the input sequence length by a
factor of 16, which speeds up the intermediate layers. To solve the
convergence issue caused by such a significant reduction of the time
dimension, we use upsampling blocks, as in the U-Net architecture, to ensure
correct CTC loss calculation and to stabilize network training. The
Uconv-Conformer architecture is not only faster in terms of training and
inference speed but also shows a better WER than the baseline Conformer. Our
best Uconv-Conformer model achieves 47.8% and 23.5% inference acceleration on
the CPU and GPU, respectively. The relative WER reduction is 7.3% and 9.2% on
LibriSpeech test_clean and test_other, respectively.
Comment: 5 pages, 1 figure
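The need for the U-Net-style upsampling can be illustrated with a toy length calculation (not from the paper; the frame rate, reduction factors, and transcript are assumptions): CTC requires at least one encoder frame per output label, plus a blank frame between repeated labels, so a 16x time reduction can leave too few frames for the loss to be computed.

```python
import math

def ctc_min_frames(labels):
    # CTC needs at least one frame per label, plus one extra frame
    # between each pair of identical consecutive labels (for the blank).
    extra = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + extra

def encoder_frames(input_frames, reduction):
    # Frames remaining after time reduction by the given factor.
    return math.ceil(input_frames / reduction)

# A 3-second utterance at 100 frames/s with a 25-character transcript:
T, labels = 300, list("hello world, how are you?")
assert encoder_frames(T, 4) >= ctc_min_frames(labels)   # standard 4x: fine
assert encoder_frames(T, 16) < ctc_min_frames(labels)   # 16x: too few frames
# Upsampling the deepest representations (as U-Net-style blocks do)
# restores enough frames for a valid CTC loss:
assert encoder_frames(T, 16) * 4 >= ctc_min_frames(labels)
```

The arithmetic shows why the intermediate layers can run at the reduced rate while the CTC output must be computed at a higher one.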
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Data augmentation is one of the most effective ways to make end-to-end
automatic speech recognition (ASR) perform close to the conventional hybrid
approach, especially when dealing with low-resource tasks. Using recent
advances in speech synthesis (text-to-speech, or TTS), we build our TTS system
on an ASR training database and then extend the data with synthesized speech to
train a recognition model. We argue that, when the training data amount is
relatively low, this approach can allow an end-to-end model to reach hybrid
systems' quality. For an artificial low-to-medium-resource setup, we compare
the proposed augmentation with the semi-supervised learning technique. We also
investigate the influence of vocoder usage on final ASR performance by
comparing the Griffin-Lim algorithm with our modified LPCNet. When combined
with an external language model, our approach outperforms a semi-supervised
setup on LibriSpeech test-clean and is only 33% worse than a comparable
supervised setup.
Our system establishes a competitive result for end-to-end ASR trained on the
LibriSpeech train-clean-100 set, with a WER of 4.3% for test-clean and 13.5%
for test-other.
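The augmentation recipe can be sketched schematically (a minimal illustration, not the paper's pipeline; `synthesize` is a hypothetical stand-in for the TTS-plus-vocoder system, and `synth_ratio` is an assumed knob):

```python
import random

def synthesize(text):
    """Stand-in for a TTS system trained on the ASR corpus itself
    (e.g. with an LPCNet-style vocoder); here it returns a dummy
    waveform of roughly 80 samples per character."""
    return [0.0] * (len(text) * 80)

def augment(real_pairs, extra_texts, synth_ratio=1.0):
    """Extend (audio, transcript) training pairs with synthesized speech.
    synth_ratio controls how many synthetic utterances are added
    relative to the real set (an assumed parameter, not from the paper)."""
    n_synth = int(len(real_pairs) * synth_ratio)
    texts = random.sample(extra_texts, min(n_synth, len(extra_texts)))
    return real_pairs + [(synthesize(t), t) for t in texts]

real = [([0.1, 0.2], "yes"), ([0.3], "no")]
mixed = augment(real, ["maybe", "hello there", "goodbye"], synth_ratio=1.0)
assert len(mixed) == 4  # two real + two synthetic utterances
```

The ASR model is then trained on the mixed set exactly as on real data, which is what lets the approach substitute for collecting more recordings.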
Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
With the rapid development of speech assistants, adapting server-oriented
automatic speech recognition (ASR) solutions to run directly on the device has
become crucial. Researchers and industry prefer to use end-to-end ASR systems for
on-device speech recognition tasks. This is because end-to-end systems can be
made resource-efficient while maintaining a higher quality compared to hybrid
systems. However, building end-to-end models requires a significant amount of
speech data. Another challenging task associated with speech assistants is
personalization, which mainly lies in handling out-of-vocabulary (OOV) words.
In this work, we consider building an effective end-to-end ASR system in
low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel
Georgian tasks. To address the aforementioned problems, we propose a method of
dynamic acoustic unit augmentation based on the BPE-dropout technique. It
non-deterministically tokenizes utterances to extend the tokens' contexts and
to regularize their distribution, improving the model's recognition of unseen words.
It also reduces the need for optimal subword vocabulary size search. The
technique provides a steady improvement in regular and personalized
(OOV-oriented) speech recognition tasks (at least a 6% relative WER reduction
and a 25% relative F-score improvement) at no additional computational cost.
Owing to the use of BPE-dropout, our monolingual Turkish Conformer establishes
a competitive result with a 22.2% character error rate (CER) and a 38.9% word
error rate (WER), close to the best published multilingual system.
Comment: 16 pages, 7 figures
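The core idea of BPE-dropout can be shown with a toy byte-pair encoder (a simplified sketch, not the paper's tokenizer; the merge table and dropout value are made up): each applicable merge is skipped with some probability, so the same word receives different subword segmentations across training epochs.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=random):
    """Toy BPE with merge dropout: each applicable merge is skipped
    with probability `dropout`, yielding stochastic segmentations."""
    tokens = list(word)
    for pair in merges:  # merges in priority order, e.g. ("l", "o")
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair and rng.random() >= dropout:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
            else:
                i += 1
    return tokens

merges = [("l", "o"), ("lo", "w")]
assert bpe_encode("low", merges, dropout=0.0) == ["low"]  # deterministic BPE
random.seed(0)
samples = {tuple(bpe_encode("low", merges, dropout=0.5)) for _ in range(50)}
assert len(samples) > 1  # dropout produces multiple segmentations
```

Because rare and unseen words are then seen under many segmentations, the model's subword distribution is regularized without changing the vocabulary, which is why no extra computational cost is incurred at inference.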
Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
Speaker diarization for real-life scenarios is an extremely challenging
problem. Widely used clustering-based diarization approaches perform rather
poorly in such conditions, mainly due to the limited ability to handle
overlapping speech. We propose a novel Target-Speaker Voice Activity Detection
(TS-VAD) approach, which directly predicts the activity of each speaker in each
time frame. The TS-VAD model takes conventional speech features (e.g., MFCC)
along with i-vectors for each speaker as inputs. A set of binary classification
output layers produces the activities of the individual speakers. I-vectors can
be estimated
iteratively, starting with a strong clustering-based diarization. We also
extend the TS-VAD approach to the multi-microphone case using a simple
attention mechanism on top of hidden representations extracted from the
single-channel TS-VAD model. Moreover, post-processing strategies for the
predicted speaker activity probabilities are investigated. Experiments on the
CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results,
outperforming the baseline x-vector-based system by more than 30% absolute
Diarization Error Rate (DER).
Comment: Accepted to Interspeech 2020
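The decoding step after the per-speaker sigmoid outputs can be sketched as follows (a schematic illustration only; thresholding with a minimum-duration filter is one simple post-processing strategy, and the threshold and probabilities here are invented):

```python
def decode_activities(probs, threshold=0.5, min_frames=2):
    """Turn per-speaker, per-frame activity probabilities (as the
    binary output layers of a TS-VAD-style model would emit) into
    (speaker, start_frame, end_frame) segments."""
    segments = []
    for spk, frame_probs in enumerate(probs):
        start = None
        for t, p in enumerate(frame_probs + [0.0]):  # sentinel to flush
            if p >= threshold and start is None:
                start = t  # speaker becomes active
            elif p < threshold and start is not None:
                if t - start >= min_frames:  # drop too-short segments
                    segments.append((spk, start, t))
                start = None
    return segments

probs = [
    [0.9, 0.8, 0.2, 0.1, 0.7, 0.9],  # speaker 0
    [0.1, 0.6, 0.7, 0.9, 0.2, 0.1],  # speaker 1 (overlaps speaker 0)
]
assert decode_activities(probs) == [(0, 0, 2), (0, 4, 6), (1, 1, 4)]
```

Because each speaker is decoded independently, overlapping speech (as at frame 1 above) is handled naturally, which is exactly what clustering-based diarization struggles with.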
Search for Gravitational-Neutrino Correlations on Ground-Based Detectors
The problem of joint data processing from ground-based gravitational and neutrino detectors is considered in order to increase the detection efficiency for collapsing objects in the Galaxy. The “neutrino-gravitational correlation” algorithm is developed within the framework of optimal filtering theory as applied to the well-known OGRAN and BUST facilities located at the BNO INR RAS. The experience of analyzing neutrino and gravitational data obtained during the outburst of supernova SN1987A is used. The sequential steps of the algorithm are presented, and formulas for estimating the statistical efficiency of a two-channel recorder are obtained.
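The basic step of a two-channel coincidence search can be sketched as follows (a schematic of the idea only; the actual OGRAN/BUST algorithm applies optimal filtering, and the event times and window are invented):

```python
def coincidences(grav_times, nu_times, window):
    """Return gravitational/neutrino event pairs whose trigger times
    fall within +/- window seconds of each other."""
    pairs = []
    for g in sorted(grav_times):
        for n in sorted(nu_times):
            if abs(g - n) <= window:
                pairs.append((g, n))
    return pairs

grav = [10.0, 55.2, 120.7]   # candidate triggers from the gravitational channel
nu = [10.4, 80.0, 121.0]     # candidate triggers from the neutrino channel
assert coincidences(grav, nu, window=0.5) == [(10.0, 10.4), (120.7, 121.0)]
```

For independent Poisson backgrounds with rates n1 and n2, the expected accidental coincidence rate is 2 * window * n1 * n2, which is what makes combining two channels suppress false alarms relative to either channel alone.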