Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation
The neural transducer is an end-to-end model for automatic speech recognition
(ASR). While the model is well-suited for streaming ASR, the training process
remains challenging. During training, the memory requirements may quickly
exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence
lengths. In this work, we analyze the time and space complexity of a typical
transducer training setup. We propose a memory-efficient training method that
computes the transducer loss and gradients sample by sample. We present
optimizations to increase the efficiency and parallelism of the sample-wise
method. In a set of thorough benchmarks, we show that our sample-wise method
significantly reduces memory usage, and performs at competitive speed when
compared to the default batched computation. As a highlight, we manage to
compute the transducer loss and gradients for a batch size of 1024, and audio
length of 40 seconds, using only 6 GB of memory. Comment: 5 pages, 4 figures, 1 table, 1 algorithm
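The core idea of the sample-wise method can be illustrated with a minimal sketch: computing the loss and gradient one sample at a time and accumulating, so that peak memory scales with a single sample's loss lattice rather than the whole batch's. The sketch below uses a toy mean-squared-error loss as a stand-in for the actual transducer loss, and all function names are illustrative, not the paper's implementation.

```python
import numpy as np

def batched_loss_and_grad(w, x, y):
    # Stand-in for batched transducer training: the entire batch's
    # loss computation lives in memory at once (here, a batched MSE
    # of a linear model instead of the full T x U transducer lattice).
    err = x @ w - y
    return np.mean(err ** 2), 2.0 * x.T @ err / len(y)

def samplewise_loss_and_grad(w, x, y):
    # Sample-wise variant: evaluate the loss and gradient for one
    # sample at a time and accumulate. Peak memory now scales with a
    # single sample's computation instead of the whole batch's.
    total_loss, grad = 0.0, np.zeros_like(w)
    for i in range(len(y)):
        e = x[i] @ w - y[i]          # scalar residual for sample i
        total_loss += e ** 2
        grad += 2.0 * e * x[i]
    return total_loss / len(y), grad / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = x @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
loss_b, grad_b = batched_loss_and_grad(w, x, y)
loss_s, grad_s = samplewise_loss_and_grad(w, x, y)
# Both routes yield the same mean loss and gradient up to float
# rounding, which is why the sample-wise computation can replace the
# batched one without changing training behavior.
```

Because the averaged loss is a sum over per-sample terms, the accumulated sample-wise gradient matches the batched gradient exactly; the trade-off is purely speed versus memory, which is what the paper's parallelism optimizations address.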
The CMU GALE Speech-to-Text System
Abstract This paper describes the latest Speech-to-Text system developed for the Global Autonomous Language Exploitation ("GALE") domain by Carnegie Mellon University (CMU). This system uses discriminative training, bottleneck features and other techniques that were not used in previous versions of our system, and is trained on 1150 hours of data from a variety of Arabic speech sources. In this paper, we show how different lexica, pre-processing, and system combination techniques can be used to improve the final output, and provide analysis of the improvements achieved by the individual techniques.
Metabolic changes in patients with arterial hypertension during rational antihypertensive pharmacotherapy
HYPERTENSION / DRUG THERAPY; BLOOD VESSEL DISEASES; METABOLIC DISEASES; METABOLIC SYNDROME X; ANTIHYPERTENSIVE AGENTS; DRUG THERAPY
Eleven generations of selection for the duration of fertility in the intergeneric crossbreeding of ducks
A 12-generation selection experiment involving a selected line (S) and a control line (C) has been conducted since 1992 with the aim of increasing the number of fertile eggs laid by the Brown Tsaiya duck after a single artificial insemination (AI) with pooled Muscovy semen. On average, 28.9% of the females and 17.05% of the males were selected. The selection responses and the predicted responses showed similar trends. The average predicted genetic responses per generation, in genetic standard deviation units, were 0.40 for the number of fertile eggs, 0.45 for the maximum duration of fertility, and 0.32 for the number of hatched mule ducklings. The fertility rates for days 2–8 after AI were 89.14% in the S line and 61.46% in the C line. Embryo viability was not impaired by this selection. The largest increase in fertility rate per day after a single AI was observed from day 5 to day 11. In G12, the fertility rate in the selected line was 91% at day 2, 94% at day 3, and 92% at days 4 and 5, then decreased to 81% at day 8, 75% at day 9, 58% at day 10 and 42% at day 11. In contrast, the fertility rate in the control line showed an abrupt decrease from day 4 (74%). The same tendencies were observed for hatchability relative to the number of eggs set. It was concluded that selection for the number of fertile eggs after a single AI with pooled Muscovy semen can effectively increase the duration of the fertile period in ducks, and that research should now focus on ways to improve the viability of the hybrid mule duck embryo.
Variable Attention Masking for Configurable Transformer Transducer Speech Recognition
This work studies the use of attention masking in transformer transducer
based speech recognition for building a single configurable model for different
deployment scenarios. We present a comprehensive set of experiments comparing
fixed masking, where the same attention mask is applied at every frame, with
chunked masking, where the attention mask for each frame is determined by chunk
boundaries, in terms of recognition accuracy and latency. We then explore the
use of variable masking, where the attention masks are sampled from a target
distribution at training time, to build models that can work in different
configurations. Finally, we investigate how a single configurable model can be
used to perform both first pass streaming recognition and second pass acoustic
rescoring. Experiments show that chunked masking achieves a better accuracy vs
latency trade-off compared to fixed masking, both with and without FastEmit. We
also show that variable masking improves the accuracy by up to 8% relative in
the acoustic rescoring scenario. Comment: 5 pages, 4 figures, 2 tables
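The masking variants above can be made concrete with a small sketch of a chunked attention mask: each frame attends within its own chunk and to preceding frames, but never to future frames beyond its chunk boundary. The function name, the left-context parameter, and the boolean-mask convention are illustrative assumptions; the paper does not specify its exact masking implementation. Fixed masking corresponds to one such mask shape used at every training step, while variable masking would sample `chunk_size` from a target distribution at training time.

```python
import numpy as np

def chunked_attention_mask(num_frames, chunk_size, num_left_chunks=-1):
    # mask[i, j] is True when frame i may attend to frame j: each frame
    # sees its whole chunk and everything to the left, but no frames
    # past the current chunk boundary (no future context leaks in).
    idx = np.arange(num_frames)
    chunk_of = idx // chunk_size
    chunk_end = (chunk_of + 1) * chunk_size       # exclusive right limit
    mask = idx[None, :] < chunk_end[:, None]
    if num_left_chunks >= 0:
        # Optionally cap the left context at num_left_chunks whole
        # chunks, bounding per-frame attention cost for streaming.
        left_start = np.maximum(chunk_of - num_left_chunks, 0) * chunk_size
        mask &= idx[None, :] >= left_start[:, None]
    return mask

m = chunked_attention_mask(6, 2)
# With chunk_size=2: frame 0 sees frames 0-1 (its own chunk) only,
# frame 2 sees frames 0-3, and frame 4 sees all of frames 0-5.
```

Larger chunks admit more future context (better accuracy, higher latency); smaller chunks tighten latency, which is the accuracy-versus-latency trade-off the experiments measure.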