    Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation

    The neural transducer is an end-to-end model for automatic speech recognition (ASR). While the model is well-suited for streaming ASR, the training process remains challenging. During training, the memory requirements may quickly exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence lengths. In this work, we analyze the time and space complexity of a typical transducer training setup. We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample. We present optimizations to increase the efficiency and parallelism of the sample-wise method. In a set of thorough benchmarks, we show that our sample-wise method significantly reduces memory usage, and performs at competitive speed when compared to the default batched computation. As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024, and audio length of 40 seconds, using only 6 GB of memory. Comment: 5 pages, 4 figures, 1 table, 1 algorithm
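    The core idea above — trading batched computation for a per-sample loop so that peak memory scales with one sample rather than the whole batch — can be sketched in a few lines. The paper's actual loss is the transducer (RNN-T) loss over a (time × label) lattice; as a hedged stand-in, the sketch below uses a simple quadratic loss, and the function names are illustrative, not the authors' API. The key invariant is that the accumulated sample-wise gradient matches the batched gradient exactly.

    ```python
    import numpy as np

    def batched_loss_grad(W, X, Y):
        # Batched: materializes predictions and residuals for the
        # whole batch at once -> peak memory grows with batch size.
        P = X @ W                       # (B, T) predictions
        diff = P - Y
        loss = 0.5 * np.sum(diff ** 2) / diff.size
        grad = X.T @ diff / diff.size   # dL/dW, shape (D, T)
        return loss, grad

    def samplewise_loss_grad(W, X, Y):
        # Sample-wise: process one sample at a time and accumulate
        # loss and gradient -> peak memory is one sample's worth of
        # intermediates, independent of batch size.
        total_loss = 0.0
        grad = np.zeros_like(W)
        n = Y.size
        for x, y in zip(X, Y):
            diff = x @ W - y                  # (T,) residual for one sample
            total_loss += 0.5 * np.sum(diff ** 2)
            grad += np.outer(x, diff)         # this sample's dL/dW contribution
        return total_loss / n, grad / n
    ```

    Because the gradients are mathematically identical, the trade-off is purely time (a Python-level loop) versus space — which is why the paper's optimizations focus on recovering parallelism within the sample-wise scheme.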

    The CMU GALE Speech-to-Text System

    This paper describes the latest Speech-to-Text system developed for the Global Autonomous Language Exploitation ("GALE") domain by Carnegie Mellon University (CMU). This system uses discriminative training, bottle-neck features and other techniques that were not used in previous versions of our system, and is trained on 1150 hours of data from a variety of Arabic speech sources. In this paper, we show how different lexica, pre-processing, and system combination techniques can be used to improve the final output, and provide analysis of the improvements achieved by the individual techniques.

    Eleven generations of selection for the duration of fertility in the intergeneric crossbreeding of ducks

    A 12-generation selection experiment involving a selected line (S) and a control line (C) has been conducted since 1992 with the aim of increasing the number of fertile eggs laid by the Brown Tsaiya duck after a single artificial insemination (AI) with pooled Muscovy semen. On average, 28.9% of the females and 17.05% of the males were selected. The selection responses and the predicted responses showed similar trends. The average predicted genetic responses per generation, in genetic standard deviation units, were 0.40 for the number of fertile eggs, 0.45 for the maximum duration of fertility, and 0.32 for the number of hatched mule ducklings. The fertility rates for days 2–8 after AI were 89.14% in the S line and 61.46% in the C line. Embryo viability was not impaired by this selection. The largest increase in fertility rate per day after a single AI was observed from d5 to d11. In G12, the fertility rate in the selected line was 91% at d2, 94% at d3, 92% at days 4 and 5, then decreased to 81% at d8, 75% at d9, 58% at d10 and 42% at d11. In contrast, the fertility rate in the control line showed an abrupt decrease from d4 (74%). The same tendencies were observed for the hatchability of eggs set. It was concluded that selection for the number of fertile eggs after a single AI with pooled Muscovy semen could effectively increase the duration of the fertile period in ducks, and that research should now be focused on ways to improve the viability of the hybrid mule duck embryo.

    Variable Attention Masking for Configurable Transformer Transducer Speech Recognition

    This work studies the use of attention masking in transformer transducer based speech recognition for building a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed masking, where the same attention mask is applied at every frame, with chunked masking, where the attention mask for each frame is determined by chunk boundaries, in terms of recognition accuracy and latency. We then explore the use of variable masking, where the attention masks are sampled from a target distribution at training time, to build models that can work in different configurations. Finally, we investigate how a single configurable model can be used to perform both first pass streaming recognition and second pass acoustic rescoring. Experiments show that chunked masking achieves a better accuracy vs latency trade-off compared to fixed masking, both with and without FastEmit. We also show that variable masking improves the accuracy by up to 8% relative in the acoustic re-scoring scenario. Comment: 5 pages, 4 figures, 2 tables
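    The chunked masking described above can be made concrete with a small mask-construction sketch. This is an assumed, minimal formulation (the function name, `chunk_size`, and `num_left_chunks` parameters are illustrative, not taken from the paper): each frame may attend to keys in its own chunk and in a fixed number of preceding chunks, which bounds latency by the chunk size.

    ```python
    import numpy as np

    def chunked_attention_mask(seq_len, chunk_size, num_left_chunks=1):
        # Boolean (seq_len, seq_len) mask; True = attention allowed.
        # A query frame in chunk q may attend to key frames in chunks
        # [q - num_left_chunks, q] -- its own chunk plus limited left context.
        idx = np.arange(seq_len)
        chunk_id = idx // chunk_size
        q = chunk_id[:, None]   # chunk index of each query frame
        k = chunk_id[None, :]   # chunk index of each key frame
        return (k <= q) & (k >= q - num_left_chunks)
    ```

    Under this formulation, "variable masking" would amount to sampling `chunk_size` (and possibly `num_left_chunks`) from a target distribution at each training step, so the same weights learn to operate across the latency configurations seen at deployment.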