7 research outputs found

    End-to-End Speech Recognition From the Raw Waveform

    State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015), and the second by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second concerns the low-pass filter used in these approaches. These modifications consistently improve performance for both approaches and remove the need for careful initialization of the scattering-based trainable filterbanks. In particular, we show a consistent improvement in word error rate of the trainable filterbanks relative to comparable mel-filterbanks. This is the first time end-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large-vocabulary task under clean recording conditions.
    Comment: Accepted for presentation at Interspeech 2018
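
    To make the front-end concrete, the sketch below shows a learnable convolutional filterbank followed by an instance normalization layer of the kind the abstract describes. It is a minimal PyTorch illustration under assumed sizes (40 filters, 25 ms kernels at 16 kHz) and an assumed rectify-and-log compression; it is not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class TrainableFilterbank(nn.Module):
        """Hypothetical raw-waveform front-end: conv filterbank + instance norm."""
        def __init__(self, n_filters=40, kernel_size=400, stride=160):
            super().__init__()
            # A learned 1-D convolution plays the role of the fixed mel filterbank.
            self.conv = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
            # Instance normalization: the first modification the paper proposes.
            self.norm = nn.InstanceNorm1d(n_filters)

        def forward(self, waveform):             # waveform: (batch, samples)
            x = waveform.unsqueeze(1)            # -> (batch, 1, samples)
            x = torch.log1p(self.conv(x).abs())  # rectify + log-compress (assumption)
            return self.norm(x)                  # -> (batch, n_filters, frames)

    feats = TrainableFilterbank()(torch.randn(8, 16000))  # one second at 16 kHz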

    Multi-scale Alignment and Contextual History for Attention Mechanism in Sequence-to-sequence Model

    A sequence-to-sequence model is a neural network module for mapping between two sequences of different lengths. It has three core modules: encoder, decoder, and attention. Attention is the bridge that connects the encoder and decoder modules and improves model performance in many tasks. In this paper, we propose two ideas for improving sequence-to-sequence model performance by enhancing the attention module. First, we maintain a history of the attention location and the expected context from several previous time-steps. Second, we apply multi-scale convolution over several previous attention vectors and feed the result to the current decoder state. We apply our proposed framework to sequence-to-sequence speech recognition and text-to-speech systems. The results reveal that our proposed extensions improve performance significantly compared to a standard attention baseline.
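
    As a rough illustration of the second idea, the sketch below convolves a stack of previous attention (alignment) vectors at several kernel widths and mixes the result into additive attention scores. The class name, dimensions, number of scales, and the tanh-based scoring are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class MultiScaleLocationAttention(nn.Module):
        def __init__(self, dim=256, history=3, scales=(3, 5, 7), channels=8):
            super().__init__()
            # One conv per scale, each reading the stack of `history` past alignments.
            self.convs = nn.ModuleList(
                [nn.Conv1d(history, channels, k, padding=k // 2) for k in scales])
            self.proj_enc = nn.Linear(dim, dim)
            self.proj_dec = nn.Linear(dim, dim)
            self.score = nn.Linear(dim + channels * len(scales), 1)

        def forward(self, enc, dec, past_aligns):
            # enc: (B, T, dim); dec: (B, dim); past_aligns: (B, history, T)
            loc = torch.cat([c(past_aligns) for c in self.convs], dim=1)  # (B, C*S, T)
            feats = torch.cat(
                [torch.tanh(self.proj_enc(enc) + self.proj_dec(dec).unsqueeze(1)),
                 loc.transpose(1, 2)], dim=-1)                            # (B, T, dim+C*S)
            align = torch.softmax(self.score(feats).squeeze(-1), dim=-1)  # (B, T)
            context = (align.unsqueeze(-1) * enc).sum(dim=1)              # (B, dim)
            return context, align

    att = MultiScaleLocationAttention()
    ctx, a = att(torch.randn(2, 50, 256), torch.randn(2, 256),
                 torch.randn(2, 3, 50).softmax(dim=-1))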

    An Analytical Survey of End-to-End Speech Recognition Systems (АналитичСский ΠΎΠ±Π·ΠΎΡ€ ΠΈΠ½Ρ‚Π΅Π³Ρ€Π°Π»ΡŒΠ½Ρ‹Ρ… систСм распознавания Ρ€Π΅Ρ‡ΠΈ)

    This article presents an analytical survey of various end-to-end speech recognition systems, as well as approaches to their construction, training, and optimization. We consider models based on connectionist temporal classification (CTC) as the loss function for a neural network, models based on an encoder-decoder architecture with an attention mechanism, and models using conditional random fields (CRF). CRFs generalize hidden Markov models and thereby address several shortcomings of standard hybrid speech recognition systems, such as the assumption that elements of the input sequence of speech sounds are independent random variables. We also describe possibilities for integration with language models at the decoding stage, which significantly reduces recognition error rates for end-to-end models. Various modifications and improvements of standard end-to-end architectures are described, such as generalizations of connectionist temporal classification and the use of regularization in attention-based models. A survey of research in this subject area shows that end-to-end systems achieve results close to those of state-of-the-art hybrid models, while using a simpler configuration and operating faster both during training and during decoding. Finally, we review the most popular and actively developed libraries and toolkits for building end-to-end speech recognition systems, such as TensorFlow, Eesen, and Kaldi, and compare them in terms of the simplicity and accessibility of their use for implementing end-to-end systems.
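
    For reference, CTC training, one of the end-to-end approaches the survey covers, can be expressed in a few lines. The sketch below uses torch.nn.CTCLoss with illustrative sizes; the random tensors stand in for the outputs of an acoustic model.

    import torch
    import torch.nn as nn

    vocab_size = 29                 # e.g. 26 letters + apostrophe + space + blank
    T, B, U = 100, 4, 20            # frames, batch size, max target length

    # Per-frame log-probabilities over the vocabulary, as an acoustic model would emit.
    log_probs = torch.randn(T, B, vocab_size, requires_grad=True).log_softmax(dim=-1)
    targets = torch.randint(1, vocab_size, (B, U))   # index 0 is reserved for blank
    input_lengths = torch.full((B,), T)
    target_lengths = torch.randint(10, U + 1, (B,))

    # CTC marginalizes over all alignments between frames and target labels,
    # removing the frame-level alignment step used by hybrid HMM systems.
    ctc = nn.CTCLoss(blank=0)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()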

    Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review

    Artificial Neural Networks (ANNs) were created inspired by the neural networks of the human brain and have been widely applied in speech processing. Application areas of ANNs include speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, among others. Likewise, given that speech processing performed by humans involves complex cognitive processes known as auditory attention, a growing number of papers have proposed ANNs supported by deep learning algorithms in conjunction with some mechanism intended to mirror the human attention process. However, while these ANN approaches include attention, there is no categorization of the attention mechanisms integrated into the deep learning algorithms or of their relation to human auditory attention. Therefore, we consider it necessary to review the different attention-inspired ANN approaches, to show both academic and industry experts the available models for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000 in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper, 133 research works are selected and the following aspects are described: (i) the most relevant features, (ii) the ways in which attention has been implemented, (iii) their hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most closely related to human attention are analyzed and their strengths and weaknesses determined.