62 research outputs found

    Multitask Learning with CTC and Segmental CRF for Speech Recognition

    Segmental conditional random fields (SCRFs) and connectionist temporal classification (CTC) are two sequence labeling methods used for end-to-end training of speech recognition models. Both models define a transcription probability by marginalizing decisions about latent segmentation alternatives to derive a sequence probability: the former uses a globally normalized joint model of segment labels and durations, and the latter classifies each frame as either an output symbol or a "continuation" of the previous label. In this paper, we train a recognition model by optimizing an interpolation between the SCRF and CTC losses, where the same recurrent neural network (RNN) encoder is used for feature extraction for both outputs. We find that this multitask objective improves recognition accuracy when decoding with either the SCRF or CTC models. Additionally, we show that CTC can also be used to pretrain the RNN encoder, which improves the convergence rate when learning the joint model. Comment: 5 pages, 2 figures, camera ready version at Interspeech 201
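
The marginalization over latent alignments and the interpolated objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the standard blank-symbol formulation of CTC (the abstract describes the per-frame choice as a "continuation" label), works on a toy table of per-frame probabilities rather than RNN outputs, and treats the SCRF loss as an externally supplied number. The function names and the weight `lam` are hypothetical.

```python
import math
from itertools import product


def ctc_probability(probs, target, blank=0):
    """Sum the probability of every frame-level path that collapses to `target`.

    probs[t][k] is the (assumed) per-frame probability of symbol k at frame t.
    """
    # Interleave blanks: [blank, y1, blank, y2, ..., blank].
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    n_states = len(ext)

    alpha = [0.0] * n_states
    alpha[0] = probs[0][ext[0]]
    if n_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * n_states
        for s in range(n_states):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip a blank between distinct labels
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    return alpha[-1] + (alpha[-2] if n_states > 1 else 0.0)


def collapse(path, blank=0):
    """Merge repeated symbols, then drop blanks (the CTC collapsing rule)."""
    out, prev = [], None
    for k in path:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out


def brute_force_probability(probs, target, blank=0):
    """Reference answer by enumerating every path; only feasible for toy sizes."""
    n_symbols = len(probs[0])
    total = 0.0
    for path in product(range(n_symbols), repeat=len(probs)):
        if collapse(path, blank) == list(target):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return total


def interpolated_loss(scrf_nll, ctc_nll, lam):
    """Multitask objective: lam * SCRF loss + (1 - lam) * CTC loss."""
    return lam * scrf_nll + (1.0 - lam) * ctc_nll
```

With a shared encoder, both loss terms would be computed from the same features, so the gradient of the interpolated loss updates the encoder from both tasks at once.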

    End-to-end neural segmental models for speech recognition

    Segmental models are an alternative to frame-based models for sequence prediction, where hypothesized path weights are based on entire segment scores rather than a single frame at a time. Neural segmental models are segmental models that use neural network-based weight functions. Neural segmental models have achieved competitive results for speech recognition, and their end-to-end training has been explored in several studies. In this work, we review neural segmental models, which can be viewed as consisting of a neural network-based acoustic encoder and a finite-state transducer decoder. We study end-to-end segmental models with different weight functions, including ones based on frame-level neural classifiers and on segmental recurrent neural networks. We study how reducing the search space size impacts performance under different weight functions. We also compare several loss functions for end-to-end training. Finally, we explore training approaches, including multi-stage vs. end-to-end training and multitask training that combines segmental and frame-level losses
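
To make "path weights based on entire segment scores" concrete, here is a small sketch of a search over segmentations by dynamic programming. It is an illustration under stated assumptions, not the models studied in the paper: the segment score is a hand-written toy function instead of a neural weight function, there are no label-to-label transition weights, and `max_dur` stands in for one way of reducing the search space. All names are hypothetical.

```python
def segmental_viterbi(n_frames, labels, score, max_dur):
    """Return (best total score, segments) over all segmentations of the frames.

    score(start, end, label) scores the whole segment covering frames
    start..end-1; max_dur caps the segment length.
    """
    best = [float("-inf")] * (n_frames + 1)
    back = [None] * (n_frames + 1)
    best[0] = 0.0

    for end in range(1, n_frames + 1):
        for dur in range(1, min(max_dur, end) + 1):
            start = end - dur
            for label in labels:
                candidate = best[start] + score(start, end, label)
                if candidate > best[end]:
                    best[end] = candidate
                    back[end] = (start, label)

    # Follow back-pointers from the last frame to recover the segments.
    segments, end = [], n_frames
    while end > 0:
        start, label = back[end]
        segments.append((start, end, label))
        end = start
    segments.reverse()
    return best[n_frames], segments


# Toy data: each frame "looks like" one label; real systems would use
# encoder outputs here.
FRAMES = ["a", "a", "b", "b", "b"]


def toy_score(start, end, label):
    """+1 per matching frame, -1 per mismatch, minus 0.5 per segment."""
    match = sum(1 if FRAMES[t] == label else -1 for t in range(start, end))
    return match - 0.5


best_score, best_segments = segmental_viterbi(len(FRAMES), ["a", "b"], toy_score, max_dur=3)
```

Lowering `max_dur` shrinks the number of candidate segments per end point, at the cost of ruling out long segments; in this toy example a cap of 2 forces the three "b" frames to be split.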

    Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition

    Recently, the connectionist temporal classification (CTC) model, coupled with recurrent (RNN) or convolutional neural networks (CNN), has made it easier to train speech recognition systems in an end-to-end fashion. However, in real-valued models, time frame components such as mel-filter-bank energies and the cepstral coefficients obtained from them, together with their first and second order derivatives, are processed as individual elements, while a natural alternative is to process such components as composed entities. We propose to group such elements in the form of quaternions and to process these quaternions using the established quaternion algebra. Quaternion numbers and quaternion neural networks have proven efficient at processing multidimensional inputs as entities, encoding internal dependencies, and solving many tasks with fewer learning parameters than real-valued models. This paper proposes to integrate multiple feature views in a quaternion-valued convolutional neural network (QCNN), to be used for sequence-to-sequence mapping with the CTC model. Promising results are reported using simple QCNNs in phoneme recognition experiments with the TIMIT corpus. More precisely, QCNNs obtain a lower phoneme error rate (PER) with fewer learning parameters than a competing model based on real-valued CNNs. Comment: Accepted at INTERSPEECH 201
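
The "established quaternion algebra" mentioned above centers on the Hamilton product, which can be sketched in a few lines. The grouping of one energy value with its first and second derivatives into a quaternion with zero real part is an assumption made here for illustration; the abstract does not spell out the exact layout, and the function names are hypothetical.

```python
def hamilton_product(p, q):
    """Multiply two quaternions given as (r, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


def frame_to_quaternion(energy, delta, delta_delta):
    """Pack one feature and its two derivatives into one quaternion.

    Assumed layout: zero real part, then energy, delta, delta-delta.
    """
    return (0.0, energy, delta, delta_delta)


# One quaternion weight (4 real numbers) transforms all 4 input components.
weight = (0.5, -0.1, 0.3, 0.2)
feature = frame_to_quaternion(1.0, 0.2, -0.4)
output = hamilton_product(weight, feature)
```

A real-valued layer mapping 4 inputs to 4 outputs needs a 4x4 matrix of 16 independent weights, whereas the Hamilton product reuses the same 4 numbers across all output components. This weight sharing is where the reduction in parameter count comes from.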

    Analytical survey of end-to-end speech recognition systems

    This article presents an analytic survey of various end-to-end speech recognition systems, as well as some approaches to their construction and optimization. We consider models based on connectionist temporal classification (CTC), models based on an encoder-decoder architecture with an attention mechanism, and models using conditional random fields (CRF). We also describe possibilities for integration with language models at the decoding stage. We see that such an approach significantly reduces recognition error rates for end-to-end models. A survey of research works in this subject area reveals that end-to-end systems achieve results close to those of state-of-the-art hybrid models. Moreover, end-to-end models use a simple configuration and demonstrate high training and decoding speed. In addition, we consider popular frameworks and toolkits for creating speech recognition systems.
    Russian abstract (translated): An analytical survey is presented of the varieties of end-to-end systems for speech recognition and of the methods for building, training, and optimizing them. We consider models based on connectionist temporal classification (CTC) as a loss function for a neural network, models based on the attention mechanism, and encoder-decoder models. We also consider neural networks built using conditional random fields (CRF), which generalize hidden Markov models and thereby make it possible to correct many shortcomings of standard hybrid speech recognition systems, for example the assumption that the elements of input speech-sound sequences are independent random variables. Possibilities for integration with language models at the decoding stage are also described; these demonstrate a substantial reduction in recognition error for end-to-end models. Various modifications and improvements of standard end-to-end speech recognition architectures are described, such as a generalization of connectionist classification and the use of regularization in attention-based models. A review of research in this subject area shows that end-to-end speech recognition systems can achieve results comparable to those of standard systems that use hidden Markov models, but with a simpler configuration and faster operation of the recognition system in both training and decoding. The most popular and actively developing libraries and toolkits for building end-to-end speech recognition systems are considered, such as TensorFlow, Eesen, Kaldi, and others. The described toolkits are compared by the criteria of simplicity and accessibility of use for implementing end-to-end speech recognition systems.

    Towards End-to-End Speech Recognition

    Standard automatic speech recognition (ASR) systems follow a divide and conquer approach to convert speech into text. In other words, the end goal is achieved by a combination of sub-tasks, namely, feature extraction, acoustic modeling and sequence decoding, which are optimized in an independent manner. More recently, in the machine learning community deep learning approaches have emerged which allow training of systems in an end-to-end manner. Such approaches have found success in the natural language processing and computer vision communities, and have consequently piqued interest in the speech community. The present thesis builds on these recent advances to investigate approaches to develop speech recognition systems in an end-to-end manner. In that respect, the thesis follows two main axes of research. The first axis of research focuses on joint learning of features and classifiers for acoustic modeling. The second axis of research focuses on joint modeling of the acoustic model and the decoder. Along the first axis of research, in the framework of hybrid hidden Markov model/artificial neural network (HMM/ANN) based ASR, we develop a convolutional neural network (CNN) based acoustic modeling approach that takes the raw speech signal as input and estimates phone class conditional probabilities. Specifically, the CNN has several convolution layers (feature stage) followed by a multilayer perceptron (classifier stage), which are jointly optimized during training. Through ASR studies on multiple languages and extensive analysis of the approach, we show that the proposed approach, with minimal prior knowledge, is able to automatically learn the relevant features from the raw speech signal. This approach yields systems that have fewer parameters and achieve better performance when compared to the conventional approach of cepstral feature extraction followed by classifier training.
    As the features are automatically learned from the signal, a natural question that arises is: are such systems robust to noise? To that end, we propose a robust CNN approach, referred to as the normalized CNN approach, which yields systems that are as robust as or better than conventional ASR systems using cepstral features (with feature level normalizations). The second axis of research focuses on end-to-end sequence-to-sequence conversion. We first propose an end-to-end phoneme recognition system. In this system the relevant features, the classifier and the decoder (based on conditional random fields) are jointly modeled during training. We demonstrate the viability of the approach on the TIMIT phoneme recognition task. Building on top of that, we investigate a "weakly supervised" training that alleviates the necessity for frame level alignments. Finally, we extend the weakly supervised approach to propose a novel keyword spotting technique. In this technique, a CNN first processes the input observation sequence to output word level scores, which are subsequently aggregated to detect or spot words. We demonstrate the potential of the approach through a comparative study on LibriSpeech with the standard approach of keyword spotting based on lattice indexing using an ASR system

    SynJax: Structured Probability Distributions for JAX

    The development of deep learning software libraries enabled significant progress in the field by allowing users to focus on modeling, while letting the library take care of the tedious and time-consuming task of optimizing execution for modern hardware accelerators. However, this has benefited only particular types of deep learning models, such as Transformers, whose primitives map easily to vectorized computation. The models that explicitly account for structured objects, such as trees and segmentations, did not benefit equally because they require custom algorithms that are difficult to implement in a vectorized form. SynJax directly addresses this problem by providing an efficient vectorized implementation of inference algorithms for structured distributions covering alignment, tagging, segmentation, constituency trees and spanning trees. With SynJax we can build large-scale differentiable models that explicitly model structure in the data. The code is available at https://github.com/deepmind/synjax
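
As a rough idea of what an "inference algorithm for a structured distribution" computes, the sketch below evaluates the log-partition function of a linear-chain tagging model with the forward algorithm and checks it against brute-force enumeration. It deliberately does not use SynJax or reproduce its API; it is plain, unvectorized Python with hypothetical function names, meant only to show the kind of custom dynamic program that such a library packages in vectorized form.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(unary, trans):
    """Forward algorithm for a linear-chain model.

    unary[t][y] scores tag y at position t; trans[p][y] scores moving
    from tag p to tag y. Returns log of the sum over all tag sequences
    of exp(total score).
    """
    n_tags = len(unary[0])
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [
            unary[t][y] + logsumexp([alpha[p] + trans[p][y] for p in range(n_tags)])
            for y in range(n_tags)
        ]
    return logsumexp(alpha)


def brute_force_log_partition(unary, trans):
    """Same quantity by enumerating every tag sequence (toy sizes only)."""
    n_tags, length = len(unary[0]), len(unary)
    scores = []
    for tags in product(range(n_tags), repeat=length):
        total = sum(unary[t][y] for t, y in enumerate(tags))
        total += sum(trans[a][b] for a, b in zip(tags, tags[1:]))
        scores.append(total)
    return logsumexp(scores)
```

The forward pass costs time linear in the sequence length, while the enumeration grows exponentially. The inner loops over tags are the part that a vectorized implementation would replace with array operations.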