62 research outputs found
Multitask Learning with CTC and Segmental CRF for Speech Recognition
Segmental conditional random fields (SCRFs) and connectionist temporal
classification (CTC) are two sequence labeling methods used for end-to-end
training of speech recognition models. Both define the probability of a
transcription by marginalizing over latent segmentation alternatives: the
former uses a globally normalized joint
model of segment labels and durations, and the latter classifies each frame as
either an output symbol or a "continuation" of the previous label. In this
paper, we train a recognition model by optimizing an interpolation between the
SCRF and CTC losses, where the same recurrent neural network (RNN) encoder is
used for feature extraction for both outputs. We find that this multitask
objective improves recognition accuracy when decoding with either the SCRF or
CTC models. Additionally, we show that CTC can also be used to pretrain the RNN
encoder, which improves the convergence rate when learning the joint model.
Comment: 5 pages, 2 figures, camera-ready version at Interspeech 2017
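To make the objective concrete, here is a minimal PyTorch sketch of an interpolated multitask loss over a shared encoder, in the spirit of the abstract above. The SCRF term is an assumed hook (scrf_loss), since the segmental loss is not a standard library call, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultitaskEncoder(nn.Module):
    def __init__(self, n_feats=40, n_hidden=256, n_labels=48):
        super().__init__()
        # Shared RNN encoder feeding both output layers.
        self.rnn = nn.LSTM(n_feats, n_hidden, num_layers=3,
                           bidirectional=True, batch_first=True)
        self.ctc_head = nn.Linear(2 * n_hidden, n_labels + 1)  # +1: CTC blank (index 0)
        self.scrf_head = nn.Linear(2 * n_hidden, n_labels)

    def forward(self, x):
        h, _ = self.rnn(x)                     # (batch, time, 2*hidden)
        return self.ctc_head(h), self.scrf_head(h)

def multitask_loss(model, x, targets, in_lens, tgt_lens, lam=0.5):
    ctc_logits, scrf_feats = model(x)
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)  # (time, batch, vocab)
    l_ctc = nn.functional.ctc_loss(log_probs, targets, in_lens, tgt_lens)
    l_scrf = scrf_loss(scrf_feats, targets, in_lens, tgt_lens)  # assumed hook
    # Interpolation of the two losses over the shared encoder.
    return lam * l_ctc + (1.0 - lam) * l_scrf
```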
End-to-end neural segmental models for speech recognition
Segmental models are an alternative to frame-based models for sequence
prediction, where hypothesized path weights are based on entire segment scores
rather than a single frame at a time. Neural segmental models are segmental
models that use neural network-based weight functions. Neural segmental models
have achieved competitive results for speech recognition, and their end-to-end
training has been explored in several studies. In this work, we review neural
segmental models, which can be viewed as consisting of a neural network-based
acoustic encoder and a finite-state transducer decoder. We study end-to-end
segmental models with different weight functions, including ones based on
frame-level neural classifiers and on segmental recurrent neural networks. We
study how reducing the search space size impacts performance under different
weight functions. We also compare several loss functions for end-to-end
training. Finally, we explore training approaches, including multi-stage vs.
end-to-end training and multitask training that combines segmental and
frame-level losses.
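As a concrete illustration of the segment-level marginalization these models rely on, the following numpy sketch computes the log-partition over all segmentations up to a maximum duration; scores is an assumed precomputed array of segment weights, not any particular weight function from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def segmental_log_partition(scores, T, max_dur):
    """Log-sum over all segmentations of the total segment weight.

    scores[s, t, y]: weight of a segment spanning frames [s, t) with label y,
    so scores has shape (T, T + 1, n_labels).
    """
    alpha = np.full(T + 1, -np.inf)
    alpha[0] = 0.0  # empty prefix
    for t in range(1, T + 1):
        # A segment ending at t may start at any s within max_dur frames.
        starts = range(max(0, t - max_dur), t)
        terms = [alpha[s] + logsumexp(scores[s, t]) for s in starts]
        alpha[t] = logsumexp(terms)
    return alpha[T]
```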
Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition
Recently, the connectionist temporal classification (CTC) model, coupled with
recurrent (RNN) or convolutional neural networks (CNN), has made it easier to
train speech recognition systems in an end-to-end fashion. However, in
real-valued models, time-frame components such as mel-filter-bank energies and
the cepstral coefficients obtained from them, together with their first- and
second-order derivatives, are processed as individual elements, while a natural
alternative is to process such components as composed entities. We propose to
group such elements in the form of quaternions and to process these quaternions
using the established quaternion algebra. Quaternion numbers and quaternion
neural networks have shown their ability to process multidimensional inputs as
entities, to encode internal dependencies, and to solve many tasks with fewer
learnable parameters than real-valued models. This paper proposes to integrate
multiple feature views in a quaternion-valued convolutional neural network
(QCNN), to be used for sequence-to-sequence mapping with the CTC model.
Promising results are reported using simple QCNNs in phoneme recognition
experiments on the TIMIT corpus. More precisely, QCNNs obtain a lower phoneme
error rate (PER) with fewer learnable parameters than a competing model based
on real-valued CNNs.
Comment: Accepted at INTERSPEECH 2018
An Analytical Survey of End-to-End Speech Recognition Systems
This article presents an analytical survey of various end-to-end speech recognition systems and of approaches to their construction, training, and optimization. We consider models based on connectionist temporal classification (CTC), models based on an encoder-decoder architecture with an attention mechanism, and models using conditional random fields (CRFs), which generalize hidden Markov models and thereby remove several limiting assumptions of standard hybrid systems, such as the assumption that the elements of the input speech-sound sequences are independent random variables. We also describe options for integrating language models at the decoding stage, an approach that substantially reduces recognition error rates for end-to-end models, and we survey modifications and improvements of the standard end-to-end architectures, such as generalizations of connectionist temporal classification and the use of regularization in attention-based models. A survey of research in this area shows that end-to-end systems achieve results close to those of state-of-the-art hybrid systems based on hidden Markov models, while using simpler configurations and running faster in both learning and decoding. Finally, we review the most popular and actively developed frameworks and toolkits for building end-to-end speech recognition systems, such as TensorFlow, Eesen, and Kaldi, and compare them by the simplicity and accessibility of their use.
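As an illustration of the decoding-time language-model integration the survey highlights, here is a minimal Python sketch of shallow fusion in beam search; am_next and lm_log_prob are assumed hooks standing in for the acoustic model and the language model, and the fusion weight of 0.3 is illustrative.

```python
def beam_step(beams, am_next, lm_log_prob, lm_weight=0.3, beam_size=8):
    """One step of beam search with shallow language-model fusion.

    beams: list of (prefix, score) pairs, prefix being a list of token ids.
    am_next(prefix): acoustic-model log-probs over next tokens (assumed hook).
    lm_log_prob(prefix, tok): LM log-prob of tok given prefix (assumed hook).
    """
    candidates = []
    for prefix, score in beams:
        for tok, am_lp in enumerate(am_next(prefix)):
            # Fused score: acoustic log-prob plus weighted LM log-prob.
            fused = score + am_lp + lm_weight * lm_log_prob(prefix, tok)
            candidates.append((prefix + [tok], fused))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```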
Towards End-to-End Speech Recognition
Standard automatic speech recognition (ASR) systems follow a divide-and-conquer approach to convert speech into text: the end goal is achieved by a combination of sub-tasks, namely feature extraction, acoustic modeling, and sequence decoding, each optimized independently. More recently, deep learning approaches have emerged in the machine learning community that allow systems to be trained in an end-to-end manner. Such approaches have found success in natural language processing and computer vision, and have consequently piqued interest in the speech community. The present thesis builds on these recent advances to investigate end-to-end approaches to speech recognition, along two main axes of research: joint learning of features and classifiers for acoustic modeling, and joint modeling of the acoustic model and the decoder.

Along the first axis, in the framework of hybrid hidden Markov model/artificial neural network (HMM/ANN) based ASR, we develop a convolutional neural network (CNN) based acoustic modeling approach that takes the raw speech signal as input and estimates phone class conditional probabilities. Specifically, the CNN has several convolution layers (the feature stage) followed by a multilayer perceptron (the classifier stage), which are jointly optimized during training. Through ASR studies on multiple languages and extensive analysis of the approach, we show that the proposed approach, with minimal prior knowledge, automatically learns the relevant features from the raw speech signal, yielding systems with fewer parameters and better performance than the conventional approach of cepstral feature extraction followed by classifier training. As the features are automatically learned from the signal, a natural question arises: are such systems robust to noise? To address it, we propose a robust CNN approach, referred to as the normalized CNN approach, which yields systems that are as robust as or better than conventional ASR systems using cepstral features (with feature-level normalizations).

The second axis of research focuses on end-to-end sequence-to-sequence conversion. We first propose an end-to-end phoneme recognition system in which the relevant features, the classifier, and the decoder (based on conditional random fields) are jointly modeled during training, and we demonstrate the viability of the approach on the TIMIT phoneme recognition task. Building on top of that, we investigate a "weakly supervised" training scheme that alleviates the need for frame-level alignments. Finally, we extend the weakly supervised approach to propose a novel keyword spotting technique, in which a CNN first processes the input observation sequence to output word-level scores, which are subsequently aggregated to detect or spot words. We demonstrate the potential of the approach through a comparative study on LibriSpeech against the standard approach of keyword spotting based on lattice indexing using an ASR system.
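A minimal PyTorch sketch of the jointly trained feature and classifier stages described along the first axis, assuming 16 kHz input and illustrative layer sizes; it is a schematic of the idea, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    def __init__(self, n_phones=48):
        super().__init__()
        # Feature stage: convolutions applied directly to the raw waveform.
        self.features = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=400, stride=160),  # ~25 ms / 10 ms at 16 kHz
            nn.ReLU(),
            nn.Conv1d(80, 80, kernel_size=5), nn.ReLU(),
            nn.Conv1d(80, 80, kernel_size=5), nn.ReLU(),
        )
        # Classifier stage: an MLP producing per-frame phone posteriors.
        self.classifier = nn.Sequential(
            nn.Linear(80, 512), nn.ReLU(),
            nn.Linear(512, n_phones),
        )

    def forward(self, wav):                   # wav: (batch, samples)
        h = self.features(wav.unsqueeze(1))   # (batch, 80, frames)
        # Both stages are optimized jointly by backpropagating through here.
        return self.classifier(h.transpose(1, 2)).log_softmax(-1)
```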
SynJax: Structured Probability Distributions for JAX
The development of deep learning software libraries enabled significant
progress in the field by allowing users to focus on modeling while letting the
library take care of the tedious and time-consuming task of optimizing
execution for modern hardware accelerators. However, this has benefited only
particular types of deep learning models, such as Transformers, whose
primitives map easily to vectorized computation. Models that explicitly
account for structured objects, such as trees and segmentations, have not
benefited equally, because they require custom algorithms that are difficult
to implement in a vectorized form.
SynJax directly addresses this problem by providing an efficient vectorized
implementation of inference algorithms for structured distributions covering
alignment, tagging, segmentation, constituency trees and spanning trees. With
SynJax we can build large-scale differentiable models that explicitly model
structure in the data. The code is available at
https://github.com/deepmind/synjax
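To illustrate the kind of algorithm SynJax vectorizes, here is a minimal JAX sketch of the forward (log-partition) recursion for a linear-chain CRF, written with lax.scan so it JIT-compiles and differentiates end-to-end; this is an independent illustration, not SynJax's own API.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def crf_log_partition(emissions, transitions):
    """Log-partition of a linear-chain CRF.

    emissions: (T, num_tags) local scores; transitions: (num_tags, num_tags).
    """
    def step(alpha, emit):
        # For every current tag, logsumexp over the previous tag (vectorized).
        return logsumexp(alpha[:, None] + transitions, axis=0) + emit, None
    alpha, _ = jax.lax.scan(step, emissions[0], emissions[1:])
    return logsumexp(alpha)

# Gradients of the log-partition give expected sufficient statistics, e.g.
# per-position tag marginals:
#   marginals = jax.grad(crf_log_partition)(emissions, transitions)
```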