Discriminative Segmental Cascades for Feature-Rich Phone Recognition
Discriminative segmental models, such as segmental conditional random fields
(SCRFs) and segmental structured support vector machines (SSVMs), have had
success in speech recognition via both lattice rescoring and first-pass
decoding. However, such models suffer from slow decoding, hampering the use of
computationally expensive features, such as segment neural networks or other
high-order features. A typical solution is to use approximate decoding, either
by beam pruning in a single pass or by beam pruning to generate a lattice
followed by a second pass. In this work, we study discriminative segmental
models trained with a hinge loss (i.e., segmental structured SVMs). We show
that beam search is not suitable for learning rescoring models in this
approach, though it gives good approximate decoding performance when the model
is already well-trained. Instead, we consider an approach inspired by
structured prediction cascades, which use max-marginal pruning to generate
lattices. We obtain a high-accuracy phonetic recognition system with several
expensive feature types: a segment neural network, a second-order language
model, and second-order phone boundary features.
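
The max-marginal pruning mentioned in this abstract can be illustrated with a minimal sketch. It assumes a zeroth-order segmental model (each segment is scored independently) and a threshold that interpolates between the best path score and the mean max-marginal, as is common in the structured prediction cascades literature. The function name, the `score` callback, and the `lam` parameter are hypothetical and are not taken from the paper.

```python
import math

def max_marginal_prune(num_frames, labels, score, max_len, lam=0.5):
    """Keep segment edges whose max-marginal clears a threshold.

    An edge (s, t, y) is a segment covering frames [s, t) with label y.
    Its max-marginal is the score of the best full path passing through it.
    """
    T = num_frames
    # Edges are generated in order of increasing end time t.
    edges = [(s, t, y)
             for t in range(1, T + 1)
             for s in range(max(0, t - max_len), t)
             for y in labels]

    # fwd[t]: best score of any segmentation of frames [0, t).
    fwd = [-math.inf] * (T + 1)
    fwd[0] = 0.0
    for s, t, y in edges:
        fwd[t] = max(fwd[t], fwd[s] + score(s, t, y))

    # bwd[s]: best score of any segmentation of frames [s, T).
    bwd = [-math.inf] * (T + 1)
    bwd[T] = 0.0
    for s, t, y in sorted(edges, key=lambda e: -e[0]):
        bwd[s] = max(bwd[s], score(s, t, y) + bwd[t])

    max_marginals = [fwd[s] + score(s, t, y) + bwd[t] for s, t, y in edges]
    best = fwd[T]
    mean = sum(max_marginals) / len(max_marginals)
    # Assumed threshold form: lam = 1 keeps only best-path edges.
    threshold = lam * best + (1.0 - lam) * mean
    return [e for e, m in zip(edges, max_marginals) if m >= threshold]
```

Unlike beam pruning, this criterion never discards an edge that lies on the highest-scoring path, because that edge's max-marginal equals the best score and the threshold cannot exceed it.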
End-to-end neural segmental models for speech recognition
Segmental models are an alternative to frame-based models for sequence
prediction, where hypothesized path weights are based on entire segment scores
rather than a single frame at a time. Neural segmental models are segmental
models that use neural network-based weight functions. Neural segmental models
have achieved competitive results for speech recognition, and their end-to-end
training has been explored in several studies. In this work, we review neural
segmental models, which can be viewed as consisting of a neural network-based
acoustic encoder and a finite-state transducer decoder. We study end-to-end
segmental models with different weight functions, including ones based on
frame-level neural classifiers and on segmental recurrent neural networks. We
study how reducing the search space size impacts performance under different
weight functions. We also compare several loss functions for end-to-end
training. Finally, we explore training approaches, including multi-stage vs.
end-to-end training and multitask training that combines segmental and
frame-level losses.
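
As an illustration of a segmental model whose weight function is built from frame-level scores, the sketch below decodes the best segmentation with a Viterbi-style dynamic program. It assumes the segment score is simply the sum of per-frame label scores and that segments are at most `max_len` frames long; `frame_scores`, `segment_score`, and `viterbi_segments` are hypothetical names, and the weight functions studied in the paper may differ.

```python
import math

def segment_score(frame_scores, s, t, y):
    # Assumed weight function: sum of frame-level scores for label y over [s, t).
    return sum(frame_scores[i][y] for i in range(s, t))

def viterbi_segments(frame_scores, labels, max_len):
    """Return (best score, list of (start, end, label)) for one utterance."""
    T = len(frame_scores)
    best = [-math.inf] * (T + 1)   # best[t]: best score for frames [0, t)
    back = [None] * (T + 1)        # back[t]: (start, label) of the last segment
    best[0] = 0.0
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            for y in labels:
                cand = best[s] + segment_score(frame_scores, s, t, y)
                if cand > best[t]:
                    best[t] = cand
                    back[t] = (s, y)
    # Follow back-pointers from the final frame to recover the segments.
    segments = []
    t = T
    while t > 0:
        s, y = back[t]
        segments.append((s, t, y))
        t = s
    segments.reverse()
    return best[T], segments
```

Limiting `max_len` is one simple way to shrink the search space: the inner loops visit on the order of `T * max_len * len(labels)` segment hypotheses instead of every possible span.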
Segmental Recurrent Neural Networks for End-to-end Speech Recognition
We study the segmental recurrent neural network for end-to-end acoustic
modelling. This model connects the segmental conditional random field (CRF)
with a recurrent neural network (RNN) used for feature extraction. Compared to
most previous CRF-based acoustic models, it does not rely on an external system
to provide features or segmentation boundaries. Instead, this model
marginalises out all the possible segmentations, and features are extracted
from the RNN trained together with the segmental CRF. In essence, this model is
self-contained and can be trained end-to-end. In this paper, we discuss
practical training and decoding issues, as well as a method for speeding up
training, in the context of speech recognition. We performed experiments on the
TIMIT dataset. We achieved a phone error rate (PER) of 17.3% in first-pass
decoding, the best reported result using CRFs, even though we used only a
zeroth-order CRF and no language model.
Comment: 5 pages, 2 figures, accepted by Interspeech 201
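
Marginalising over all possible segmentations, as described in this abstract, can be done with a forward recursion. The sketch below computes the log partition function of a zeroth-order segmental CRF. The `score` callback stands in for the RNN-derived segment scores, and the maximum segment length `max_len` is an assumption made to keep the example small; neither is taken from the paper.

```python
import math

def logsumexp(values):
    # Numerically stable log(sum(exp(v))) for a non-empty list of floats.
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(num_frames, labels, score, max_len):
    """Log of the summed exponentiated scores of all labelled segmentations.

    alpha[t] accumulates every way of segmenting and labelling frames [0, t).
    """
    alpha = [-math.inf] * (num_frames + 1)
    alpha[0] = 0.0
    for t in range(1, num_frames + 1):
        terms = [alpha[s] + score(s, t, y)
                 for s in range(max(0, t - max_len), t)
                 for y in labels]
        alpha[t] = logsumexp(terms)
    return alpha[num_frames]
```

With all segment scores set to zero, the result is the log of the number of labelled segmentations, which gives a quick sanity check: three frames and two labels yield 18 of them.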
Neural approaches to spoken content embedding
Comparing spoken segments is a central operation in speech processing.
Traditional approaches in this area have favored frame-level dynamic
programming algorithms, such as dynamic time warping, because they require no
supervision, but they are limited in performance and efficiency. As an
alternative, acoustic word embeddings -- fixed-dimensional vector
representations of variable-length spoken word segments -- have begun to be
considered for such tasks as well. However, the current space of such
discriminative embedding models, training approaches, and their application to
real-world downstream tasks is limited. We start by considering "single-view"
training losses where the goal is to learn an acoustic word embedding model
that separates same-word and different-word spoken segment pairs. Then, we
consider "multi-view" contrastive losses. In this setting, acoustic word
embeddings are learned jointly with embeddings of character sequences to
generate acoustically grounded embeddings of written words, or acoustically
grounded word embeddings.
In this thesis, we contribute new discriminative acoustic word embedding
(AWE) and acoustically grounded word embedding (AGWE) approaches based on
recurrent neural networks (RNNs). We improve model training in terms of both
efficiency and performance. We take these developments beyond English to
several low-resource languages and show that multilingual training improves
performance when labeled data is limited. We apply our embedding models, both
monolingual and multilingual, to the downstream tasks of query-by-example
speech search and automatic speech recognition. Finally, we show how our
embedding approaches compare with and complement more recent self-supervised
speech models.
Comment: PhD thesis
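
The idea of a loss that separates same-word from different-word segment pairs can be sketched as a margin-based triplet loss over fixed-dimensional embeddings. The cosine distance, the margin value, and the function names below are assumptions chosen for illustration; the thesis may use different distances, margins, or negative-sampling schemes.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity; assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that is zero once the same-word pair is closer than the
    different-word pair by at least `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)
```

In a multi-view variant, the anchor would be an acoustic embedding while the positive and negative come from a second encoder over character sequences, with the same hinge form applied across the two views.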