Discriminative Segmental Cascades for Feature-Rich Phone Recognition
Discriminative segmental models, such as segmental conditional random fields
(SCRFs) and segmental structured support vector machines (SSVMs), have had
success in speech recognition via both lattice rescoring and first-pass
decoding. However, such models suffer from slow decoding, hampering the use of
computationally expensive features, such as segment neural networks or other
high-order features. A typical solution is to use approximate decoding, either
by beam pruning in a single pass or by beam pruning to generate a lattice
followed by a second pass. In this work, we study discriminative segmental
models trained with a hinge loss (i.e., segmental structured SVMs). We show
that beam search is not suitable for learning rescoring models in this
approach, though it gives good approximate decoding performance when the model
is already well-trained. Instead, we consider an approach inspired by
structured prediction cascades, which use max-marginal pruning to generate
lattices. We obtain a high-accuracy phonetic recognition system with several
expensive feature types: a segment neural network, a second-order language
model, and second-order phone boundary features.
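The pruning step behind this cascade can be pictured with a small sketch. Assuming a lattice of scored candidate segments, the max-marginal of a segment is the score of the best complete segmentation forced to pass through it, and segments whose max-marginal falls below an interpolated threshold are discarded; the flat per-segment score, the threshold rule, and the function below are illustrative assumptions, not the authors' implementation.

    def max_marginal_prune(num_frames, segments, alpha=0.5):
        """segments: list of (start, end, label, score), 0 <= start < end <= num_frames.
        Returns the segments whose max-marginal clears an interpolated threshold."""
        NEG = float("-inf")
        fwd = [NEG] * (num_frames + 1)   # best partial segmentation ending at frame t
        bwd = [NEG] * (num_frames + 1)   # best partial segmentation starting at frame t
        fwd[0], bwd[num_frames] = 0.0, 0.0
        for s, e, _, sc in sorted(segments, key=lambda x: x[1]):   # increasing end frame
            if fwd[s] > NEG:
                fwd[e] = max(fwd[e], fwd[s] + sc)
        for s, e, _, sc in sorted(segments, key=lambda x: -x[0]):  # decreasing start frame
            if bwd[e] > NEG:
                bwd[s] = max(bwd[s], bwd[e] + sc)
        best = fwd[num_frames]                      # score of the 1-best segmentation
        # Max-marginal of a segment: best complete segmentation forced through it.
        mm = [fwd[s] + sc + bwd[e] for s, e, _, sc in segments]
        finite = [m for m in mm if m > NEG]
        # Threshold interpolates between the 1-best score and the mean max-marginal
        # (an arbitrary choice here; cascades tune this trade-off).
        thresh = alpha * best + (1 - alpha) * sum(finite) / max(len(finite), 1)
        return [seg for seg, m in zip(segments, mm) if m >= thresh]

In a cascade, the segments that survive this pruning would then be rescored in a later pass with the expensive feature types listed above.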
Efficient lattice rescoring using recurrent neural network language models
This is the accepted manuscript of a paper published in the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4-9 May 2014, by Liu, X.; Wang, Y.; Chen, X.; Gales, M.J.F.; and Woodland, P.C.

Recurrent neural network language models (RNNLMs) have become
an increasingly popular choice for state-of-the-art speech recognition
systems due to their inherently strong generalization performance.
As these models use a vector representation of complete
history contexts, RNNLMs are normally used to rescore N-best lists.
Motivated by their intrinsic characteristics, two novel lattice rescoring
methods for RNNLMs are investigated in this paper. The first
uses an n-gram style clustering of history contexts. The second approach
directly exploits the distance measure between hidden history
vectors. Both methods produced 1-best performance comparable
with a 10k-best rescoring baseline RNNLM system on a large vocabulary
conversational telephone speech recognition task. Significant
lattice size compression of over 70% and consistent improvements
after confusion network (CN) decoding were also obtained over the
N-best rescoring approach.

The research leading to these results was supported by EPSRC grant
EP/I031022/1 (Natural Speech Technology) and DARPA under the Broad
Operational Language Translation (BOLT) and RATS programs.
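A rough sketch of the first idea (n-gram style clustering of history contexts): during lattice expansion, hypotheses reaching the same lattice node are merged whenever their most recent n-1 words agree, so only one RNNLM hidden state survives per cluster. The Lattice and RNNLM interfaces, the lm_scale value, and the traversal below are hypothetical placeholders, not the paper's implementation.

    def rescore_lattice(lattice, rnnlm, n=4, lm_scale=12.0):
        """Assumed interfaces: lattice.start / lattice.final are node ids,
        lattice.arcs_from(node) yields (word, ac_score, next_node) for an acyclic
        lattice, rnnlm.initial_state() and rnnlm.score(state, word) -> (logp, state)."""
        init = (lattice.start, ())                       # (node, truncated history)
        best = {init: (0.0, rnnlm.initial_state(), [])}  # score, RNNLM state, word sequence
        agenda = [init]
        while agenda:
            key = agenda.pop()
            node, hist = key
            score, state, words = best[key]
            for word, ac_score, nxt in lattice.arcs_from(node):
                logp, new_state = rnnlm.score(state, word)
                # Cluster on the last n-1 words: paths that agree on this truncated
                # history share a single surviving RNNLM state at the next node.
                new_key = (nxt, (hist + (word,))[-(n - 1):])
                new_score = score + ac_score + lm_scale * logp
                if new_key not in best or new_score > best[new_key][0]:
                    best[new_key] = (new_score, new_state, words + [word])
                    agenda.append(new_key)
        finals = [v for (node, _), v in best.items() if node == lattice.final]
        return max(finals, key=lambda v: v[0])[2] if finals else []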
Language Modeling with Deep Transformers
We explore deep autoregressive Transformer models in language modeling for
speech recognition. We focus on two aspects. First, we revisit Transformer
model configurations specifically for language modeling. We show that
well-configured Transformer models outperform our baseline models based on a
shallow stack of LSTM recurrent neural network layers. We carry out experiments
on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level
and 10K byte-pair encoding subword-level language modeling. We apply our
word-level models to conventional hybrid speech recognition by lattice
rescoring, and the subword-level models to attention based encoder-decoder
models by shallow fusion. Second, we show that deep Transformer language models
do not require positional encoding. The positional encoding is an essential
augmentation for the self-attention mechanism, which is otherwise invariant to
sequence ordering. However, in the autoregressive setup, as is the case for
language modeling, the amount of information increases along the position
dimension, which is itself a positional signal. The analysis of attention weights
shows that deep autoregressive self-attention models can automatically make use
of such positional information. We find that removing the positional encoding
even slightly improves the performance of these models.

Comment: To appear in the proceedings of INTERSPEECH 2019.
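A minimal sketch of the kind of model discussed, assuming a recent PyTorch: an autoregressive Transformer language model whose only source of positional information is the causal self-attention mask, with no positional encoding added to the embeddings. The hyperparameters are illustrative, not the paper's configurations.

    import torch
    import torch.nn as nn

    class NoPosTransformerLM(nn.Module):
        def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)   # note: no positional encoding
            layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, tokens):                           # tokens: (batch, time)
            T = tokens.size(1)
            # Causal mask: position t attends only to positions <= t, so the amount
            # of visible context grows with t, an implicit positional signal.
            mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
            h = self.encoder(self.embed(tokens), mask=mask)
            return self.out(h)                               # next-word logits

    # Toy usage: logits = NoPosTransformerLM(10000)(torch.randint(0, 10000, (2, 16)))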
Learning Spoken Language Representations with Neural Lattice Language Modeling
Pre-trained language models have achieved huge improvement on many NLP tasks.
However, these methods are usually designed for written text, so they do not
consider the properties of spoken language. Therefore, this paper aims at
generalizing the idea of language model pre-training to lattices generated by
recognition systems. We propose a framework that trains neural lattice language
models to provide contextualized representations for spoken language
understanding tasks. The proposed two-stage pre-training approach reduces the
demand for speech data and is more efficient. Experiments on intent
detection and dialogue act recognition datasets demonstrate that our proposed
method consistently outperforms strong baselines when evaluated on spoken
inputs. The code is available at https://github.com/MiuLab/Lattice-ELMo.

Comment: Published in ACL 2020 as a short paper.
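One way to picture a neural lattice language model is a recurrence that follows lattice arcs instead of a single word sequence, pooling the states that arrive at each node. The sketch below, with assumed embed and step interfaces and posterior-weighted pooling, only illustrates that idea and is not the Lattice-ELMo architecture or its pre-training objective.

    import numpy as np

    def lattice_forward(nodes_topo, in_arcs, embed, step, d_hidden=128):
        """nodes_topo: node ids in topological order; in_arcs[n]: list of
        (prev_node, word_id, posterior) for every non-start node; embed[word_id]:
        word vector; step(h_prev, x) -> new hidden vector (an assumed RNN step)."""
        h = {nodes_topo[0]: np.zeros(d_hidden)}              # start-node state
        for node in nodes_topo[1:]:
            states, weights = [], []
            for prev, word, post in in_arcs[node]:
                states.append(step(h[prev], embed[word]))    # advance along the arc
                weights.append(post)
            w = np.asarray(weights, dtype=float)
            w = w / max(w.sum(), 1e-8)
            # Posterior-weighted pooling over all incoming arcs of this node.
            h[node] = np.sum(w[:, None] * np.stack(states), axis=0)
        return h                                             # per-node contextual states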
Two efficient lattice rescoring methods using recurrent neural network language models
An important part of the language modelling problem for automatic speech recognition (ASR) systems, and many other related applications, is to appropriately model long-distance context dependencies in natural languages. Hence, statistical language models (LMs) that can model longer-span history contexts, for example, recurrent neural network language models (RNNLMs), have become increasingly popular for state-of-the-art ASR systems. As RNNLMs use a vector representation of complete history contexts, they are normally used to rescore N-best lists. Motivated by their intrinsic characteristics, two efficient lattice rescoring methods for RNNLMs are proposed in this paper. The first method uses an n-gram style clustering of history contexts. The second approach directly exploits the distance measure between recurrent hidden history vectors. Both methods produced 1-best performance comparable to a 10k-best rescoring baseline RNNLM system on two large vocabulary conversational telephone speech recognition tasks for US English and Mandarin Chinese. Consistent lattice size compression and recognition performance improvements after confusion network (CN) decoding were also obtained over the prefix tree structured N-best rescoring approach.

This work was supported by EPSRC under Grant EP/I031022/1 (Natural Speech Technology) and DARPA under the Broad Operational Language Translation and RATS programs. The work of X. Chen was supported by Toshiba Research Europe Ltd, Cambridge Research Lab.

This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/TASLP.2016.255882
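The second method can be sketched as follows: when a new hypothesis reaches a lattice node, its recurrent hidden history vector is compared against the vectors already expanded at that node, and the hypothesis reuses an existing RNNLM state if the distance falls within a beam, rather than requiring an exact (n-1)-gram history match. The interface and the distance threshold below are assumptions for illustration, not the paper's exact criterion.

    import numpy as np

    def find_mergeable_state(cached, new_hidden, dist_beam=0.002):
        """cached: list of (hidden_vector, entry) already expanded at this lattice
        node. Returns the entry whose hidden vector is closest to new_hidden if
        that distance is within dist_beam, otherwise None (expand a new state)."""
        best_entry, best_dist = None, float("inf")
        for hidden, entry in cached:
            d = float(np.mean((hidden - new_hidden) ** 2))   # mean squared distance
            if d < best_dist:
                best_entry, best_dist = entry, d
        return best_entry if best_dist <= dist_beam else None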