116 research outputs found
Recommended from our members
Exploiting Future Word Contexts in Neural Network Language Models for Speech Recognition
Language modelling is a crucial component in a wide range of applications including speech recognition. Language models (LMs) are usually constructed by splitting a sentence into words and computing the probability of a word based on its word history. This sentence probability calculation, making use of conditional probability distributions, assumes that there is little impact from approximations used in the LMs including:
the word history representations; and approaches to handle finite training data. This motivates examining models that make use of additional information from the sentence. In this work future word information, in addition to the history, is used to predict the probability of the current word. For recurrent neural network LMs (RNNLMs) this information can be encapsulated in a bi-directional model. However, if used directly this form
of model is computationally expensive when training on large quantities of data, and can be problematic when used with word lattices. This paper proposes a novel neural network language model structure, the succeeding-word RNNLM, su-RNNLM, to address these issues. Instead of using a recurrent unit to capture the complete future word contexts, a feed-forward unit is used to model a fixed finite number of succeeding words. This is more efficient in training than bi-directional models and can be applied to lattice rescoring. The generated lattices can be used for downstream applications, such as confusion network decoding and keyword search. Experimental results on speech recognition and keyword spotting tasks illustrate the empirical usefulness of future word information, and the flexibility of the proposed model to represent this information
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States
Beam search, which is the dominant ASR decoding algorithm for end-to-end
models, generates tree-structured hypotheses. However, recent studies have
shown that decoding with hypothesis merging can achieve a more efficient search
with comparable or better performance. But, the full context in recurrent
networks is not compatible with hypothesis merging. We propose to use
vector-quantized long short-term memory units (VQ-LSTM) in the prediction
network of RNN transducers. By training the discrete representation jointly
with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN
transducers improve ASR performance over transducers with regular prediction
networks while also producing denser lattices with a very low oracle word error
rate (WER) for the same beam size. Additional language model rescoring
experiments also demonstrate the effectiveness of the proposed lattice
generation scheme.Comment: Interspeech 2022 accepted pape
Log-linear system combination using structured support vector machines
Building high accuracy speech recognition systems with limited language resources is a highly challenging task. Although the use of multi-language data for acoustic models yields improvements, performance is often unsatisfactory with highly limited acoustic training data. In these situations, it is possible to consider using multiple well trained acoustic models and combine the system outputs together. Unfortunately, the computational cost associated with these approaches is high as multiple decoding runs are required. To address this problem, this paper examines schemes based on log-linear score combination. This has a number of advantages over standard combination schemes. Even with limited acoustic training data, it is possible to train, for example, phone-specific combination weights, allowing detailed relationships between the available well
trained models to be obtained. To ensure robust parameter estimation, this paper casts log-linear score combination into a structured support vector machine (SSVM) learning task. This yields a method to train model parameters with good generalisation properties. Here the SSVM feature space is a set of scores from well-trained individual systems. The SSVM approach is compared to lattice rescoring and confusion network combination using language packs released within the IARPA Babel program
Adaptation and Augmentation: Towards Better Rescoring Strategies for Automatic Speech Recognition and Spoken Term Detection
Selecting the best prediction from a set of candidates is an essential problem for many spoken language processing tasks, including automatic speech recognition (ASR) and spoken keyword spotting (KWS). Generally, the selection is determined by a confidence score assigned to each candidate. Calibrating these confidence scores (i.e., rescoring them) could make better selections and improve the system performance. This dissertation focuses on using tailored language models to rescore ASR hypotheses as well as keyword search results for ASR-based KWS.
This dissertation introduces three kinds of rescoring techniques: (1) Freezing most model parameters while fine-tuning the output layer in order to adapt neural network language models (NNLMs) from the written domain to the spoken domain. Experiments on a large-scale Italian corpus show a 30.2% relative reduction in perplexity at the word-cluster level and a 2.3% relative reduction in WER in a state-of-the-art Italian ASR system. (2) Incorporating source application information associated with speech queries. By exploring a range of adaptation model architectures, we achieve a 21.3% relative reduction in perplexity compared to a fine-tuned baseline. Initial experiments using a state-of-the-art Italian ASR system show a 3.0% relative reduction in WER on top of an unadapted 5-gram LM. In addition, human evaluations show significant improvements by using the source application information. (3) Marrying machine learning algorithms (classification and ranking) with a variety of signals to rescore keyword search results in the context of KWS for low-resource languages. These systems, built for the IARPA BABEL Program, enhance search performance in terms of maximum term-weighted value (MTWV) across six different low-resource languages: Vietnamese, Tagalog, Pashto, Turkish, Zulu and Tamil
Recommended from our members
Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages
Copyright © 2015 ISCA. Keyword spotting (KWS) for low-resource languages has drawn increasing attention in recent years. The state-of-the-art KWS systems are based on lattices or Confusion Networks (CN) generated by Automatic Speech Recognition (ASR) systems. It has been shown that considerable KWS gains can be obtained by combining the keyword detection results from different forms of ASR systems, e.g., Tandem and Hybrid systems. This paper investigates an alternative combination scheme for KWS using joint decoding. This scheme treats a Tandem system and a Hybrid system as two separate streams, and makes a linear combination of individual acoustic model log-likelihoods. Joint decoding is more efficient as it requires just a single pass of decoding and a single pass of keyword search. Experiments on six Babel OP2 development languages show that joint decoding is capable of providing consistent gains over each individual system. Moreover, it is possible to efficiently rescore the joint decoding lattices with Tandem or Hybrid acoustic models, and further KWS gains can be obtained by merging the detection posting lists from the joint decoding lattices and rescored lattices
Future word contexts in neural network language models
Recently, bidirectional recurrent network language models (bi-RNNLMs) have
been shown to outperform standard, unidirectional, recurrent neural network
language models (uni-RNNLMs) on a range of speech recognition tasks. This
indicates that future word context information beyond the word history can be
useful. However, bi-RNNLMs pose a number of challenges as they make use of the
complete previous and future word context information. This impacts both
training efficiency and their use within a lattice rescoring framework. In this
paper these issues are addressed by proposing a novel neural network structure,
succeeding word RNNLMs (su-RNNLMs). Instead of using a recurrent unit to
capture the complete future word contexts, a feedforward unit is used to model
a finite number of succeeding, future, words. This model can be trained much
more efficiently than bi-RNNLMs and can also be used for lattice rescoring.
Experimental results on a meeting transcription task (AMI) show the proposed
model consistently outperformed uni-RNNLMs and yield only a slight degradation
compared to bi-RNNLMs in N-best rescoring. Additionally, performance
improvements can be obtained using lattice rescoring and subsequent confusion
network decoding
- …