Exploiting Future Word Contexts in Neural Network Language Models for Speech Recognition
Language modelling is a crucial component in a wide range of applications, including speech recognition. Language models (LMs) are usually constructed by splitting a sentence into words and computing the probability of each word given its word history. This sentence probability calculation, based on conditional probability distributions, assumes that the approximations used in the LMs, including the word-history representation and the approaches to handling finite training data, have little impact. This motivates examining models that make use of additional information from the sentence. In this work, future word information, in addition to the history, is used to predict the probability of the current word. For recurrent neural network LMs (RNNLMs), this information can be encapsulated in a bi-directional model. However, used directly, this form of model is computationally expensive to train on large quantities of data and can be problematic when used with word lattices. This paper proposes a novel neural network language model structure, the succeeding-word RNNLM (su-RNNLM), to address these issues. Instead of using a recurrent unit to capture the complete future word context, a feed-forward unit is used to model a fixed, finite number of succeeding words. This is more efficient to train than bi-directional models and can be applied to lattice rescoring. The generated lattices can be used for downstream applications, such as confusion network decoding and keyword search. Experimental results on speech recognition and keyword spotting tasks illustrate the empirical usefulness of future word information and the flexibility of the proposed model in representing it.
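The key structural idea above — a recurrent unit over the history combined with a feed-forward unit over a fixed window of k succeeding words — can be sketched as follows. This is a minimal illustration, not the paper's implementation: all dimensions, weight initialisations, and the two-unit concatenation layout are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): vocab V, embedding d, hidden h,
# and k = the fixed number of succeeding words modelled by the feed-forward unit.
V, d, h, k = 10, 8, 16, 2

E = rng.normal(size=(V, d))               # shared word embeddings
W_hh = rng.normal(size=(h, h)) * 0.1      # recurrent weights (history side)
W_ih = rng.normal(size=(d, h)) * 0.1      # input-to-hidden weights (history side)
W_f = rng.normal(size=(k * d, h)) * 0.1   # feed-forward weights over k future words
W_out = rng.normal(size=(2 * h, V)) * 0.1 # output layer over [history; future] states

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def su_rnnlm_step(hist_state, w_prev, future_words):
    """One prediction step: a recurrent unit summarises the history,
    a feed-forward unit summarises a fixed window of k succeeding words,
    and both feed the output distribution over the current word."""
    hist_state = np.tanh(hist_state @ W_hh + E[w_prev] @ W_ih)
    fut = np.tanh(np.concatenate([E[w] for w in future_words]) @ W_f)
    logits = np.concatenate([hist_state, fut]) @ W_out
    return hist_state, softmax(logits)

state = np.zeros(h)
state, probs = su_rnnlm_step(state, w_prev=3, future_words=[5, 7])
```

Because the future context is a fixed-length window rather than a full backward recurrence, each step depends only on k succeeding words, which is what makes lattice rescoring tractable compared with a bi-directional model.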
Log-linear system combination using structured support vector machines
Building high-accuracy speech recognition systems with limited language resources is a highly challenging task. Although the use of multi-language data for acoustic models yields improvements, performance is often unsatisfactory with highly limited acoustic training data. In these situations, it is possible to use multiple well-trained acoustic models and combine the system outputs. Unfortunately, the computational cost of these approaches is high, as multiple decoding runs are required. To address this problem, this paper examines schemes based on log-linear score combination, which has a number of advantages over standard combination schemes. Even with limited acoustic training data, it is possible to train, for example, phone-specific combination weights, capturing detailed relationships between the available well-trained models. To ensure robust parameter estimation, this paper casts log-linear score combination as a structured support vector machine (SSVM) learning task, yielding a method to train model parameters with good generalisation properties. Here, the SSVM feature space is a set of scores from well-trained individual systems. The SSVM approach is compared to lattice rescoring and confusion network combination using language packs released within the IARPA Babel program.
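The core operation the abstract describes, log-linear score combination with phone-specific weights, is simple to sketch. The system names, phone labels, scores, and weight values below are all illustrative placeholders (in the paper the weights would be estimated via SSVM training, which is not shown here).

```python
# Hypothetical per-system log-scores for one arc, from three
# well-trained acoustic models (illustrative numbers, not real data).
system_scores = {"sys1": -4.2, "sys2": -3.8, "sys3": -5.1}

# Phone-specific combination weights: each phone gets its own weight vector
# over the component systems (in practice estimated, e.g. via an SSVM).
weights = {
    "aa": {"sys1": 0.5, "sys2": 0.3, "sys3": 0.2},
    "sh": {"sys1": 0.2, "sys2": 0.6, "sys3": 0.2},
}

def log_linear_combine(phone, scores):
    """Combined log-score: a weighted sum of the individual systems'
    log-scores, with weights selected by the phone identity."""
    w = weights[phone]
    return sum(w[s] * scores[s] for s in scores)

combined = log_linear_combine("aa", system_scores)
```

Because combination happens at the score level, only one decoding pass per system is needed up front; the per-phone weights then capture which system is more reliable for which sound.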
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States
Beam search, the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses. However, recent studies have shown that decoding with hypothesis merging can achieve a more efficient search with comparable or better performance. But the full context in recurrent networks is not compatible with hypothesis merging. We propose to use vector-quantized long short-term memory units (VQ-LSTM) in the prediction network of RNN transducers. By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation. Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks while also producing denser lattices with a very low oracle word error rate (WER) for the same beam size. Additional language model rescoring experiments also demonstrate the effectiveness of the proposed lattice generation scheme. Comment: Interspeech 2022 accepted paper.
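The mechanism that makes merging possible — quantizing the continuous prediction-network state to a codebook index, then collapsing hypotheses that share the same index — can be illustrated with a minimal sketch. The codebook size, state dimension, and nearest-neighbour quantizer below are assumptions for demonstration; the paper trains the discrete representation jointly with the ASR network, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical learned codebook: 8 codes over 4-dimensional states.
codebook = rng.normal(size=(8, 4))

def quantize(state):
    """Map a continuous prediction-network state to its nearest codebook index."""
    return int(np.argmin(np.linalg.norm(codebook - state, axis=1)))

# Hypotheses carry (log-score, continuous state). Keying the merge on the
# discrete code lets hypotheses with different word histories but the same
# quantized state collapse into one lattice node.
hyps = [
    (-1.2, rng.normal(size=4)),
    (-1.5, rng.normal(size=4)),
    (-0.9, rng.normal(size=4)),
]

merged = {}
for score, state in hyps:
    key = quantize(state)
    if key not in merged or score > merged[key][0]:
        merged[key] = (score, state)  # keep the best-scoring hypothesis per code
```

With a regular LSTM prediction network, two hypotheses almost never have identical continuous states, so exact merging is impossible; the finite codebook is what creates collisions and hence denser lattices.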
Echolocation: Using Word-Burst Analysis to Rescore Keyword Search Candidates in Low-Resource Languages
State-of-the-art technologies for speech recognition are very accurate for heavily studied languages like English. They perform poorly, though, for languages in which the recorded archives of speech data available to researchers are relatively scant. In the context of these low-resource languages, the task of keyword search within recorded speech is formidable. We demonstrate a method that generates more accurate keyword search results on low-resource languages by exploiting a pattern not used by the speech recognizer. The word-burst, or burstiness, pattern is the tendency for word utterances to appear together in bursts as conversational topics fluctuate. We give evidence that the burstiness phenomenon exhibits itself across varied languages. Using burstiness features to train a machine-learning algorithm, we are able to assess the likelihood that a hypothesized keyword location is correct and adjust its confidence score accordingly, yielding improvements in the efficacy of keyword search in low-resource languages.
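The rescoring idea described above can be sketched in a few lines: if other hypothesized occurrences of the same keyword fall nearby in time, a candidate hit becomes more plausible and its confidence is raised. The keyword, timestamps, confidences, window, and additive boost below are all hypothetical illustrations; the paper trains a machine-learning model on burstiness features rather than applying a fixed rule like this one.

```python
# Hypothetical keyword hits as (time_in_seconds, confidence) pairs.
hits = {"zebra": [(12.0, 0.4), (14.5, 0.6), (300.0, 0.3)]}

def burst_rescore(hit_list, window=30.0, boost=0.1):
    """Raise the confidence of a hit for each other hit of the same
    keyword within `window` seconds, capped at 1.0. A crude stand-in
    for a learned burstiness-based rescorer."""
    out = []
    for t, c in hit_list:
        n_neighbours = sum(
            1 for t2, _ in hit_list if t2 != t and abs(t2 - t) <= window
        )
        out.append((t, min(1.0, c + boost * n_neighbours)))
    return out

rescored = burst_rescore(hits["zebra"])
```

Here the two hits at 12.0 s and 14.5 s reinforce each other, while the isolated hit at 300.0 s keeps its original confidence, mirroring the intuition that topical words arrive in bursts.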