Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages
Copyright © 2015 ISCA. Keyword spotting (KWS) for low-resource languages has drawn increasing attention in recent years. The state-of-the-art KWS systems are based on lattices or Confusion Networks (CN) generated by Automatic Speech Recognition (ASR) systems. It has been shown that considerable KWS gains can be obtained by combining the keyword detection results from different forms of ASR systems, e.g., Tandem and Hybrid systems. This paper investigates an alternative combination scheme for KWS using joint decoding. This scheme treats a Tandem system and a Hybrid system as two separate streams, and makes a linear combination of the individual acoustic model log-likelihoods. Joint decoding is more efficient as it requires just a single pass of decoding and a single pass of keyword search. Experiments on six Babel OP2 development languages show that joint decoding is capable of providing consistent gains over each individual system. Moreover, it is possible to efficiently rescore the joint decoding lattices with Tandem or Hybrid acoustic models, and further KWS gains can be obtained by merging the detection posting lists from the joint decoding lattices and the rescored lattices.
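The core of the described combination is a per-frame, per-state weighted sum of the two streams' acoustic log-likelihoods. A minimal sketch of that operation (the stream weight and the toy dimensions are illustrative assumptions, not values from the paper):

```python
import numpy as np

def joint_acoustic_scores(logp_tandem, logp_hybrid, weight=0.5):
    """Linearly combine per-frame, per-state acoustic log-likelihoods
    from two streams, as in joint decoding of Tandem and Hybrid systems."""
    return weight * logp_tandem + (1.0 - weight) * logp_hybrid

# toy example: 3 frames x 4 HMM states of normalised likelihoods
rng = np.random.default_rng(0)
logp_t = np.log(rng.dirichlet(np.ones(4), size=3))
logp_h = np.log(rng.dirichlet(np.ones(4), size=3))
combined = joint_acoustic_scores(logp_t, logp_h, weight=0.6)
assert combined.shape == (3, 4)
```

In practice the stream weight would be tuned on development data; a single decoding pass then runs over the combined scores.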
Low-resource speech recognition and keyword-spotting
© Springer International Publishing AG 2017. The IARPA Babel program ran from March 2012 to November 2016. The aim of the program was to develop agile and robust speech technology that can be rapidly applied to any human language in order to provide effective search capability on large quantities of real-world data. This paper will describe some of the developments in speech recognition and keyword-spotting during the lifetime of the project. Two technical areas will be briefly discussed, with a focus on techniques developed at Cambridge University: the application of deep learning for low-resource speech recognition; and efficient approaches for keyword spotting. Finally, a brief analysis of the Babel speech and language characteristics and language performance will be presented.
Morph-to-word transduction for accurate and efficient automatic speech recognition and keyword search
© 2017 IEEE. Word units are a popular choice in statistical language modelling. For inflective and agglutinative languages this choice may result in a high out-of-vocabulary rate. Subword units, such as morphs, provide an interesting alternative to words. These units can be derived in an unsupervised fashion and empirically show lower out-of-vocabulary rates. This paper proposes a morph-to-word transduction to convert morph sequences into word sequences. This enables powerful word language models to be applied. In addition, it is expected that techniques such as pruning, confusion network decoding, keyword search and many others may benefit from word rather than morph level decision making. However, word or morph systems alone may not achieve optimal performance in tasks such as keyword search, so a combination is typically employed. This paper proposes a single-index approach that enables word, morph and phone searches to be performed over a single morph index. Experiments are conducted on IARPA Babel program languages, including the surprise languages of the OpenKWS 2015 and 2016 competitions.
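Assuming a Morfessor-style segmentation in which non-final morphs within a word carry a trailing "+" continuation marker (a common convention; the paper's transduction itself is built with finite-state machinery), the morph-to-word mapping can be sketched as:

```python
def morphs_to_words(morphs):
    """Join a morph sequence back into words, assuming non-final
    morphs within a word are marked with a trailing '+'."""
    words, current = [], ""
    for m in morphs:
        if m.endswith("+"):          # word continues with the next morph
            current += m[:-1]
        else:                        # word-final morph: emit the word
            words.append(current + m)
            current = ""
    if current:                      # dangling continuation morph, if any
        words.append(current)
    return words

print(morphs_to_words(["un+", "help+", "ful", "dog+", "s"]))
# → ['unhelpful', 'dogs']
```

Once morph lattices are transduced this way, standard word-level language models and word-level decision making (pruning, confusion networks, search) apply directly.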
Stimulated training for automatic speech recognition and keyword search in limited resource conditions
© 2017 IEEE. Training neural network acoustic models on limited quantities of data is a challenging task. A number of techniques have been proposed to improve generalisation. This paper investigates one such technique called stimulated training. It enables standard criteria such as cross-entropy to enforce spatial constraints on activations originating from different units. Having different regions be active depending on the input may help the network discriminate better and, as a consequence, yield lower error rates. This paper investigates stimulated training for automatic speech recognition of a number of languages representing different families, alphabets, phone sets and vocabulary sizes. In particular, it looks at ensembles of stimulated networks to ensure that improved generalisation will withstand system combination effects. In order to assess stimulated training beyond 1-best transcription accuracy, this paper looks at keyword search as a proxy for assessing the quality of lattices. Experiments are conducted on IARPA Babel program languages, including the surprise language of the OpenKWS 2016 competition.
End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations
Conventional keyword search systems operate on automatic speech recognition (ASR) outputs, which causes them to have a complex indexing and search pipeline. This has led to interest in ASR-free approaches to simplify the search procedure. We recently proposed a neural ASR-free keyword search model which achieves competitive performance while maintaining an efficient and simplified pipeline, where queries and documents are encoded with a pair of recurrent neural network encoders and the encodings are combined with a dot-product. In this article, we extend this work with multilingual pretraining and detailed analysis of the model. Our experiments show that the proposed multilingual training significantly improves the model performance and that, despite not matching a strong ASR-based conventional keyword search system for short queries and queries comprising in-vocabulary words, the proposed model outperforms the ASR-based system for long queries and queries that do not appear in the training data. Comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 202
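A toy sketch of the dot-product scoring described above, with untrained vanilla-RNN encoders standing in for the trained recurrent encoders (all weights, dimensions and the sigmoid read-out here are illustrative assumptions):

```python
import numpy as np

def rnn_encode(seq, Wx, Wh, b):
    """Encode a feature sequence with a vanilla RNN; return the final state."""
    h = np.zeros(Wh.shape[0])
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

rng = np.random.default_rng(1)
d_in, d_h = 8, 16
# separate (hypothetical) encoders for the query and the document
q_params = [rng.normal(size=s) for s in [(d_h, d_in), (d_h, d_h), (d_h,)]]
d_params = [rng.normal(size=s) for s in [(d_h, d_in), (d_h, d_h), (d_h,)]]

query = rng.normal(size=(5, d_in))    # e.g. 5 frames of query features
doc = rng.normal(size=(50, d_in))     # 50 frames of document features

# the two encodings are combined with a dot-product, as in the abstract
score = rnn_encode(query, *q_params) @ rnn_encode(doc, *d_params)
p_hit = 1.0 / (1.0 + np.exp(-score))  # detection probability
```

The appeal of this design is that search reduces to encoding and a dot-product, with no ASR lattice indexing pipeline in between.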
Improving Interpretability and Regularization in Deep Learning
© IEEE. Deep learning approaches yield state-of-the-art performance in a range of tasks, including automatic speech recognition. However, the highly distributed representations in a deep neural network (DNN) or other network variants are difficult to analyse, making further parameter interpretation and regularisation challenging. This paper presents a regularisation scheme acting on the activation function outputs to improve network interpretability and regularisation. The proposed approach, referred to as activation regularisation, encourages activation function outputs to satisfy a target pattern. By defining appropriate target patterns, different learning concepts can be imposed on the network. This method can aid network interpretability and also has the potential to reduce over-fitting. The scheme is evaluated on several continuous speech recognition tasks: the Wall Street Journal continuous speech recognition task, eight conversational telephone speech tasks from the IARPA Babel program and a U.S. English broadcast news task. On all the tasks, activation regularisation achieved consistent performance gains over the standard DNN baselines.
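A minimal sketch of the idea: add a penalty that pulls activation outputs towards a chosen target pattern on top of the standard cross-entropy criterion. The quadratic penalty, the all-zeros target and the regularisation weight below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def activation_regularised_loss(logits, labels, activations, targets, lam=0.1):
    """Cross-entropy plus a penalty pulling activation function outputs
    towards a predefined target pattern (activation regularisation sketch)."""
    # standard cross-entropy over the output classes
    logp = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    ce = -np.mean(logp[np.arange(len(labels)), labels])
    # squared distance between activations and their target pattern
    reg = np.mean((activations - targets) ** 2)
    return ce + lam * reg

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))   # 4 frames, 10 output classes
labels = np.array([1, 3, 5, 7])
acts = rng.uniform(size=(4, 32))    # hidden activation outputs
target = np.zeros((4, 32))          # e.g. a sparsity-style target pattern
loss = activation_regularised_loss(logits, labels, acts, target)
```

Different choices of target pattern impose different learning concepts on the network, which is what makes the scheme useful for interpretability as well as regularisation.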
Multi-language neural network language models
Recently there has been a lot of interest in neural network based language models. These models typically consist of vocabulary-dependent input and output layers and one or more vocabulary-independent hidden layers. One standard issue with these approaches is that large quantities of training data are needed to ensure robust parameter estimates. This poses a significant problem when only limited data is available. One possible way to address this issue is augmentation: model-based, in the form of language model interpolation, or data-based, in the form of data augmentation. However, these approaches may not always be possible to use due to the vocabulary-dependent input and output layers. This seriously restricts the nature of the data that can be used for augmentation. This paper describes a general solution whereby only the vocabulary-independent hidden layers are augmented. Such an approach makes it possible to examine augmentation from previously impossible domains. Moreover, this approach paves a direct way towards multi-task learning with these models. As a proof of concept, this paper examines the use of multilingual data for augmenting the hidden layers of recurrent neural network language models. Experiments are conducted using a set of language packs released within the IARPA Babel program.
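The architecture can be sketched as vocabulary-dependent input and output layers per language wrapped around a shared, vocabulary-independent hidden layer. The two hypothetical languages, vocabulary sizes and dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_e, d_h = 16, 32
vocab = {"lang_A": 100, "lang_B": 120}   # hypothetical vocabulary sizes

# vocabulary-dependent input/output layers, one set per language
emb = {l: rng.normal(size=(v, d_e)) for l, v in vocab.items()}
out = {l: rng.normal(size=(v, d_h)) for l, v in vocab.items()}
# vocabulary-independent recurrent hidden layer, shared across languages
Wx, Wh = rng.normal(size=(d_h, d_e)), rng.normal(size=(d_h, d_h))

def next_word_logits(lang, word_ids):
    """Run the shared recurrent layer over a word history and project
    through that language's own output layer."""
    h = np.zeros(d_h)
    for w in word_ids:
        h = np.tanh(Wx @ emb[lang][w] + Wh @ h)
    return out[lang] @ h   # logits over that language's vocabulary
```

Because only the shared hidden weights see data from every language, multilingual text can augment the model without the vocabularies ever needing to match.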
Very Deep Convolutional Neural Networks for Robust Speech Recognition
This paper describes the extension and optimization of our previous work on very deep convolutional neural networks (CNNs) for effective recognition of noisy speech in the Aurora 4 task. The appropriate number of convolutional layers, the sizes of the filters, pooling operations and input feature maps are all modified: the filter and pooling sizes are reduced and the dimensions of the input feature maps are extended to allow adding more convolutional layers. Furthermore, appropriate input padding and input feature map selection strategies are developed. In addition, an adaptation framework using joint training of the very deep CNN with auxiliary i-vector and fMLLR features is developed. These modifications give substantial word error rate reductions over the standard CNN used as the baseline. Finally, the very deep CNN is combined with an LSTM-RNN acoustic model, and it is shown that state-level weighted log-likelihood score combination in a joint acoustic model decoding scheme is very effective. On the Aurora 4 task, the very deep CNN achieves a WER of 8.81%, improved further to 7.99% with auxiliary-feature joint training and to 7.09% with LSTM-RNN joint decoding. Comment: accepted by SLT 201
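The state-level score combination amounts to a weighted sum of the two models' log-likelihoods for each HMM state at each frame. A minimal sketch (the weight and toy dimensions are illustrative assumptions; in practice the weight would be tuned on development data):

```python
import numpy as np

def combined_state_scores(logp_cnn, logp_lstm, w_cnn=0.5):
    """State-level weighted log-likelihood combination of two acoustic
    models, as in joint CNN + LSTM-RNN decoding."""
    return w_cnn * logp_cnn + (1.0 - w_cnn) * logp_lstm

rng = np.random.default_rng(3)
logp_cnn = np.log(rng.dirichlet(np.ones(6), size=10))   # 10 frames, 6 states
logp_lstm = np.log(rng.dirichlet(np.ones(6), size=10))
scores = combined_state_scores(logp_cnn, logp_lstm, w_cnn=0.7)
best_states = scores.argmax(axis=1)  # frame-level best states, for illustration
```

A full decoder would of course run a lattice search over these combined scores rather than a per-frame argmax.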
Multilingual representations for low resource speech recognition and keyword search
© 2015 IEEE. This paper examines the impact of multilingual (ML) acoustic representations on Automatic Speech Recognition (ASR) and keyword search (KWS) for low resource languages in the context of the OpenKWS15 evaluation of the IARPA Babel program. The task is to develop Swahili ASR and KWS systems within two weeks using as little as 3 hours of transcribed data. Multilingual acoustic representations proved to be crucial for building these systems under strict time constraints. The paper discusses several key insights on how these representations are derived and used. First, we present a data sampling strategy that can speed up the training of multilingual representations without appreciable loss in ASR performance. Second, we show that fusion of diverse multilingual representations developed at different LORELEI sites yields substantial ASR and KWS gains. Speaker adaptation and data augmentation of these representations improve both ASR and KWS performance (up to 8.7% relative). Third, incorporating un-transcribed data through semi-supervised learning improves WER and KWS performance. Finally, we show that these multilingual representations significantly improve ASR and KWS performance (relative 9% for WER and 5% for MTWV) even when forty hours of transcribed audio in the target language is available. Multilingual representations significantly contributed to the LORELEI KWS systems winning the OpenKWS15 evaluation.
Impact of ASR performance on free speaking language assessment
In free speaking tests, candidates respond in spontaneous speech to prompts. This form of test allows the spoken language proficiency of a non-native speaker of English to be assessed more fully than read-aloud tests. As the candidate's responses are unscripted, transcription by automatic speech recognition (ASR) is essential for automated assessment. ASR will never be 100% accurate, so any assessment system must seek to minimise and mitigate ASR errors. This paper considers the impact of ASR errors on the performance of free speaking test auto-marking systems. First, rich linguistically-related features, based on part-of-speech tags from statistical parse trees, are investigated for assessment. Then, the impact of ASR errors on how well the system can detect whether a learner's answer is relevant to the question asked is evaluated. Finally, the impact that these errors may have on the ability of the system to provide detailed feedback to the learner is analysed. In particular, pronunciation and grammatical errors are considered, as these are important in helping a learner to make progress. As feedback resulting from an ASR error would be highly confusing, an approach to mitigating this problem using confidence scores is also analysed.
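One way the confidence-score mitigation can be sketched is to suppress feedback attached to words whose ASR confidence falls below a threshold, so that likely recognition errors do not produce confusing feedback. The threshold value and the item format below are illustrative assumptions, not the paper's actual system:

```python
def filter_feedback(feedback_items, conf_threshold=0.8):
    """Keep only feedback attached to confidently recognised words.
    Each item is a (word, asr_confidence, feedback_message) tuple."""
    return [(word, msg)
            for word, conf, msg in feedback_items
            if conf >= conf_threshold]

items = [("speek", 0.45, "possible pronunciation error"),
         ("grammar", 0.95, "subject-verb agreement error")]
print(filter_feedback(items))
# → [('grammar', 'subject-verb agreement error')]
```

The trade-off is recall: raising the threshold hides more erroneous feedback but also withholds valid feedback on correctly recognised words.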