
    Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model

    Multilingual models for Automatic Speech Recognition (ASR) are attractive as they have been shown to benefit from more training data, and better lend themselves to adaptation to under-resourced languages. However, initialisation from monolingual context-dependent models leads to an explosion of context-dependent states. Connectionist Temporal Classification (CTC) is a potential solution to this as it performs well with monophone labels. We investigate multilingual CTC in the context of adaptation and regularisation techniques that have been shown to be beneficial in more conventional contexts. The multilingual model is trained to model a universal International Phonetic Alphabet (IPA)-based phone set using the CTC loss function. Learning Hidden Unit Contribution (LHUC) is investigated to perform language adaptive training. In addition, dropout during cross-lingual adaptation is also studied and tested in order to mitigate the overfitting problem. Experiments show that the performance of the universal phoneme-based CTC system can be improved by applying LHUC, and that it is extensible to new phonemes during cross-lingual adaptation. Updating all the parameters shows consistent improvement on limited data. Applying dropout during adaptation can further improve the system and achieve performance competitive with Deep Neural Network / Hidden Markov Model (DNN/HMM) systems on limited data.
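    As a concrete illustration of the LHUC and CTC components mentioned above, the following is a minimal PyTorch-style sketch; it is not the paper's implementation, and the feature dimension, hidden size, phone-set size and network depth are illustrative assumptions.

        # Minimal sketch of LHUC-style language-adaptive scaling feeding a CTC loss.
        # All dimensions (features, hidden units, phone inventory) are illustrative,
        # not the configuration used in the paper.
        import torch
        import torch.nn as nn

        class LHUCLayer(nn.Module):
            """Scale each hidden unit by a per-language learnable amplitude in (0, 2)."""
            def __init__(self, hidden_dim, num_languages):
                super().__init__()
                self.r = nn.Parameter(torch.zeros(num_languages, hidden_dim))

            def forward(self, h, lang_id):
                # h: (batch, time, hidden_dim); broadcast the language-specific scales
                return h * (2.0 * torch.sigmoid(self.r[lang_id]))

        class MultilingualCTCModel(nn.Module):
            def __init__(self, feat_dim=40, hidden_dim=320, num_phones=100, num_languages=4):
                super().__init__()
                self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=3, batch_first=True)
                self.lhuc = LHUCLayer(hidden_dim, num_languages)
                self.output = nn.Linear(hidden_dim, num_phones + 1)  # index 0 = CTC blank

            def forward(self, feats, lang_id):
                h, _ = self.encoder(feats)
                h = self.lhuc(h, lang_id)
                return self.output(h).log_softmax(dim=-1)

        model = MultilingualCTCModel()
        ctc = nn.CTCLoss(blank=0)
        feats = torch.randn(8, 200, 40)                      # (batch, frames, features)
        log_probs = model(feats, lang_id=2)                  # language-adaptive forward pass
        targets = torch.randint(1, 101, (8, 30))             # universal IPA phone labels
        loss = ctc(log_probs.transpose(0, 1),                # CTCLoss expects (time, batch, classes)
                   targets,
                   torch.full((8,), 200, dtype=torch.long),
                   torch.full((8,), 30, dtype=torch.long))

    In LHUC-style adaptation, typically only the per-language scaling vector (self.r above) is updated for a new language while the shared encoder stays fixed; the "updating all the parameters" setting mentioned in the abstract corresponds instead to unfreezing the whole network.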

    A spoken document retrieval application in the oral history domain

    The application of automatic speech recognition in the broadcast news domain is well studied. Recognition performance is generally high and accordingly, spoken document retrieval can successfully be applied in this domain, as demonstrated by a number of commercial systems. In other domains, a similar recognition performance is hard to obtain, or even far out of reach, for example due to lack of suitable training material. This is a serious impediment to the successful application of spoken document retrieval techniques to data other than news. This paper outlines our first steps towards a retrieval system that can automatically be adapted to new domains. We discuss our experience with a recently implemented spoken document retrieval application attached to a web-portal that aims at the disclosure of a multimedia data collection in the oral history domain. The paper illustrates that simply deploying an off-the-shelf broadcast news system in this task domain will produce error rates that are too high to be useful for retrieval tasks. By applying adaptation techniques on the acoustic level and language model level, system performance can be improved considerably, but additional research on unsupervised adaptation and search interfaces is required to create an adequate search environment based on speech transcripts.
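    The abstract mentions adaptation at the language-model level; one widely used technique for this (a generic method, not necessarily the one applied in this work) is linear interpolation of a background broadcast-news n-gram model with a small in-domain model. A minimal sketch, where p_background and p_indomain are assumed to be callables returning conditional word probabilities:

        # Sketch of language-model adaptation by linear interpolation.
        # p_background / p_indomain are placeholders for two trained n-gram models;
        # the mixture weight is tuned on held-out in-domain (oral history) text.
        import math

        def interpolated_prob(word, history, p_background, p_indomain, lam=0.7):
            """P(word | history) as a mix of an out-of-domain and an in-domain LM."""
            return lam * p_background(word, history) + (1.0 - lam) * p_indomain(word, history)

        def tune_lambda(held_out, p_background, p_indomain):
            """Pick the mixture weight minimising perplexity on held-out (history, word) pairs."""
            best_lam, best_ppl = None, float("inf")
            for lam in [i / 10 for i in range(1, 10)]:
                log_prob = sum(math.log(interpolated_prob(w, h, p_background, p_indomain, lam))
                               for h, w in held_out)
                ppl = math.exp(-log_prob / len(held_out))
                if ppl < best_ppl:
                    best_lam, best_ppl = lam, ppl
            return best_lam, best_ppl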

    Deep Learning for Audio Signal Processing

    Given the recent surge of developments in deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.
    Comment: 15 pages, 2 PDF figures
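    Of the feature representations the review highlights, the log-mel spectrogram is the most common input to the convolutional and recurrent models listed above. A minimal extraction sketch, assuming the librosa library is available; the window, hop, and mel-band settings are illustrative defaults, not values prescribed by the article:

        # Sketch: log-mel spectrogram extraction from a raw waveform file.
        # A 25 ms frame, 10 ms hop (at 16 kHz) and 80 mel bands are common
        # choices, used here purely for illustration.
        import librosa
        import numpy as np

        def log_mel_spectrogram(path, sr=16000, n_fft=400, hop_length=160, n_mels=80):
            y, _ = librosa.load(path, sr=sr)                          # raw waveform
            mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                                 hop_length=hop_length, n_mels=n_mels)
            return librosa.power_to_db(mel, ref=np.max)               # shape: (n_mels, frames)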

    Probabilistic Models of Short and Long Distance Word Dependencies in Running Text

    This article describes two complementary models that represent dependencies between words in local and non-local contexts. The local dependencies considered are sequences of part-of-speech categories for words. The non-local context of word dependency considered here is that of word recurrence, which is typical in a text. Both are models of phenomena that are to a reasonable extent domain independent, and thus are useful for doing prediction in systems using large vocabularies. Modeling part-of-speech sequences: a common method for modeling local word dependencies is by means of second-order Markov models (also known as trigram models). In such a model the context for predicting word w_i at position i in a text consists of the two words w_{i-1}, w_{i-2} that precede it. The model is built from conditional probabilities: P(w_i | w_{i-1}, w_{i-2}). The parameters of a part-of-speech (POS) model are of the form: P(w_i | C_i) × P(C_i | C_{i-1}, C_{i-2}).
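    Both factorisations are straightforward to estimate from counts. A minimal sketch, assuming tokenised sentences and (word, tag) pairs as input; maximum-likelihood estimates only, with no smoothing:

        # Sketch of the two models described above:
        #   trigram word model:  P(w_i | w_{i-1}, w_{i-2})
        #   POS-based model:     P(w_i | C_i) * P(C_i | C_{i-1}, C_{i-2})
        # Raw counts only; a real system would add smoothing and back-off.
        from collections import defaultdict

        def train_trigram(sentences):
            tri, bi = defaultdict(int), defaultdict(int)
            for words in sentences:
                padded = ["<s>", "<s>"] + words
                for i in range(2, len(padded)):
                    tri[(padded[i-2], padded[i-1], padded[i])] += 1
                    bi[(padded[i-2], padded[i-1])] += 1
            def p(w, prev1, prev2):  # P(w | prev1, prev2), prev1 = most recent word
                hist = (prev2, prev1)
                return tri[hist + (w,)] / bi[hist] if bi[hist] else 0.0
            return p

        def train_pos_model(tagged_sentences):
            emit, tag_count = defaultdict(int), defaultdict(int)
            tag_tri, tag_bi = defaultdict(int), defaultdict(int)
            for pairs in tagged_sentences:                 # pairs: [(word, tag), ...]
                tags = ["<s>", "<s>"] + [t for _, t in pairs]
                for w, t in pairs:
                    emit[(t, w)] += 1
                    tag_count[t] += 1
                for i in range(2, len(tags)):
                    tag_tri[(tags[i-2], tags[i-1], tags[i])] += 1
                    tag_bi[(tags[i-2], tags[i-1])] += 1
            def p(w, c, c1, c2):  # P(w | c) * P(c | c1, c2), c1 = most recent tag
                p_emit = emit[(c, w)] / tag_count[c] if tag_count[c] else 0.0
                hist = (c2, c1)
                p_trans = tag_tri[hist + (c,)] / tag_bi[hist] if tag_bi[hist] else 0.0
                return p_emit * p_trans
            return p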

    Evolutionary discriminative confidence estimation for spoken term detection

    The final publication is available at Springer via http://dx.doi.org/10.1007/s11042-011-0913-z
    Spoken term detection (STD) is the task of searching for occurrences of spoken terms in audio archives. It relies on robust confidence estimation to make a hit/false alarm (FA) decision. In order to optimize the decision in terms of the STD evaluation metric, the confidence has to be discriminative. Multi-layer perceptrons (MLPs) and support vector machines (SVMs) exhibit good performance in producing discriminative confidence; however they are severely limited by the continuous objective functions, and are therefore less capable of dealing with complex decision tasks. This leads to a substantial performance reduction when measuring detection of out-of-vocabulary (OOV) terms, where the high diversity in term properties usually leads to a complicated decision boundary. In this paper we present a new discriminative confidence estimation approach based on evolutionary discriminant analysis (EDA). Unlike MLPs and SVMs, EDA uses the classification error as its objective function, resulting in a model optimized towards the evaluation metric. In addition, EDA combines heterogeneous projection functions and classification strategies in decision making, leading to a highly flexible classifier that is capable of dealing with complex decision tasks. Finally, the evolutionary strategy of EDA reduces the risk of local minima. We tested the EDA-based confidence with a state-of-the-art phoneme-based STD system on an English meeting domain corpus, which employs a phoneme speech recognition system to produce lattices within which the phoneme sequences corresponding to the enquiry terms are searched. The test corpora comprise 11 hours of speech data recorded with individual head-mounted microphones from 30 meetings carried out at several institutes including ICSI; NIST; ISL; LDC; the Virginia Polytechnic Institute and State University; and the University of Edinburgh. The experimental results demonstrate that EDA considerably outperforms MLPs and SVMs on both classification and confidence measurement in STD, and the advantage is found to be more significant on OOV terms than on in-vocabulary (INV) terms. In terms of classification performance, EDA achieved an equal error rate (EER) of 11% on OOV terms, compared to 34% and 31% with MLPs and SVMs respectively; for INV terms, an EER of 15% was obtained with EDA compared to 17% obtained with MLPs and SVMs. In terms of STD performance for OOV terms, EDA presented a significant relative improvement of 1.4% and 2.5% in average term-weighted value (ATWV) over MLPs and SVMs respectively.
    This work was partially supported by the French Ministry of Industry (Innovative Web call) under contract 09.2.93.0966, ‘Collaborative Annotation for Video Accessibility’ (ACAV) and by ‘The Adaptable Ambient Living Assistant’ (ALIAS) project funded through the joint national Ambient Assisted Living (AAL) programme.
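    The core difference from the MLP and SVM baselines is that EDA optimises the non-differentiable classification error directly through evolutionary search. A highly simplified sketch of that idea, using a (1+1) evolution strategy on a plain linear scorer; the EDA described in the paper additionally evolves heterogeneous projection functions and classification strategies:

        # Simplified sketch: evolve a hit / false-alarm decision function using
        # 0/1 classification error as the fitness, rather than a continuous
        # surrogate loss.  A (1+1) evolution strategy on a linear scorer only;
        # illustrative, not the paper's EDA.
        import numpy as np

        def evolve_confidence_classifier(X, y, generations=2000, sigma=0.1, seed=0):
            """X: (n, d) detection features; y: (n,) labels, hit=1 / false alarm=0."""
            rng = np.random.default_rng(seed)
            w = rng.normal(size=X.shape[1] + 1)                  # weights + bias

            def error(w):
                scores = X @ w[:-1] + w[-1]
                return np.mean((scores > 0).astype(int) != y)    # 0/1 classification error

            best_err = error(w)
            for _ in range(generations):
                child = w + sigma * rng.normal(size=w.shape)     # mutate
                child_err = error(child)
                if child_err <= best_err:                        # select: keep if no worse
                    w, best_err = child, child_err
            return w, best_err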

    Automatic Speech Recognition for Speech Assessment of Persian Preschool Children

    Preschool evaluation is crucial because it gives teachers and parents influential knowledge about children's growth and development. The COVID-19 pandemic has highlighted the necessity of online assessment for preschool children. One of the areas that should be tested is their ability to speak. Employing an existing Automatic Speech Recognition (ASR) system is ineffective, since such systems are pre-trained on voices that differ from children's voices in frequency and amplitude. We constructed an ASR for our cognitive test system to solve this issue, using the Wav2Vec 2.0 model with a new pre-training objective called Random Frequency Pitch (RFP). In addition, we used our new dataset to fine-tune our model for the Meaningless Words (MW) and Rapid Automatic Naming (RAN) tests. Our new approach reaches a Word Error Rate (WER) of 6.45 on the Persian section of the CommonVoice dataset. Furthermore, our novel methodology produces positive outcomes in zero- and few-shot scenarios.
    Comment: 8 pages, 5 figures, 4 tables, 1 algorithm
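    The 6.45 figure above is a standard word error rate: the word-level edit distance (substitutions, insertions and deletions) between the recognised hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch of that computation:

        # Word error rate: word-level Levenshtein distance normalised by reference length.
        def wer(reference: str, hypothesis: str) -> float:
            ref, hyp = reference.split(), hypothesis.split()
            # dp[i][j] = edit distance between ref[:i] and hyp[:j]
            dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                dp[i][0] = i
            for j in range(len(hyp) + 1):
                dp[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    sub = dp[i-1][j-1] + (ref[i-1] != hyp[j-1])
                    dp[i][j] = min(sub, dp[i-1][j] + 1, dp[i][j-1] + 1)
            return dp[len(ref)][len(hyp)] / len(ref)

        # One deleted word out of six reference words -> WER = 1/6
        print(wer("the cat sat on the mat", "the cat sat on mat"))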