642 research outputs found

    Stochastic Pronunciation Modelling for Out-of-Vocabulary Spoken Term Detection

    Get PDF
    Spoken term detection (STD) is the name given to the task of searching large amounts of audio for occurrences of spoken terms, which are typically single words or short phrases. One reason that STD is a hard task is that search terms tend to contain a disproportionate number of out-of-vocabulary (OOV) words. The most common approach to STD uses subword units. This, in conjunction with some method for predicting pronunciations of OOVs from their written form, enables the detection of OOV terms, but performance is considerably worse than for in-vocabulary terms. This performance differential can be largely attributed to the special properties of OOVs, one of which is the high degree of uncertainty in their pronunciation. We present a stochastic pronunciation model (SPM) which explicitly deals with this uncertainty. The key insight is to search for all possible pronunciations when detecting an OOV term, thereby capturing the uncertainty in pronunciation directly. This requires a probabilistic model of pronunciation, able to estimate a distribution over all possible pronunciations. We use a joint-multigram model (JMM) for this and compare the JMM-based SPM with the conventional soft-match approach. Experiments using speech from the meetings domain demonstrate that the SPM performs better than soft match in most operating regions, especially at low false alarm probabilities. Furthermore, the SPM and soft match are found to be complementary: their combination provides further performance gains.
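
    The following is a minimal sketch, not the authors' implementation, of how a stochastic pronunciation model can score an OOV term: detection evidence is summed over the n-best pronunciations predicted by the joint-multigram model, weighted by their probabilities. The predict_pronunciations and search_lattice callables are assumptions standing in for the JMM and the subword detector.

```python
# Sketch of SPM-style scoring for an OOV search term (illustrative only).
def spm_score(term, predict_pronunciations, search_lattice, n_best=10):
    """Sum detector scores over candidate pronunciations, weighted by the
    pronunciation probabilities estimated by a joint-multigram model."""
    total = 0.0
    # predict_pronunciations is assumed to yield (phone_sequence, probability) pairs
    for phones, p_pron in predict_pronunciations(term, n_best=n_best):
        # search_lattice is assumed to return the detector's score for this
        # phone sequence in the indexed audio
        total += p_pron * search_lattice(phones)
    return total
```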

    Out-of-vocabulary spoken term detection

    Get PDF
    Spoken term detection (STD) is a fundamental task for multimedia information retrieval. A major challenge faced by an STD system is the serious performance degradation when detecting out-of-vocabulary (OOV) terms. The difficulties arise not only from the absence of pronunciations for such terms in the system dictionaries, but also from the intrinsic uncertainty in their pronunciations, the significant diversity in term properties and the weakness of acoustic and language modelling for these terms. To tackle the OOV issue, we first apply the joint-multigram model to predict pronunciations for OOV terms in a stochastic way. Based on this, we propose a stochastic pronunciation model that considers all possible pronunciations of an OOV term, so that the high pronunciation uncertainty is compensated for. Furthermore, to deal with the diversity in term properties, we propose a term-dependent discriminative decision strategy, which employs discriminative models to integrate multiple informative factors and confidence measures into a classification probability that yields the minimum decision cost. In addition, to address the weakness in acoustic and language modelling, we propose a direct posterior confidence measure which replaces the generative models with a discriminative model, such as a multi-layer perceptron (MLP), to obtain a robust confidence estimate for OOV term detection. With these novel techniques, the STD performance on OOV terms was improved substantially and significantly in our experiments on meeting speech data.
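
    As a rough illustration of the term-dependent discriminative decision, the sketch below maps a few informative factors for a candidate detection to a classification probability with a logistic model and accepts the detection when that probability exceeds the minimum-expected-cost threshold. The features, weights and costs are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def detection_probability(features, weights, bias):
    """Logistic model combining informative factors and confidence measures
    into a probability that the candidate detection is correct."""
    z = float(np.dot(weights, features)) + bias
    return 1.0 / (1.0 + np.exp(-z))

def accept(features, weights, bias, c_miss=1.0, c_fa=0.1):
    """Accept when the expected cost of rejecting exceeds that of accepting."""
    p = detection_probability(features, weights, bias)
    threshold = c_fa / (c_fa + c_miss)  # minimum-expected-cost decision threshold
    return p > threshold

# Example factors: raw confidence, term length in phones, log prior of the term
features = np.array([0.72, 6.0, -3.5])
weights = np.array([4.0, 0.1, 0.05])   # illustrative weights only
print(accept(features, weights, bias=-2.0))
```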

    Advances in deep learning methods for speech recognition and understanding

    Full text link
    This work presents several studies in the areas of speech recognition and understanding. Semantic spoken language understanding is an important sub-domain of the broader field of artificial intelligence. Speech processing has long interested researchers, since language is one of the defining characteristics of a human being. With the development of artificial neural networks, the field has seen rapid progress both in terms of accuracy and of human perception. Another important milestone was the development of end-to-end approaches, which allow co-adaptation of all parts of the model, increasing performance and simplifying the training procedure. End-to-end models became feasible with the increasing amount of available data, growing computational resources and, most importantly, many novel architectural developments. Nevertheless, traditional (non end-to-end) approaches remain relevant for speech processing because of challenging data in noisy environments, accented speech and the high variety of dialects. In the first work, we explore hybrid speech recognition in noisy environments. We propose to treat recognition under unseen noise conditions as a domain adaptation task, using the then-novel technique of adversarial domain adaptation. In a nutshell, that prior work proposed training features so that they are discriminative for the primary task but non-discriminative for a secondary task, constructed to be the domain recognition task; the trained features are thus invariant to the domain at hand. In our work, we adopt this technique and modify it for the task of noisy speech recognition. In the second work, we develop a general method for regularizing generative recurrent networks. Recurrent networks frequently have difficulty staying on the same track when generating long outputs; while bi-directional networks can be used for better sequence aggregation in feature learning, they are not applicable to the generative case. We developed a way to improve the consistency of generating long sequences with recurrent networks by constructing a model similar to a bi-directional network: the key insight is to use a soft L2 loss between the forward and backward generative recurrent networks. We provide an experimental evaluation on a multitude of tasks and datasets, including speech recognition, image captioning and language modeling. In the third paper, we investigate the possibility of developing an end-to-end intent recognizer for spoken language understanding. Semantic spoken language understanding is an important step towards developing human-like artificial intelligence. End-to-end approaches have shown high performance on tasks including machine translation and speech recognition, and we draw inspiration from these prior works to develop an end-to-end system for intent recognition.
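
    A small PyTorch sketch of the adversarial domain adaptation idea used in the first study: a gradient-reversal layer makes the shared features discriminative for the primary (senone) task but non-discriminative for the secondary noise-domain task. Module names, layer sizes and the reversal strength are assumptions for illustration, not the thesis's actual architecture.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the
    backward pass so the encoder learns domain-invariant features."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

encoder = nn.Sequential(nn.Linear(40, 256), nn.ReLU())  # shared acoustic encoder
senone_head = nn.Linear(256, 2000)   # primary task: senone classification
domain_head = nn.Linear(256, 4)      # secondary task: which noise condition

feats = torch.randn(8, 40)           # a batch of acoustic feature frames
h = encoder(feats)
senone_logits = senone_head(h)                           # trained normally
domain_logits = domain_head(GradReverse.apply(h, 1.0))   # gradient reversed
# In training, cross-entropy losses on both heads would be summed and backpropagated.
```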

    Multilingual Query-by-Example Keyword Spotting with Metric Learning and Phoneme-to-Embedding Mapping

    Full text link
    In this paper, we propose a multilingual query-by-example keyword spotting (KWS) system based on a residual neural network. The model is trained as a classifier on a multilingual keyword dataset extracted from Common Voice sentences and fine-tuned using circle loss. We demonstrate the generalization ability of the model to new languages and report a mean reduction in EER of 59.2% for previously seen languages and 47.9% for unseen languages compared to a competitive baseline. We show that the word embeddings learned by the KWS model can be accurately predicted from phoneme sequences using a simple LSTM model. Our system achieves promising accuracy for streaming keyword spotting and keyword search on Common Voice audio using just 5 examples per keyword. Experiments on the Hey-Snips dataset show good performance, with a false negative rate of 5.4% at only 0.1 false alarms per hour. Comment: Accepted to ICASSP 202
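
    The query-by-example matching step can be pictured as in the sketch below: a few enrolment examples are embedded and averaged into a keyword template, and a window of audio is flagged when its embedding is close enough in cosine similarity. The embed_audio function is a stand-in for the ResNet KWS encoder (per the paper, the template could instead come from the LSTM phoneme-to-embedding mapping); the threshold is an illustrative assumption.

```python
import numpy as np

def enroll(keyword_examples, embed_audio):
    """Average the L2-normalised embeddings of a few spoken examples
    into a single keyword template."""
    vecs = [embed_audio(x) for x in keyword_examples]
    vecs = [v / np.linalg.norm(v) for v in vecs]
    return np.mean(vecs, axis=0)

def detect(window, template, embed_audio, threshold=0.7):
    """Flag a detection when cosine similarity to the template is high enough."""
    v = embed_audio(window)
    v = v / np.linalg.norm(v)
    score = float(np.dot(v, template / np.linalg.norm(template)))
    return score >= threshold, score
```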

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other speech processing applications able to operate in real-world environments, such as mobile communication services and smart homes.

    Searching Spontaneous Conversational Speech: Proceedings of ACM SIGIR Workshop (SSCS2008)

    Get PDF

    Code-Switched Urdu ASR for Noisy Telephonic Environment using Data Centric Approach with Hybrid HMM and CNN-TDNN

    Full text link
    Call centers hold huge amounts of audio data that can yield valuable business insights, but transcribing phone calls manually is a tedious task. An effective automatic speech recognition (ASR) system can accurately transcribe these calls, enabling easy search through call history for specific context and content, automatic call monitoring, and improved QoS through keyword search and sentiment analysis. ASR for call centers requires extra robustness because telephonic environments are generally noisy. Moreover, many low-resourced languages on the verge of extinction could be preserved with the help of automatic speech recognition technology. Urdu is the 10th most widely spoken language in the world, with 231,295,440 speakers worldwide, yet it remains a resource-constrained language for ASR. Regional call-center conversations are held in the local language, with a mix of English numbers and technical terms, which generally causes a "code-switching" problem. Hence, this paper describes an implementation framework for a resource-efficient automatic speech recognition / speech-to-text system in a noisy call-center environment, using a chain hybrid HMM and CNN-TDNN for code-switched Urdu. The hybrid HMM-DNN approach allowed us to exploit the advantages of neural networks with less labelled data, and adding a CNN in front of the TDNN has been shown to work better in noisy environments, because the CNN's additional frequency dimension captures extra information from noisy speech and thus improves accuracy. We collected data from various open sources and labelled some of the unlabelled data after analysing its general context and content, covering Urdu as well as commonly used words from other languages, primarily English. We achieved a WER of 5.2% in both noisy and clean environments, on isolated words and numbers as well as on continuous spontaneous speech. Comment: 32 pages, 19 figures, 2 tables, preprint
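
    To illustrate the acoustic-model shape referred to above, here is a minimal PyTorch sketch of a CNN-TDNN: 2-D convolutions capture local time-frequency patterns that help in noise, and dilated 1-D (TDNN-style) convolutions add temporal context. Layer sizes, dilations and the output dimensionality are illustrative assumptions, not the paper's Kaldi chain recipe.

```python
import torch
from torch import nn

class CnnTdnn(nn.Module):
    def __init__(self, n_mels=40, n_pdfs=3000):
        super().__init__()
        # 2-D convolutions over (time, frequency) for noise-robust local features
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # TDNN-style 1-D convolutions with increasing dilation for wider context
        self.tdnn = nn.Sequential(
            nn.Conv1d(32 * n_mels, 512, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3, padding=3), nn.ReLU(),
        )
        self.output = nn.Conv1d(512, n_pdfs, kernel_size=1)  # per-frame pdf scores

    def forward(self, feats):                # feats: (batch, time, n_mels)
        x = feats.unsqueeze(1)               # (batch, 1, time, n_mels)
        x = self.cnn(x)                      # (batch, 32, time, n_mels)
        b, c, t, f = x.shape
        x = x.permute(0, 1, 3, 2).reshape(b, c * f, t)
        x = self.tdnn(x)
        return self.output(x)                # (batch, n_pdfs, time)

# Quick shape check on dummy features
print(CnnTdnn()(torch.randn(2, 100, 40)).shape)
```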
