ASR error management for improving spoken language understanding
This paper addresses the problem of automatic speech recognition (ASR) error
detection and the use of detected errors for improving spoken language
understanding (SLU) systems. In this study, the SLU task consists of
automatically extracting semantic concepts and concept/value pairs from ASR
transcriptions in, for example, a tourist information system. An approach is
proposed that enriches the set of semantic labels with error-specific labels
and uses a recently proposed neural approach based on word embeddings to
compute well-calibrated ASR confidence measures. Experimental results show
that it is possible to significantly decrease the Concept/Value Error Rate
with a state-of-the-art system, outperforming previously published results on
the same experimental data. It is also shown that, by combining an SLU
approach based on conditional random fields with a neural encoder/decoder
attention-based architecture, it is possible to effectively identify
confidence islands and uncertain semantic output segments, which are useful
for deciding on appropriate error handling actions in the dialogue manager
strategy. (Interspeech 2017, Stockholm, Sweden.)
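The abstract above turns raw ASR confidence scores into well-calibrated measures. As a minimal illustration of the calibration idea only (not the paper's neural word-embedding model), the sketch below fits logistic (Platt) scaling to map toy raw scores onto probabilities of correctness; all scores, labels, and hyperparameters are assumptions:

```python
# Hedged sketch: logistic (Platt) scaling of raw ASR confidence scores,
# fit by plain gradient descent on log-loss. Toy data, not the paper's model.
import math

def fit_platt(scores, labels, lr=0.5, epochs=2000):
    """Fit p(correct | score) = sigmoid(a * score + b)."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n  # gradient of log-loss w.r.t. slope a
            gb += (p - y) / n      # gradient of log-loss w.r.t. intercept b
        a -= lr * ga
        b -= lr * gb
    return a, b

def calibrated(score, a, b):
    """Map a raw score to a calibrated probability of correctness."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Toy raw confidences with binary correct/incorrect labels (assumed data).
raw = [0.9, 0.8, 0.85, 0.2, 0.1, 0.3]
correct = [1, 1, 1, 0, 0, 0]
a, b = fit_platt(raw, correct)
```

After fitting, high raw scores map to probabilities above 0.5 and low raw scores below it, which is the monotone behavior a downstream dialogue manager would rely on.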
Dynamic time warping applied to detection of confusable word pairs in automatic speech recognition
In this paper we present a method to predict
whether two words are likely to be confused by an
Automatic Speech Recognition (ASR) system. This
method is based on the classical Dynamic Time
Warping (DTW) technique. This technique, which
is usually used in ASR to measure the distance
between two speech signals, is used here to calculate
the distance between two words. With this distance,
the words are classified as confusable or not
confusable using a threshold. We have tested the
method in a classical false acceptance/false rejection
framework, and the Equal Error Rate (EER) was
measured to be less than 3%. (Peer Reviewed.)
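The thresholded DTW distance described above can be sketched in a few lines. This is a generic textbook DTW over toy per-frame feature vectors, not the paper's exact configuration; the feature extraction (e.g., MFCCs) and the threshold value are assumptions:

```python
# Minimal sketch of DTW-based word-pair confusability (toy features assumed).

def frame_dist(a, b):
    """Euclidean distance between two per-frame feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw_distance(seq_a, seq_b):
    """Classical DTW distance between two frame sequences."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

def is_confusable(seq_a, seq_b, threshold):
    """Classify a word pair as confusable if its DTW distance is small."""
    return dtw_distance(seq_a, seq_b) < threshold
```

In practice the threshold would be tuned on held-out word pairs to trade off false acceptances against false rejections, as in the EER evaluation the abstract reports.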
Spoken Language Intent Detection using Confusion2Vec
Decoding a speaker's intent is a crucial part of spoken language understanding
(SLU). The presence of noise or errors in text transcriptions, common in
real-life scenarios, makes the task more challenging. In this paper, we
address spoken language intent detection under the noisy conditions imposed by
automatic speech recognition (ASR) systems. We propose to employ the
confusion2vec word feature representation to compensate for the errors made by
ASR and to increase the robustness of the SLU system. Confusion2vec, motivated
by human speech production and perception, models acoustic relationships
between words in addition to the semantic and syntactic relations of words in
human language. We hypothesize that ASR errors often involve acoustically
similar words, and that confusion2vec, with its inherent model of acoustic
relationships between words, is able to compensate for them. Through
experiments on the ATIS benchmark dataset, we demonstrate that the proposed
model achieves state-of-the-art results under noisy ASR conditions. Our system
reduces the classification error rate (CER) by 20.84% and improves robustness
by 37.48% (lower CER degradation) relative to the previous state of the art
when going from clean to noisy transcripts. Improvements are also demonstrated
when training the intent detection models on noisy transcripts.
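The core intuition above is that if acoustically confusable words receive nearby vectors, an intent classifier built on those vectors absorbs ASR substitutions. The sketch below illustrates only this intuition with hand-made toy vectors and a nearest-centroid classifier; it is not the actual confusion2vec model, and every word, vector, and intent name is an assumption:

```python
# Hedged illustration: toy "acoustic-aware" word vectors, where confusable
# words ("flights" / "fights") are close, feeding a nearest-centroid
# intent classifier. Not the real confusion2vec embedding.

TOY_VECTORS = {
    "book":    [1.0, 0.0],
    "flights": [0.0, 1.0],
    "fights":  [0.1, 0.9],   # acoustically confusable with "flights"
    "cancel":  [-1.0, 0.0],
}

TOY_INTENTS = {
    "book_flight":   [0.5, 0.5],
    "cancel_flight": [-0.5, 0.5],
}

def embed(sentence):
    """Average the vectors of known words in the sentence."""
    vecs = [TOY_VECTORS[w] for w in sentence.split() if w in TOY_VECTORS]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]

def classify(sentence):
    """Assign the intent whose centroid is nearest to the sentence embedding."""
    e = embed(sentence)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(e, c)) ** 0.5
    return min(TOY_INTENTS, key=lambda k: dist(TOY_INTENTS[k]))
```

Because "fights" sits near "flights" in the toy space, the ASR-corrupted sentence "book fights" still lands on the same intent as the clean transcript.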
Who Spoke What? A Latent Variable Framework for the Joint Decoding of Multiple Speakers and their Keywords
In this paper, we present a latent variable (LV) framework to identify all
the speakers and their keywords given a multi-speaker mixture signal. We
introduce two separate LVs to denote active speakers and the keywords uttered.
The dependency of a spoken keyword on the speaker is modeled through a
conditional probability mass function. The distribution of the mixture signal
is expressed in terms of the LV mass functions and speaker-specific-keyword
models. The proposed framework admits stochastic models, representing the
probability density function of the observation vectors given that a particular
speaker uttered a specific keyword, as speaker-specific-keyword models. The LV
mass functions are estimated in a Maximum Likelihood framework using the
Expectation Maximization (EM) algorithm. The active speakers and their keywords
are detected as modes of the joint distribution of the two LVs. In mixture
signals containing two speakers uttering keywords simultaneously, the proposed
framework achieves an accuracy of 82% in detecting both the speakers and their
respective keywords, using Student's-t mixture models as
speaker-specific-keyword models. (6 pages, 2 figures. Submitted to IEEE Signal Processing Letters.)
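The EM-based estimation the abstract describes can be illustrated on a much simpler model. The sketch below runs EM on a two-component 1-D Gaussian mixture, standing in for the paper's estimation of latent-variable mass functions; the actual speaker-specific-keyword models (Student's-t mixtures over observation vectors) are far richer, and the data and initialisation here are toy assumptions:

```python
# Hedged sketch: EM for a two-component 1-D Gaussian mixture (toy stand-in
# for the paper's latent-variable estimation, not its actual models).
import math

def gauss(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em(data, iters=50):
    # crude initialisation: place the two means at the data extremes
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[k] * gauss(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate mixture weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-3)  # floor to avoid variance collapse
    return w, mu, var

# Toy observations drawn from two well-separated clusters (assumed data).
data = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05]
w, mu, var = em(data)
```

In the paper's setting the analogous E-step computes posteriors over (speaker, keyword) pairs, and the modes of that joint posterior give the detected speakers and keywords.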
An interactive speech training system with virtual reality articulation for Mandarin-speaking hearing impaired children
The present project involved the development of a novel interactive speech training system based on virtual reality articulation, and an examination of the system's efficacy for hearing impaired (HI) children. Twenty meaningful Mandarin words were presented to the HI children via a 3-D talking head during articulation training. Electromagnetic Articulography (EMA) and graphic transform technology were used to depict the movements of various articulators. In addition, speech corpora were organized into the listening and speaking training modules of the system to help improve the language skills of the HI children. The accuracy of the virtual reality articulatory movement was evaluated through a series of experiments. Finally, a pilot test was performed to train two HI children using the system. Preliminary results showed improvement in speech production by the HI children, and the system was regarded as acceptable and interesting by the children. It can be concluded that the training system is effective and valid for articulation training of HI children. © 2013 IEEE.
What automaticity deficit? Activation of lexical information by readers with dyslexia in a RAN Stroop-switch task
Reading fluency is often predicted by rapid automatized naming (RAN) speed, which, as the name implies, measures the automaticity with which familiar stimuli (e.g., letters) can be retrieved and named. Readers with dyslexia are considered to have less "automatized" access to lexical information, reflected in longer RAN times compared with nondyslexic readers. We combined the RAN task with a Stroop-switch manipulation to test the automaticity of dyslexic and nondyslexic readers' lexical access directly within a fluency task. Participants named letters in 10 × 4 arrays while eye movements and speech responses were recorded. Upon fixation, specific letter font colors changed from black to a different color, whereupon the participant was required to rapidly switch from naming the letter to naming the letter color. We could therefore measure reading group differences in "automatic" lexical processing, insofar as it was task-irrelevant. Readers with dyslexia showed obligatory lexical processing and a timeline for recognition that was overall similar to that of typical readers, but a delay emerged in the output (naming) phase. Further delay was caused by visual-orthographic competition between neighboring stimuli. Our findings outline the specific processes involved when researchers speak of "impaired automaticity" in dyslexic readers' fluency, and are discussed in the context of the broader literature in this field.