Simultaneous Multispeaker Segmentation for Automatic Meeting Recognition
Vocal activity detection is an important technology for both automatic speech recognition and automatic speech understanding. In meetings, participants typically vocalize for only a fraction of the recorded time, and standard vocal activity detection algorithms for close-talk microphones have been shown to be ineffective. This is primarily due to the problem of crosstalk, in which a participant's speech appears on other participants' microphones, making it hard to attribute detected speech to its correct speaker. We describe an automatic multichannel segmentation system for meeting recognition, which accounts for both the observed acoustics and the inferred vocal activity states of all participants using joint multi-participant models. Our experiments show that this approach almost completely eliminates the crosstalk problem. Recent improvements to the baseline reduce the development set word error rate, achieved by a state-of-the-art multi-pass speech recognition system, by 62% relative to manual segmentation. We also observe significant performance improvements on unseen data.
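The channel-attribution idea in this abstract can be sketched minimally. The following is a hypothetical simplification (the paper's joint multi-participant models are far richer): a frame is attributed to a participant only when that participant's close-talk channel clearly dominates in energy, so crosstalk copies on other channels are suppressed. The function name, energies, and dominance ratio are invented for illustration.

```python
def segment_frames(channel_energies, dominance_ratio=2.0):
    """channel_energies: one per-frame energy list per close-talk channel.
    Returns, per frame, the index of the attributed speaker or None."""
    n_frames = len(channel_energies[0])
    labels = []
    for t in range(n_frames):
        frame = [ch[t] for ch in channel_energies]
        best = max(range(len(frame)), key=lambda i: frame[i])
        others = [e for i, e in enumerate(frame) if i != best]
        # Attribute speech only when the loudest channel clearly dominates;
        # otherwise treat the energy as crosstalk or silence.
        if frame[best] > dominance_ratio * max(others):
            labels.append(best)
        else:
            labels.append(None)
    return labels

# Speaker 0 talks in frames 0-1 (leaking weakly into channel 1);
# frame 2 has no dominant channel and is left unattributed.
print(segment_frames([[9.0, 8.0, 1.0], [2.0, 1.5, 1.2]]))  # → [0, 0, None]
```

A real system would smooth these per-frame decisions over time (e.g. with a joint HMM over all participants' speech/non-speech states), which is closer to what the paper describes.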
ASR error management for improving spoken language understanding
This paper addresses the problem of automatic speech recognition (ASR) error detection and the use of detected errors for improving spoken language understanding (SLU) systems. In this study, the SLU task consists in automatically extracting semantic concepts and concept/value pairs from ASR transcriptions in, for example, a tourist information system. An approach is proposed for enriching the set of semantic labels with error-specific labels and for using a recently proposed neural approach based on word embeddings to compute well-calibrated ASR confidence measures. Experimental results are reported showing that it is possible to significantly decrease the Concept/Value Error Rate with a state-of-the-art system, outperforming previously published results on the same experimental data. It is also shown that, by combining an SLU approach based on conditional random fields with a neural encoder/decoder attention-based architecture, it is possible to effectively identify confidence islands and uncertain semantic output segments that are useful for deciding on appropriate error-handling actions in the dialogue manager's strategy.
Comment: Interspeech 2017, Aug 2017, Stockholm, Sweden. 201
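The label-enrichment idea can be illustrated with a toy sketch (token names, labels, and the "+ERR" tag are invented here; the paper's scheme and data differ): tokens whose ASR hypothesis disagrees with the reference get an error-specific suffix on their concept label, so the SLU model can learn to distrust them.

```python
def enrich_labels(hyp_tokens, ref_tokens, concept_labels):
    """Append an error-specific tag to the concept label of each token
    that the ASR system got wrong (per a token-level alignment)."""
    enriched = []
    for hyp, ref, label in zip(hyp_tokens, ref_tokens, concept_labels):
        enriched.append(label if hyp == ref else label + "+ERR")
    return enriched

# Invented example: the ASR system mangles the city name.
hyp = ["book", "hotel", "in", "Parris"]
ref = ["book", "hotel", "in", "Paris"]
labels = ["command", "object", "O", "city"]
print(enrich_labels(hyp, ref, labels))  # → ['command', 'object', 'O', 'city+ERR']
```

At test time no reference is available, so a confidence measure (as in the paper) would replace the exact hyp/ref comparison.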
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
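Since the review names log-mel spectra as a dominant feature representation, here is a minimal NumPy sketch of computing them (frame sizes, mel count, and the triangular-filter construction are common defaults, not taken from the article; production code would typically use a library such as librosa):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(wave, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Frame the signal, window it, and take each frame's magnitude spectrum.
    frames = [wave[i:i + n_fft] for i in range(0, len(wave) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames) * np.hanning(n_fft), axis=1))
    # Build a triangular mel filterbank spanning 0 Hz .. Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log compression with a small floor for numerical stability.
    return np.log(spec @ fbank.T + 1e-10)   # shape: (n_frames, n_mels)

sr = 16000
t = np.arange(sr) / sr
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(feats.shape)  # → (97, 40)
```

The resulting (frames × mel bands) matrix is the kind of 2-D input the reviewed convolutional and recurrent models consume.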
Spoken Language Intent Detection using Confusion2Vec
Decoding a speaker's intent is a crucial part of spoken language understanding (SLU). The presence of noise or errors in text transcriptions in real-life scenarios makes the task more challenging. In this paper, we address spoken language intent detection under the noisy conditions imposed by automatic speech recognition (ASR) systems. We propose to employ the confusion2vec word feature representation to compensate for the errors made by ASR and to increase the robustness of the SLU system. Confusion2vec, motivated by human speech production and perception, models acoustic relationships between words in addition to the semantic and syntactic relations of words in human language. We hypothesize that ASR often makes errors between acoustically similar words, and that confusion2vec, with its inherent model of acoustic relationships between words, is able to compensate for these errors. Through experiments on the ATIS benchmark dataset, we demonstrate the robustness of the proposed model, which achieves state-of-the-art results under noisy ASR conditions. Our system reduces the classification error rate (CER) by 20.84% and improves robustness by 37.48% (lower CER degradation) relative to the previous state of the art when going from clean to noisy transcripts. Improvements are also demonstrated when training the intent detection models on noisy transcripts.
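The core intuition can be sketched with toy vectors (the embedding values and vocabulary below are invented; real confusion2vec embeddings are trained from ASR confusion networks): acoustically confusable words end up close in the vector space, so a classifier consuming these vectors is less sensitive to ASR substitutions.

```python
import math

# Invented toy embeddings: "flight's" is an acoustic near-neighbour of
# "flights", while "weather" is far away in the space.
EMB = {
    "flights":  [0.90, 0.10, 0.80],
    "flight's": [0.88, 0.12, 0.79],
    "weather":  [0.10, 0.90, 0.20],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(word):
    """Map a possibly mis-recognized word onto its closest vocabulary entry."""
    return max((w for w in EMB if w != word), key=lambda w: cosine(EMB[word], EMB[w]))

print(nearest("flight's"))  # → flights
```

Because the ASR error "flight's" sits next to "flights", downstream intent features barely change, which is the robustness effect the abstract reports.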
Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation
We present a probabilistic model that uses both prosodic and lexical cues for
the automatic segmentation of speech into topically coherent units. We propose
two methods for combining lexical and prosodic information using hidden Markov
models and decision trees. Lexical information is obtained from a speech
recognizer, and prosodic features are extracted automatically from speech
waveforms. We evaluate our approach on the Broadcast News corpus, using the
DARPA-TDT evaluation metrics. Results show that the prosodic model alone is
competitive with word-based segmentation methods. Furthermore, we achieve a
significant reduction in error by combining the prosodic and word-based
knowledge sources.
Comment: 27 pages, 8 figures
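One simple way to fuse two such knowledge sources is log-linear interpolation of their per-boundary probabilities. This is only a hedged sketch of the combination step (the paper's actual methods are HMMs and decision trees, and the scores below are invented):

```python
import math

def combine(p_prosody, p_lexical, weight=0.5):
    """Log-linear interpolation of topic-boundary probabilities from a
    prosodic model and a lexical model; weight balances the two sources."""
    log_p = weight * math.log(p_prosody) + (1 - weight) * math.log(p_lexical)
    return math.exp(log_p)

# A candidate boundary where prosody is confident but lexical cues are weak.
score = combine(0.9, 0.4)
print(round(score, 3))  # → 0.6
```

With equal weights this is the geometric mean of the two probabilities; tuning the weight on held-out data decides how much each knowledge source contributes.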
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technological side.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
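Translation-memory fuzzy matching, one of the topics listed, can be illustrated with a toy sketch (the TM entries are invented, and plain character-level similarity from `difflib` stands in for the improved metrics the project studied):

```python
from difflib import SequenceMatcher

# Invented translation-memory entries: source sentence -> stored translation.
TM = {
    "Close the window before leaving.": "Sluit het raam voor je vertrekt.",
    "Open the door.": "Open de deur.",
}

def best_fuzzy_match(source, threshold=0.7):
    """Return (tm_source, tm_target, score) for the best TM entry whose
    similarity to the input reaches the threshold, else None."""
    scored = [(s, t, SequenceMatcher(None, source, s).ratio())
              for s, t in TM.items()]
    s, t, score = max(scored, key=lambda x: x[2])
    return (s, t, score) if score >= threshold else None

match = best_fuzzy_match("Close the windows before leaving.")
print(match[0])  # → Close the window before leaving.
```

A professional CAT tool would show the retrieved translation alongside the fuzzy-match score so the translator can decide how much post-editing it needs.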
Access to recorded interviews: A research agenda
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state of the art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed.