Automatic Speech Recognition Errors Detection Using Supervised Learning Techniques
Over the last years, many advances have been made in the field of Automatic Speech Recognition (ASR). However, the persistent presence of ASR errors limits the widespread adoption of speech technology in real-life applications. This motivates attempts to find alternative techniques to automatically detect and correct ASR errors, which can be very effective, especially when the user does not have access to tune the features, the models, or the decoder of the ASR system, or when the transcription serves as input to downstream systems such as machine translation, information retrieval, and question answering. In this paper, we present an ASR error detection system targeted towards substitution and insertion errors. The proposed system is based on supervised learning techniques and uses input features derived only from the ASR output words, and hence should be usable with any ASR system. Applying this system to TV program transcription data identifies 40.30% of the recognition errors generated by the ASR system.
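A supervised detector of this kind needs a per-word label for each ASR output word. The sketch below shows one common way such labels could be produced, by aligning the hypothesis to a reference transcript with word-level edit distance. It is an illustration only: the function name `label_hypothesis_words` and the label strings are hypothetical, and the abstract does not say how the authors built their training labels or which features they used.

```python
def label_hypothesis_words(reference, hypothesis):
    """Label each hypothesis word as 'correct', 'substitution', or 'insertion'.

    Hypothetical helper: labels come from a standard Levenshtein alignment
    between whitespace-tokenized reference and hypothesis strings.
    """
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)

    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (reference word missing)
                dist[i][j - 1] + 1,         # insertion (extra hypothesis word)
            )

    # Backtrace from the bottom-right corner to recover one optimal alignment.
    labels = [None] * m
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                labels[j - 1] = "correct" if cost == 0 else "substitution"
                i, j = i - 1, j - 1
                continue
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            labels[j - 1] = "insertion"
            j -= 1
        else:
            # Deletion: no hypothesis word exists to carry a label.
            i -= 1
    return list(zip(hyp, labels))


if __name__ == "__main__":
    for word, label in label_hypothesis_words(
        "the cat sat on the mat", "the cat sat on a the mat"
    ):
        print(word, label)
```

Deletions leave no word in the ASR output to attach a label to, which fits the abstract's focus on substitution and insertion errors.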
Fast Keyword Spotting in Telephone Speech
In the paper, we present a system designed for detecting keywords in telephone speech. We focus not only on achieving high accuracy but also on very short processing time. The keyword spotting system can run in three modes: a) an off-line mode requiring less than 0.1xRT, b) an on-line mode with minimum (2 s) latency, and c) a repeated spotting mode, in which pre-computed values allow for additional acceleration. Its performance is evaluated on recordings of Czech spontaneous telephone speech using rather large and complex keyword lists
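The "0.1xRT" figure is a real-time factor: processing time divided by the duration of the audio processed. The following minimal sketch shows the arithmetic; the function name `real_time_factor` and the sample durations are illustrative assumptions, not values from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (the real-time factor)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical example: a one-hour recording processed in five minutes.
    rtf = real_time_factor(processing_seconds=300.0, audio_seconds=3600.0)
    print(f"RTF = {rtf:.3f}", "(below 0.1xRT)" if rtf < 0.1 else "(not below 0.1xRT)")
```

A value below 1.0 means the system runs faster than real time; below 0.1 means an hour of audio takes under six minutes to process.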
uC: Ubiquitous Collaboration Platform for Multimodal Team Interaction Support
A human-centered computing platform that improves teamwork and transforms the “human-computer interaction experience” for distributed teams is presented. This Ubiquitous Collaboration, or uC (“you see”), platform's objective is to transform distributed teamwork (i.e., work occurring when teams of workers and learners are geographically dispersed and often interacting at different times). It achieves this goal through a multimodal team interaction interface realized through a reconfigurable open architecture. The approach taken is to integrate: (1) an intuitive speech- and video-centric multi-modal interface to augment more conventional methods (e.g., mouse, stylus and touch), (2) an open and reconfigurable architecture supporting information gathering, and (3) a machine intelligent approach to analysis and management of heterogeneous live and stored sensor data to support collaboration. The system will transform how teams of people interact with computers by drawing on both the virtual and physical environment
Nonparametric Bayesian Double Articulation Analyzer for Direct Language Acquisition from Continuous Speech Signals
Human infants can discover words directly from unsegmented speech signals
without any explicitly labeled data. In this paper, we develop a novel machine
learning method called nonparametric Bayesian double articulation analyzer
(NPB-DAA) that can directly acquire language and acoustic models from observed
continuous speech signals. For this purpose, we propose an integrative
generative model that combines a language model and an acoustic model into a
single generative model called the "hierarchical Dirichlet process hidden
language model" (HDP-HLM). The HDP-HLM is obtained by extending the
hierarchical Dirichlet process hidden semi-Markov model (HDP-HSMM) proposed by
Johnson et al. An inference procedure for the HDP-HLM is derived using the
blocked Gibbs sampler originally proposed for the HDP-HSMM. This procedure
enables the simultaneous and direct inference of language and acoustic models
from continuous speech signals. Based on the HDP-HLM and its inference
procedure, we developed a novel double articulation analyzer. By assuming
HDP-HLM as a generative model of observed time series data, and by inferring
latent variables of the model, the method can analyze latent double
articulation structure, i.e., hierarchically organized latent words and
phonemes, of the data in an unsupervised manner. The novel unsupervised double
articulation analyzer is called NPB-DAA.
The NPB-DAA can automatically estimate double articulation structure embedded
in speech signals. We also carried out two evaluation experiments using
synthetic data and actual human continuous speech signals representing Japanese
vowel sequences. In the word acquisition and phoneme categorization tasks, the
NPB-DAA outperformed a conventional double articulation analyzer (DAA) and
a baseline automatic speech recognition system whose acoustic model was trained
in a supervised manner.
Comment: 15 pages, 7 figures, Draft submitted to IEEE Transactions on
Autonomous Mental Development (TAMD)
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technological side.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project
ASR Error Detection via Audio-Transcript entailment
Despite improved performances of the latest Automatic Speech Recognition
(ASR) systems, transcription errors are still unavoidable. These errors can
have a considerable impact in critical domains such as healthcare, when used to
help with clinical documentation. Therefore, detecting ASR errors is a critical
first step in preventing further error propagation to downstream applications.
To this end, we propose a novel end-to-end approach for ASR error detection
using audio-transcript entailment. To the best of our knowledge, we are the
first to frame this problem as an end-to-end entailment task between the audio
segment and its corresponding transcript segment. Our intuition is that there
should be a bidirectional entailment between audio and transcript when there is
no recognition error and vice versa. The proposed model utilizes an acoustic
encoder and a linguistic encoder to model the speech and transcript
respectively. The encoded representations of both modalities are fused to
predict the entailment. Since doctor-patient conversations are used in our
experiments, a particular emphasis is placed on medical terms. Our proposed
model achieves classification error rates (CER) of 26.2% on all transcription
errors and 23% on medical errors specifically, leading to improvements upon a
strong baseline by 12% and 15.4%, respectively.
Comment: Accepted to Interspeech 202
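The two-encoder design can be pictured with a toy sketch: one vector per modality, fused and passed to a binary classifier that scores whether the audio and the transcript agree. Everything concrete here is an assumption made for illustration. The abstract does not specify the fusion method or the classifier, so concatenation and a single logistic layer are stand-ins, and `fuse_and_score`, the vectors, and the weights are hypothetical.

```python
import math


def fuse_and_score(audio_vec, text_vec, weights, bias):
    """Return a probability-like score that audio and transcript agree.

    Assumptions: fusion is plain concatenation and the classifier is a
    single logistic layer; neither is confirmed by the abstract.
    """
    fused = list(audio_vec) + list(text_vec)  # concatenation fusion (assumed)
    if len(weights) != len(fused):
        raise ValueError("weights must match the fused vector length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Made-up encoder outputs and weights, purely to show the data flow.
    score = fuse_and_score([0.2, -0.1], [0.4, 0.3], [1.0, 1.0, 1.0, 1.0], 0.0)
    print("entailment score:", round(score, 3))
```

In this framing a low score would flag the segment as likely containing a recognition error.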