Robust Spoken Language Understanding for House Service Robots
Service robotics has grown significantly in recent years, leading to several research results and to a number of consumer products. One of the essential features of these robotic platforms is the ability to interact with users through natural language. Spoken commands can be processed by a Spoken Language Understanding chain in order to obtain the desired behavior of the robot. The entry point of such a process is an Automatic Speech Recognition (ASR) module, which provides a list of transcriptions for a given spoken utterance. Although several well-performing ASR engines are available off-the-shelf, they operate in a general-purpose setting. Hence, they may not be well suited to recognizing utterances given to robots in specific domains. In this work, we propose a practical yet robust strategy to re-rank lists of transcriptions. This approach improves the quality of ASR systems in situated scenarios, i.e., the transcription of robotic commands. The proposed method relies upon evidence derived from a semantic grammar with semantic actions, designed to model typical commands expressed in scenarios that are specific to human service robotics. The outcomes obtained through an experimental evaluation show that the approach is able to effectively outperform the ASR baseline, obtained by selecting the first transcription suggested by the AS
Automatic Quality Estimation for ASR System Combination
Recognizer Output Voting Error Reduction (ROVER) has been widely used for
system combination in automatic speech recognition (ASR). In order to select
the most appropriate words to insert at each position in the output
transcriptions, some ROVER extensions rely on critical information such as
confidence scores and other ASR decoder features. This information, which is
not always available, highly depends on the decoding process and sometimes
tends to overestimate the real quality of the recognized words. In this paper,
we propose a novel variant of ROVER that takes advantage of ASR quality
estimation (QE) for ranking the transcriptions at "segment level" instead of:
i) relying on confidence scores, or ii) feeding ROVER with randomly ordered
hypotheses. We first introduce an effective set of features to compensate for
the absence of ASR decoder information. Then, we apply QE techniques to perform
accurate hypothesis ranking at segment-level before starting the fusion
process. The evaluation is carried out on two different tasks, in which we
respectively combine hypotheses coming from independent ASR systems and
multi-microphone recordings. In both tasks, it is assumed that the ASR decoder
information is not available. The proposed approach significantly outperforms
standard ROVER and it is competitive with two strong oracles that exploit
prior knowledge about the real quality of the hypotheses to be combined.
Compared to standard ROVER, the absolute WER improvements in the two
evaluation scenarios range from 0.5% to 7.3%.
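
The segment-level ranking step described in this abstract can be illustrated with a minimal sketch. It assumes that a quality-estimation model (not shown, and hypothetical here) has already produced one predicted WER per hypothesis; the function names `word_error_rate` and `rank_by_predicted_wer` are illustrative and do not come from the paper. The sketch only orders the hypotheses before they would be passed to a ROVER-style fusion and does not implement the fusion itself.

```python
from typing import List, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def rank_by_predicted_wer(
    hypotheses: Sequence[str], predicted_wer: Sequence[float]
) -> List[str]:
    """Order the hypotheses for one segment from best to worst predicted WER.

    predicted_wer is assumed to come from an external QE model (hypothetical).
    """
    if len(hypotheses) != len(predicted_wer):
        raise ValueError("one predicted WER is required per hypothesis")
    order = sorted(range(len(hypotheses)), key=lambda k: predicted_wer[k])
    return [hypotheses[k] for k in order]


if __name__ == "__main__":
    segment_hyps = [
        "take the mug to kitchen",
        "take the mug to the kitchen",
        "take a mug to the chicken",
    ]
    qe_scores = [0.20, 0.05, 0.40]  # made-up values for illustration only
    for hyp in rank_by_predicted_wer(segment_hyps, qe_scores):
        print(hyp)
```

When reference transcripts are available, `word_error_rate` can be used to check how well the predicted ordering matches the true one.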
DNN-Based Semantic Model for Rescoring N-best Speech Recognition List
The word error rate (WER) of an automatic speech recognition (ASR) system
increases when a mismatch occurs between the training and the testing
conditions, for example because of noise. In this case, the acoustic information can be
less reliable. This work aims to improve ASR by modeling long-term semantic
relations to compensate for distorted acoustic features. We propose to perform
this through rescoring of the ASR N-best hypotheses list. To achieve this, we
train a deep neural network (DNN). Our DNN rescoring model is aimed at
selecting hypotheses that have better semantic consistency and therefore lower
WER. We investigate two types of representations as part of input features to
our DNN model: static word embeddings (from word2vec) and dynamic contextual
embeddings (from BERT). Acoustic and linguistic features are also included. We
perform experiments on the publicly available dataset TED-LIUM mixed with real
noise. The proposed rescoring approaches significantly improve the
WER over the ASR system without rescoring models in two noisy conditions and
with n-gram and RNNLM