A WOz Variant with Contrastive Conditions
We present a variant of the WOz paradigm we refer to as incremental ablation. The new feature involves incrementally restricting the human wizard's capacities in the direction of a dialog system. We lay out a data collection design with six conditions of user-system and user-wizard interactions that allows us to identify more precisely how to close the communication gap between humans and systems. We describe the application of the method to the analysis of contexts in which ASR errors occur, giving us a means to investigate the problem-solving strategies humans would resort to if their communication channel were restricted to be more like the machine's. We also describe how the methodology can be used to collect data that is more relevant to a particular learning paradigm involving Markov Decision Processes (MDPs).
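The MDP framing mentioned above can be made concrete with a toy sketch. The following is a minimal example, assuming nothing about the paper's actual state, action, or reward design (all names below are invented for illustration), that casts a slot-filling dialog as a tabular Q-learning problem:

```python
import random

# Hypothetical toy dialog MDP: states are slot-filling stages, actions are
# system dialog acts. None of these names come from the paper; they only
# illustrate the learning paradigm the abstract refers to.
STATES = ["no_info", "slot_filled", "confirmed", "done"]
ACTIONS = ["ask_slot", "confirm", "close"]

def step(state, action):
    """Simulated environment: returns (next_state, reward)."""
    if state == "no_info" and action == "ask_slot":
        return ("slot_filled", -1)   # small per-turn penalty
    if state == "slot_filled" and action == "confirm":
        return ("confirmed", -1)
    if state == "confirmed" and action == "close":
        return ("done", 20)          # task success reward
    return (state, -3)               # unhelpful action, stay put

# Tabular Q-learning over the toy MDP.
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.95, 0.2
for _ in range(2000):
    s = "no_info"
    while s != "done":
        a = (random.choice(ACTIONS) if random.random() < eps
             else max(ACTIONS, key=lambda act: Q[(s, act)]))
        s2, r = step(s, a)
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
```

Data collected under the paper's more machine-like wizard conditions would plausibly be closer to the state/action abstractions such a learner actually sees.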
Exploring miscommunication and collaborative behaviour in human-robot interaction
This paper presents the first step in designing a speech-enabled robot that is capable of natural management of miscommunication. It describes the methods and results of two WOz studies in which dyads of naïve participants interacted in a collaborative task. The first WOz study explored human miscommunication management. The second study investigated how shared visual space and monitoring shape the processes of feedback and communication in task-oriented interactions. The results provide insights for the development of human-inspired and robust natural language interfaces in robots.
Investigating the Effects of Word Substitution Errors on Sentence Embeddings
A key initial step in several natural language processing (NLP) tasks involves embedding phrases of text as vectors of real numbers that preserve semantic meaning. To that end, several methods have recently been proposed, with impressive results on semantic similarity tasks. However, all of these approaches assume that perfect transcripts are available when generating the embeddings. While this is a reasonable assumption for the analysis of written text, it is limiting for the analysis of transcribed text. In this paper we investigate the effects of word substitution errors, such as those arising from automatic speech recognition (ASR), on several state-of-the-art sentence embedding methods. To do this, we propose a new simulator that allows the experimenter to induce ASR-plausible word substitution errors in a corpus at a desired word error rate. We use this simulator to evaluate the robustness of several sentence embedding methods. Our results show that pre-trained neural sentence encoders are both robust to ASR errors and perform well on textual similarity tasks after errors are introduced. Meanwhile, unweighted averages of word vectors perform well with perfect transcriptions, but their performance degrades rapidly on textual similarity tasks for text with word substitution errors.
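The simulator is described only at a high level here, but its core mechanism, substituting words with acoustically confusable alternatives until a target word error rate is reached, can be sketched. Below is a minimal sketch assuming a hand-written confusion table; a real simulator would derive confusable pairs from actual ASR output or phonetic similarity:

```python
import random

# Toy stand-in for the paper's simulator. CONFUSIONS is a hypothetical
# table of acoustically confusable words, not the paper's actual data.
CONFUSIONS = {
    "there": ["their", "they're"],
    "two": ["to", "too"],
    "meet": ["meat"],
    "whether": ["weather"],
}

def corrupt(sentence, target_wer, rng=random.Random(0)):
    """Induce word substitution errors at (approximately) target_wer."""
    words = sentence.split()
    n_errors = round(target_wer * len(words))
    # Only words with a known confusable alternative can be substituted.
    candidates = [i for i, w in enumerate(words) if w.lower() in CONFUSIONS]
    for i in rng.sample(candidates, min(n_errors, len(candidates))):
        words[i] = rng.choice(CONFUSIONS[words[i].lower()])
    return " ".join(words)

print(corrupt("we meet there at two", target_wer=0.5))
```

Feeding the corrupted and clean versions of a corpus through the same embedding method, then comparing similarity scores, reproduces the shape of the robustness evaluation the abstract describes.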
Improving generalisation to new speakers in spoken dialogue state tracking
Users with disabilities can greatly benefit from personalised voice-enabled environmental-control interfaces, but for users with speech impairments (e.g. dysarthria), poor ASR performance poses a challenge to successful dialogue. Statistical dialogue management has shown resilience against high ASR error rates, making it useful for improving the performance of these interfaces. However, little research has so far been devoted to personalising dialogue management to specific users. Recently, data-driven discriminative models have been shown to yield the best performance in dialogue state tracking (the inference of the user goal from the dialogue history). However, due to the unique characteristics of each speaker, training a system for a new user when user-specific data is not available can be challenging because of the mismatch between training and working conditions. This work investigates two methods to improve the performance of an LSTM-based personalised state tracker with new speakers: the use of speaker-specific acoustic and ASR-related features, and dropout regularisation. It is shown that in an environmental control system for dysarthric speakers, the combination of both techniques yields improvements of 3.5% absolute in state tracking accuracy. Further analysis explores the effect of using different amounts of speaker-specific data to train the tracking system.
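A rough sketch of such a tracker follows, assuming PyTorch and invented dimensions: the speaker-specific features are simply concatenated to every turn's input, and dropout is applied before the output layer. The paper's exact architecture and feature-injection point may differ.

```python
import torch
import torch.nn as nn

class PersonalisedTracker(nn.Module):
    """Sketch of an LSTM dialogue state tracker with speaker features.
    Dimensions and the injection scheme are assumptions for illustration."""
    def __init__(self, turn_dim=50, spk_dim=8, hidden=64,
                 n_values=10, p_drop=0.5):
        super().__init__()
        self.lstm = nn.LSTM(turn_dim + spk_dim, hidden, batch_first=True)
        self.drop = nn.Dropout(p_drop)           # dropout regularisation
        self.out = nn.Linear(hidden, n_values)   # logits over goal values

    def forward(self, turns, spk_feats):
        # turns: (batch, n_turns, turn_dim); spk_feats: (batch, spk_dim)
        # Repeat the static speaker features across every dialogue turn.
        spk = spk_feats.unsqueeze(1).expand(-1, turns.size(1), -1)
        h, _ = self.lstm(torch.cat([turns, spk], dim=-1))
        return self.out(self.drop(h))            # per-turn state logits

tracker = PersonalisedTracker()
logits = tracker(torch.randn(4, 6, 50), torch.randn(4, 8))
```

Concatenating static speaker features is the simplest injection point; it lets one model share turn-level parameters across speakers while still conditioning on who is talking.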
VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition
This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on two independently developed recent technologies: array-geometry-agnostic continuous speech separation, or VarArray, and streaming multi-talker ASR based on token-level serialized output training (t-SOT). To combine the best of both technologies, we design a new t-SOT-based ASR model that generates a serialized multi-talker transcription from the two separated speech signals produced by VarArray. We also propose a pre-training scheme for such an ASR model in which we simulate VarArray's output signals from monaural single-talker ASR training data. Conversation transcription experiments using the AMI meeting corpus show that a system based on the proposed framework significantly outperforms conventional ones. Our system achieves state-of-the-art word error rates of 13.7% and 15.5% on the AMI development and evaluation sets, respectively, in the multiple-distant-microphone setting while retaining streaming inference capability.
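The t-SOT idea underpinning this framework can be illustrated in isolation: words from overlapping speakers are flattened into a single token stream in time order, with a special channel-change token inserted whenever the active virtual channel switches. A simplified sketch follows; the token timing format and the <cc> symbol here are illustrative:

```python
# Sketch of t-SOT-style serialization: words from (up to) two overlapping
# speakers are sorted by start time and flattened into one stream, with a
# channel-change token <cc> emitted whenever the active channel switches.
CC = "<cc>"

def serialize(words):
    """words: list of (start_time, channel, token) tuples."""
    stream, prev_channel = [], None
    for _, channel, token in sorted(words):
        if prev_channel is not None and channel != prev_channel:
            stream.append(CC)
        stream.append(token)
        prev_channel = channel
    return stream

words = [(0.0, 0, "hello"), (0.4, 1, "hi"), (0.8, 0, "how"),
         (1.1, 0, "are"), (1.2, 1, "good"), (1.4, 0, "you")]
print(" ".join(serialize(words)))
# hello <cc> hi <cc> how are <cc> good <cc> you
```

Keeping only one extra symbol in the vocabulary is what makes this representation compatible with ordinary streaming ASR decoders, which is why it pairs naturally with a two-channel separator like VarArray.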
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
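As a bare-bones illustration of the ASR-plus-IR combination the survey covers, ASR 1-best transcripts can be indexed and ranked with a standard text IR model. The transcripts, query, and scoring below are simplified stand-ins for the component technologies the survey actually reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ASR 1-best transcripts of three spoken documents.
transcripts = [
    "the committee discussed the annual budget and travel costs",
    "a lecture on automatic speech recognition and language models",
    "interview about music festivals and outdoor concerts",
]

# Classic text-IR treatment of the transcripts: TF-IDF index + cosine ranking.
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(transcripts)

query = vectorizer.transform(["speech recognition lecture"])
scores = cosine_similarity(query, index).ravel()
ranking = scores.argsort()[::-1]
print([(i, round(float(scores[i]), 2)) for i in ranking])
```

Real SCR systems go well beyond this sketch, e.g. by retrieving over lattices or confusion networks rather than a single error-prone 1-best transcript.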
Personalised Dialogue Management for Users with Speech Disorders
Many electronic devices are beginning to include Voice User Interfaces (VUIs) as an alternative to conventional interfaces. VUIs are especially useful for users with restricted upper limb mobility, who cannot use keyboards and mice. These users, however, often suffer from speech disorders (e.g. dysarthria), making Automatic Speech Recognition (ASR) challenging and thus degrading the performance of the VUI. Partially Observable Markov Decision Process (POMDP) based Dialogue Management (DM) has been shown to improve interaction performance in challenging ASR environments, but most of the research in this area has focused on Spoken Dialogue Systems (SDSs) developed to provide information, where users interact with the system only a few times. In contrast, most VUIs are likely to be used by a single speaker over a long period of time, yet very little research has been carried out on adapting DM models to specific speakers.
This thesis explores methods to adapt DM models (in particular, dialogue state tracking models and policy models) to a specific user over a longitudinal interaction. The main differences between personalised VUIs and typical SDSs are identified and studied. Then, state-of-the-art DM models are modified for scenarios unique to long-term personalised VUIs, such as personalised models initialised with data from different speakers, or scenarios where the dialogue environment (e.g. the ASR) changes over time. In addition, several speaker- and environment-related features are shown to be useful for improving interaction performance. This study is done in the context of homeService, a VUI developed to help users with dysarthria control their home devices. The study shows that personalisation of the POMDP-DM framework can greatly improve the performance of these interfaces.
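The POMDP-DM machinery referred to here can be illustrated with its central operation, the Bayesian belief update over user goals given noisy ASR observations. The goal set and likelihood numbers below are invented for illustration, not taken from homeService:

```python
# Sketch of the POMDP belief update at the heart of statistical DM:
# b'(g) is proportional to P(observation | g) * b(g), over a hypothetical
# set of device-control goals. The likelihoods stand in for ASR confidences.
GOALS = ["tv_on", "tv_off", "light_on"]
belief = {g: 1.0 / len(GOALS) for g in GOALS}   # uniform prior

def update(belief, obs_likelihood):
    """Bayesian belief update given P(ASR observation | goal)."""
    posterior = {g: obs_likelihood[g] * b for g, b in belief.items()}
    z = sum(posterior.values())
    return {g: p / z for g, p in posterior.items()}

# A noisy utterance that sounds mostly like "TV on", slightly like "TV off".
belief = update(belief, {"tv_on": 0.7, "tv_off": 0.2, "light_on": 0.1})
print(max(belief, key=belief.get), belief)
```

Because the update never commits to a single ASR hypothesis, the dialogue manager can stay robust at the high error rates dysarthric speech produces, and personalisation amounts to adapting these models to one speaker's observation statistics.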