13 research outputs found
Generative probabilistic models of goal-directed users in task-oriented dialogs
A longstanding objective of human-computer interaction research is to develop
better dialog systems for end users. User modelling research, specifically, aims to provide
dialog researchers with models of user behaviour to aid the design and improvement
of dialog systems. Where dialog systems are commercially deployed, they are often
used by vast numbers of users, and sub-optimal performance can lead to immediate
financial loss for the service provider, and even user alienation. Thus, there is a strong
incentive to make dialog systems as functional as possible immediately, and crucially
prior to their release to the public. Models of user behaviour fill this gap by simulating
the role of human users in the lab, without
the losses associated with sub-optimal system performance. User models can also
tremendously aid design decisions, by serving as tools for exploratory analysis of real
user behaviour, prior to designing dialog software.
User modelling is the central problem of this thesis. We focus on a particular
kind of dialog, termed task-oriented dialogs (those centred on solving an explicit
task), because they represent the frontier of current dialog research and commercial
deployment. Users taking part in these dialogs behave according to a set of user goals,
which specify what they wish to accomplish from the interaction, and tend to exhibit
variability of behaviour given the same set of goals. Our objective is to capture and
reproduce (at the semantic utterance level) the range of behaviour that users exhibit
while being consistent with their goals.
We approach the problem as an instance of generative probabilistic modelling, with
explicit user goals, and induced entirely from data. We argue that doing so has numerous
practical and theoretical benefits over previous approaches to user modelling
which have either lacked a model of user goals, or have not been driven by real
dialog data. A principal problem with user modelling development thus far has been
the difficulty in evaluation. We demonstrate how treating user models as probabilistic
models alleviates some of these problems through the ability to leverage a whole raft
of techniques and insights from machine learning for evaluation.
We demonstrate the efficacy of our approach by applying it to two different kinds of
task-oriented dialog domains, which exhibit two different sub-problems encountered
in real dialog corpora. The first are informational (or slot-filling) domains, specifically
those concerning flight and bus route information. In slot-filling domains, user goals
take categorical values which allow multiple surface realisations, and are corrupted by
speech recognition errors. We address this issue by adopting a topic model representation
of user goals which allows us to capture both synonymy and phonetic confusability
in a unified model. We first evaluate our model intrinsically using held-out probability
and perplexity, and demonstrate substantial gains over an alternative string-goal
representation, and over a non-goal-directed model. We then show in an extrinsic
evaluation that features derived from our model lead to substantial improvements over
a strong baseline in the task of discriminating between real dialogs (consistent dialogs)
and dialogs comprised of real turns sampled from different dialogs (inconsistent dialogs).
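For concreteness, held-out probability and perplexity can be computed as in the minimal Python sketch below; the `model.log_prob` call is a hypothetical stand-in for however the induced model scores a dialog, not the evaluation code used in the thesis.

```python
import math

def heldout_perplexity(model, heldout_dialogs):
    """Per-turn perplexity of a generative user model on held-out dialogs:
    exp of the negative average log-probability. Lower means a better fit."""
    total_logprob = 0.0
    total_turns = 0
    for dialog in heldout_dialogs:
        total_logprob += model.log_prob(dialog)  # hypothetical scoring interface
        total_turns += len(dialog)
    return math.exp(-total_logprob / total_turns)
```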
We then move on to a spatial navigational domain in which user goals are spatial
trajectories across a landscape. The disparity between the representation of spatial
routes as raw pixel coordinates and their grounding as semantic utterances creates
an interesting challenge compared to conventional slot-filling domains. We derive a
feature-based representation of spatial goals which facilitates reasoning and admits
generalisation to new routes not encountered at training time. The probabilistic formulation
of our model allows us to capture variability of behaviour given the same
underlying goal, a property frequently exhibited by human users in the domain. We
first evaluate intrinsically using held-out probability and perplexity, and find a substantial
reduction in uncertainty brought by our spatial representation. We further evaluate
extrinsically in a human judgement task and find that our model’s behaviour does not
differ significantly from the behaviour of real users.
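Purely as an illustration of a feature-based spatial representation (the thesis's actual feature set is not reproduced here), a route given as raw pixel coordinates can be re-described by relative steps, which removes dependence on where the route lies on the map and so transfers to unseen routes:

```python
import math

def route_features(points):
    """Map a route of (x, y) pixel coordinates to step-level features
    (heading and length of each segment), independent of absolute position."""
    feats = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        feats.append({"heading": math.atan2(dy, dx),    # direction of travel
                      "distance": math.hypot(dx, dy)})  # length of the step
    return feats

print(route_features([(0, 0), (0, 10), (5, 10)]))  # a short L-shaped toy route
```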
We conclude by sketching two novel ideas for future work: the first is to deploy
the user models as transition functions for MDP-based dialog managers; the second is
to use the models as a means of restricting the search space for optimal policies, by
treating optimal behaviour as a subset of the (distributions over) plausible behaviour
which we have induced.
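A rough sketch of the first idea is given below; `sample_goal`, `sample_turn`, and `act` are hypothetical interfaces standing in for whatever the induced user model and the dialog manager actually expose.

```python
def simulate_episode(policy, user_model, max_turns=20):
    """Roll out one simulated dialog: the manager chooses system acts and the
    user model, acting as the MDP's transition function, samples the next
    user act conditioned on its sampled goal and the dialog history."""
    history = []
    goal = user_model.sample_goal()                     # hypothetical interface
    for _ in range(max_turns):
        sys_act = policy.act(history)                   # hypothetical interface
        user_act = user_model.sample_turn(goal, history, sys_act)
        history.extend([sys_act, user_act])
        if user_act == "hangup":                        # illustrative end condition
            break
    return history
```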
Generative Goal-driven User Simulation for Dialog Management
User simulation is frequently used to train statistical dialog managers for task-oriented domains. At present, goal-driven simulators (those that have a persistent notion of what they wish to achieve in the dialog) require some task-specific engineering, making them impossible to evaluate intrinsically. Instead, they have been evaluated extrinsically by means of the dialog managers they are intended to train, leading to circularity of argument. In this paper, we propose the first goal-driven simulator that is fully generative and induced entirely from data, without hand-crafting or goal annotation. Our goals are latent, and take the form of topics in a topic model, clustering together semantically equivalent and phonetically confusable strings, thereby implicitly modelling synonymy and speech recognition noise. We evaluate on two standard dialog resources, the Communicator and Let’s Go datasets, and demonstrate that our model has a substantially better fit to held-out data than competing approaches. We also show that features derived from our model allow significantly greater improvement over a baseline at distinguishing real from randomly permuted dialogs.
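To make the discrimination task concrete, "randomly permuted" negatives can be built by recombining turns across real dialogs, as in the following sketch (an illustration, not necessarily the paper's exact protocol):

```python
import random

def permuted_dialog(dialogs, rng):
    """Build an 'inconsistent' dialog by drawing each turn from a randomly
    chosen real dialog, keeping a realistic overall length."""
    length = len(rng.choice(dialogs))
    return [rng.choice(rng.choice(dialogs)) for _ in range(length)]

rng = random.Random(0)
real = [["hello", "a flight to Boston", "on Tuesday"],
        ["hi", "the 61C bus", "the next one please"]]  # toy semantic turns
print(permuted_dialog(real, rng))
```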
Speaker-Independent Classification of Phonetic Segments from Raw Ultrasound in Child Speech
Ultrasound tongue imaging (UTI) provides a convenient way to visualize the
vocal tract during speech production. UTI is increasingly being used for speech
therapy, making it important to develop automatic methods to assist various
time-consuming manual tasks currently performed by speech therapists. A key
challenge is to generalize the automatic processing of ultrasound tongue images
to previously unseen speakers. In this work, we investigate the classification
of phonetic segments (tongue shapes) from raw ultrasound recordings under
several training scenarios: speaker-dependent, multi-speaker,
speaker-independent, and speaker-adapted. We observe that models underperform
when applied to data from speakers not seen at training time. However, when
provided with minimal additional speaker information, such as the mean
ultrasound frame, the models generalize better to unseen speakers.
Comment: 5 pages, 4 figures, published in ICASSP 2019 (IEEE International Conference on Acoustics, Speech and Signal Processing, 2019).
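One simple way to supply the mean-frame speaker information, sketched here under assumed array shapes rather than as the paper's architecture, is to attach each speaker's mean frame to every raw frame as an extra input channel:

```python
import numpy as np

def add_speaker_mean(frames):
    """Given one speaker's raw ultrasound frames (num_frames, H, W), return
    inputs of shape (num_frames, 2, H, W) where the second channel is that
    speaker's mean frame, letting a classifier condition on the speaker."""
    mean_frame = frames.mean(axis=0, keepdims=True)            # (1, H, W)
    mean_stack = np.repeat(mean_frame, len(frames), axis=0)    # (N, H, W)
    return np.stack([frames, mean_stack], axis=1)              # (N, 2, H, W)

frames = np.random.rand(100, 63, 412)   # toy stand-in for raw ultrasound
print(add_speaker_mean(frames).shape)   # (100, 2, 63, 412)
```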
Silent versus modal multi-speaker speech recognition from ultrasound and video
We investigate multi-speaker speech recognition from ultrasound images of the
tongue and video images of the lips. We train our systems on imaging data from
modal speech, and evaluate on matched test sets of two speaking modes: silent
and modal speech. We observe that silent speech recognition from imaging data
underperforms compared to modal speech recognition, likely due to a
speaking-mode mismatch between training and testing. We improve silent speech
recognition performance using techniques that address the domain mismatch, such
as fMLLR and unsupervised model adaptation. We also analyse the properties of
silent and modal speech in terms of utterance duration and the size of the
articulatory space. To estimate the articulatory space, we compute the convex
hull of tongue splines, extracted from ultrasound tongue images. Overall, we
observe that the duration of silent speech is longer than that of modal speech,
and that silent speech covers a smaller articulatory space than modal speech.
Although these two properties are statistically significant across speaking
modes, they do not directly correlate with word error rates from speech
recognition.
Comment: 5 pages, 5 figures, submitted to Interspeech 2021.
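The articulatory-space measure can be illustrated with SciPy's convex hull over toy spline points (a sketch, not the paper's pipeline); for 2-D inputs, `ConvexHull.volume` gives the enclosed area:

```python
import numpy as np
from scipy.spatial import ConvexHull

def articulatory_space_area(spline_points):
    """Area of the convex hull of pooled 2-D tongue-spline points, a rough
    proxy for the size of the articulatory space covered by a speaker."""
    hull = ConvexHull(spline_points)
    return hull.volume  # for 2-D point sets, .volume is the area

points = np.random.rand(500, 2)   # toy (x, y) spline samples
print(articulatory_space_area(points))
```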
Ultrasound tongue imaging for diarization and alignment of child speech therapy sessions
We investigate the automatic processing of child speech therapy sessions
using ultrasound visual biofeedback, with a specific focus on complementing
acoustic features with ultrasound images of the tongue for the tasks of speaker
diarization and time-alignment of target words. For speaker diarization, we
propose an ultrasound-based time-domain signal which we call estimated tongue
activity. For word-alignment, we augment an acoustic model with low-dimensional
representations of ultrasound images of the tongue, learned by a convolutional
neural network. We conduct our experiments using the Ultrasuite repository of
ultrasound and speech recordings for child speech therapy sessions. For both
tasks, we observe that systems augmented with ultrasound data outperform
corresponding systems using only the audio signal.
Comment: 5 pages, 3 figures, accepted for publication at Interspeech 2019.
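As an illustration only (the paper's exact definition may differ), an estimated tongue activity signal can be approximated by how much consecutive ultrasound frames change over time:

```python
import numpy as np

def estimated_tongue_activity(frames):
    """Frame-to-frame mean absolute difference of ultrasound frames
    (num_frames, H, W): a 1-D signal that is large while the tongue is
    moving and small while it is still."""
    diffs = np.abs(np.diff(frames, axis=0))   # (N-1, H, W)
    return diffs.mean(axis=(1, 2))            # (N-1,)

frames = np.random.rand(200, 63, 412)             # toy ultrasound sequence
print(estimated_tongue_activity(frames).shape)    # (199,)
```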
Synchronising audio and ultrasound by learning cross-modal embeddings
Audiovisual synchronisation is the task of determining the time offset
between speech audio and a video recording of the articulators. In child speech
therapy, audio and ultrasound videos of the tongue are captured using
instruments which rely on hardware to synchronise the two modalities at
recording time. Hardware synchronisation can fail in practice, and no mechanism
exists to synchronise the signals post hoc. To address this problem, we employ
a two-stream neural network which exploits the correlation between the two
modalities to find the offset. We train our model on recordings from 69
speakers, and show that it correctly synchronises 82.9% of test utterances from
unseen therapy sessions and unseen speakers, thus considerably reducing the
number of utterances to be manually synchronised. An analysis of model
performance on the test utterances shows that directed phone articulations are
more difficult to automatically synchronise compared to utterances containing
natural variation in speech such as words, sentences, or conversations.
Comment: 5 pages, 1 figure, 4 tables; Interspeech 2019 with the following edits: 1) Loss and accuracy upon convergence were accidentally reported from an older model; now updated with the model described throughout the paper. All other results remain unchanged. 2) Max true offset in the training data corrected from 179 ms to 1789 ms. 3) Detectability "boundary/range" renamed to detectability "thresholds".
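A minimal sketch of the offset search, assuming the two streams have already been embedded into per-window vectors by the two-stream network (the embedding networks themselves are omitted): the candidate offset minimising the average embedding distance over the overlap is selected.

```python
import numpy as np

def best_offset(audio_emb, ultra_emb, candidate_offsets):
    """audio_emb, ultra_emb: (num_windows, dim) per-window embeddings.
    For each candidate offset (in windows), shift one stream against the
    other, average the Euclidean distance over the overlap, and return
    the offset with the smallest distance."""
    scores = {}
    for off in candidate_offsets:
        a = audio_emb[max(0, off):]
        u = ultra_emb[max(0, -off):]
        n = min(len(a), len(u))
        scores[off] = np.linalg.norm(a[:n] - u[:n], axis=1).mean()
    return min(scores, key=scores.get)

audio = np.random.rand(100, 64)
ultra = np.vstack([0.01 * np.random.rand(3, 64), audio[:-3]])  # ultrasound lags by 3 windows
print(best_offset(audio, ultra, range(-10, 11)))  # -3 under this sign convention
```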
TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos
We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of audio,
ultrasound tongue imaging, and lip videos. TaL consists of two parts: TaL1 is a
set of six recording sessions of one professional voice talent, a male native
speaker of English; TaL80 is a set of recording sessions of 81 native speakers
of English without voice talent experience. Overall, the corpus contains 24
hours of parallel ultrasound, video, and audio data, of which approximately
13.5 hours are speech. This paper describes the corpus and presents benchmark
results for the tasks of speech recognition, speech synthesis
(articulatory-to-acoustic mapping), and automatic synchronisation of ultrasound
to audio. The TaL corpus is publicly available under the CC BY-NC 4.0 license.
Comment: 8 pages, 4 figures, accepted to SLT 2021 (IEEE Spoken Language Technology Workshop).
UltraSuite: A Repository of Ultrasound and Acoustic Data from Child Speech Therapy Sessions
We introduce UltraSuite, a curated repository of ultrasound and acoustic data, collected from recordings of child speech therapy sessions. This release includes three data collections, one from typically developing children and two from children with speech sound disorders. In addition, it includes a set of annotations, some manual and some automatically produced, and software tools to process, transform and visualise the data.
Manual and automatic labels for version 1.0 of UXTD, UXSSD, and UPX core data -- version 1.0
UltraSuite is a repository of ultrasound and acoustic data from child speech therapy sessions. The current release includes three data collections, one from typically developing children (UXTD) and two from children with speech sound disorders (UXSSD and UPX). This dataset contains additional materials for version 1.0 of Ultrax Typically Developing Children (UXTD), Ultrax Speech Sound Disorders (UXSSD), and UltraPhonix (UPX). It includes transcriptions, labels provided by speech therapists, reference labels with word boundaries, and automatically derived speaker labels (therapist or child), phone boundary labels, and word boundary labels.
Eshky, Aciel; Ribeiro, Manuel Sam; Cleland, Joanne; Renals, Steve; Richmond, Korin; Roxburgh, Zoe; Scobbie, James; Wrench, Alan. (2018). Manual and automatic labels for version 1.0 of UXTD, UXSSD, and UPX core data -- version 1.0, [dataset]. University of Edinburgh. School of Informatics. https://doi.org/10.7488/ds/2429