78 research outputs found
The Dialog State Tracking Challenge Series: A Review
In a spoken dialog system, dialog state tracking refers to the task of correctly inferring the state of the conversation -- such as the user's goal -- given all of the dialog history up to that turn. Dialog state tracking is crucial to the success of a dialog system, yet until recently there were no common resources, hampering progress. The Dialog State Tracking Challenge series of 3 tasks introduced the first shared testbed and evaluation metrics for dialog state tracking, and has underpinned three key advances in dialog state tracking: the move from generative to discriminative models; the adoption of discriminative sequential techniques; and the incorporation of the speech recognition results directly into the dialog state tracker. This paper reviews this research area, covering both the challenge tasks themselves and summarizing the work they have enabled
A Multi-Task Approach to Incremental Dialogue State Tracking
Incrementality is a fundamental feature of language in real-world use. To this point, however, the vast majority of work in automated dialogue processing has focused on language as turn-based. In this paper we explore the challenge of incremental dialogue state tracking through the development and analysis of a multi-task approach to incremental dialogue state tracking. We present the design of our incremental dialogue state tracker in detail and provide evaluation against the well-known Dialogue State Tracking Challenge 2 (DSTC2) dataset. In addition to a standard evaluation of the tracker, we also provide an analysis of the incrementality phenomenon in our model’s performance by analyzing how early our models can produce correct predictions and how stable those predictions are. We find that the Multi-Task Learning-based model achieves state-of-the-art results for incremental processing.
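The two questions raised in this abstract, how early a tracker becomes correct and how stable its predictions are, can be illustrated with a minimal sketch. The function names, the metric definitions, and the sample predictions below are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Hashable, Optional, Sequence


def earliest_stable_correct(preds: Sequence[Hashable], gold: Hashable) -> Optional[int]:
    """Index of the first prefix from which the prediction is correct and stays correct.

    Returns None if the final prediction is wrong (or there are no predictions).
    """
    first = None
    for i, p in enumerate(preds):
        if p == gold:
            if first is None:
                first = i
        else:
            # A later wrong prediction cancels any earlier correct streak.
            first = None
    return first


def edit_overhead(preds: Sequence[Hashable]) -> int:
    """Number of times the prediction changes between consecutive prefixes."""
    return sum(1 for a, b in zip(preds, preds[1:]) if a != b)


# Hypothetical per-word predictions for a "food" slot as an utterance unfolds.
preds = ["none", "none", "thai", "chinese", "chinese", "chinese"]
print(earliest_stable_correct(preds, "chinese"))  # 3
print(edit_overhead(preds))  # 2
```

A lower first-stable index means earlier commitment to the right value, and a lower change count means fewer revisions along the way.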
Structured Dialogue State Management for Task-Oriented Dialogue Systems
Human-machine conversational agents have developed at a rapid pace in recent years, bolstered through the application of advanced technologies such as deep learning. Today, dialogue systems are useful in assisting users in various activities, especially task-oriented dialogue systems in specific dialogue domains. However, they continue to be limited in many ways. Arguably the biggest challenge lies in the complexity of natural language and interpersonal communication, and the lack of human context and knowledge available to these systems. This leads to the question of whether dialogue systems, and in particular task-oriented dialogue systems, can be enhanced to leverage various language properties. This work focuses on the semantic structural properties of language in task-oriented dialogue systems. These structural properties are manifest by variable dependencies in dialogue domains; and the study of and accounting for these variables and their interdependencies is the main objective of this research.
Contemporary task-oriented dialogue systems are typically developed with a multiple component architecture, where each component is responsible for a specific process in the conversational interaction. It is commonly accepted that the ability to understand user input in a conversational context, a responsibility generally assigned to the dialogue state tracking component, contributes a huge part to the overall performance of dialogue systems. The output of the dialogue state tracking component, so-called dialogue states, are a representation of the aspects of a dialogue relevant to the completion of a task up to that point, and should also capture the task structural properties of natural language. Here, in a dialogue context dialogue state variables are expressed through dialogue slots and slot values, hence the dialogue state variable dependencies are expressed as the dependencies between dialogue slots and their values. Incorporating slot dependencies in the dialogue state tracking process is herein hypothesised to enhance the accuracy of postulated dialogue states, and subsequently potentially improve the performance of task-oriented dialogue systems.
Given this overall goal and approach to the improvement of dialogue systems, the work in this dissertation can be broken down into two related contributions: (i) a study of structural properties in dialogue states; and (ii) the investigation of novel modelling approaches to capture slot dependencies in dialogue domains.
The analysis of language's structural properties was conducted with a corpus-based study to investigate whether variable dependencies, i.e., slot dependencies when using dialogue system terminology, exist in dialogue domains, and, if so, to what extent these dependencies affect the dialogue state tracking process. A number of public dialogue corpora were chosen for analysis, and a collection of statistical methods was applied to them.
Deep learning architectures have been shown in various works to be an effective method to model conversations and different types of machine learning challenges. In this research, in order to account for slot dependencies, a number of deep learning-based models were experimented with for the dialogue state tracking task. In particular, a multi-task learning system was developed to study the leveraging of common features and shared knowledge in the training of dialogue state tracking subtasks such as tracking different slots, hence investigating the associations between these slots. Beyond that, a structured prediction method, based on energy-based learning, was also applied to account for explicit dialogue slot dependencies.
The study results show promising directions for solving the dialogue state tracking challenge for task-oriented dialogue systems. By accounting for slot dependencies in dialogue domains, dialogue states were produced more accurately when benchmarked against comparative modelling methods that do not take advantage of the same principle. Furthermore, the structured prediction method is applicable to various state-of-the-art modelling approaches for further study.
In the long term, the study of dialogue state slot dependencies can potentially be expanded to a wider range of conversational aspects such as personality, preferences, and modalities, as well as user intents.
A data-driven approach to spoken dialog segmentation
In this paper, we present a statistical model for spoken dialog segmentation that decides the current phase of the dialog by means of an automatic classification process. We have applied our proposal to three practical conversational systems acting in different domains. The results of the evaluation show that it is possible to attain high accuracy rates in dialog segmentation when using different sources of information to represent the user input. Our results indicate how the proposed module can also improve dialog management by selecting better system answers. The statistical model developed with human-machine dialog corpora has been applied in one of our experiments to human-human conversations and provides a good baseline as well as insights into the model's limitations.
Discriminative methods for statistical spoken dialogue systems
Dialogue promises a natural and effective method for users to interact with and obtain information from computer systems. Statistical spoken dialogue systems are able to disambiguate in the presence of errors by maintaining probability distributions over what they believe to be the state of a dialogue. However, traditionally these distributions have been derived using generative models, which do not directly optimise for the criterion of interest and cannot easily exploit arbitrary information that may potentially be useful. This thesis presents how discriminative methods can overcome these problems in Spoken Language Understanding (SLU) and Dialogue State Tracking (DST).
A robust method for SLU is proposed, based on features extracted from the full posterior distribution of recognition hypotheses encoded in the form of word confusion networks. This method uses discriminative classifiers, trained on unaligned input/output pairs. Performance is evaluated on both an off-line corpus, and on-line in a live user trial. It is shown that a statistical discriminative approach to SLU operating on the full posterior ASR output distribution can substantially improve performance in terms of both accuracy and overall dialogue reward. Furthermore, additional gains can be obtained by incorporating features from the system's output.
For DST, a new word-based tracking method is presented that maps directly from the speech recognition results to the dialogue state without using an explicit semantic decoder. The method is based on a recurrent neural network structure that is capable of generalising to unseen dialogue state hypotheses, and requires very little feature engineering. The method is evaluated in the second and third Dialog State Tracking Challenges, as well as in a live user trial. The results demonstrate consistently high performance across all of the off-line metrics and a substantial increase in the quality of the dialogues in the live trial. The proposed method is shown to be readily applied to expanding dialogue domains, by exploiting robust features and a new method for online unsupervised adaptation. It is shown how the neural network structure can be adapted to output structured joint distributions, giving an improvement over estimating the dialogue state as a product of marginal distributions
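The difference between a joint distribution over the dialogue state and a product of per-slot marginals can be seen in a toy example. The slots, values, and probabilities below are hypothetical and are not drawn from the thesis; the sketch only shows that correlated slots are lost when the state is factored.

```python
from itertools import product

# Hypothetical joint belief over two slots (food, area).
# The two slots are strongly correlated in this made-up example.
joint = {
    ("thai", "north"): 0.45,
    ("thai", "south"): 0.05,
    ("chinese", "north"): 0.05,
    ("chinese", "south"): 0.45,
}


def marginal(joint_dist, index):
    """Sum the joint distribution down to a single slot."""
    out = {}
    for values, p in joint_dist.items():
        out[values[index]] = out.get(values[index], 0.0) + p
    return out


food = marginal(joint, 0)
area = marginal(joint, 1)

# Product-of-marginals approximation: treats the slots as independent.
factored = {(f, a): food[f] * area[a] for f, a in product(food, area)}

for state in joint:
    print(state, "joint:", joint[state], "product of marginals:", round(factored[state], 4))
```

Here each marginal is uniform, so the factored estimate gives every combination the same probability, while the joint clearly prefers two of the four combinations.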
Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
A crucial step in processing speech audio data for information extraction,
topic detection, or browsing/playback is to segment the input into sentence and
topic units. Speech segmentation is challenging, since the cues typically
present for segmenting text (headers, paragraphs, punctuation) are absent in
spoken language. We investigate the use of prosody (information gleaned from
the timing and melody of speech) for these tasks. Using decision tree and
hidden Markov modeling techniques, we combine prosodic cues with word-based
approaches, and evaluate performance on two speech corpora, Broadcast News and
Switchboard. Results show that the prosodic model alone performs on par with,
or better than, word-based statistical language models -- for both true and
automatically recognized words in news speech. The prosodic model achieves
comparable performance with significantly less training data, and requires no
hand-labeling of prosodic events. Across tasks and corpora, we obtain a
significant improvement over word-only models using a probabilistic combination
of prosodic and lexical information. Inspection reveals that the prosodic
models capture language-independent boundary indicators described in the
literature. Finally, cue usage is task and corpus dependent. For example, pause
and pitch features are highly informative for segmenting news speech, whereas
pause, duration and word-based cues dominate for natural conversation.
Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2),
Special Issue on Accessing Information in Spoken Audio, September 200
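One simple form of probabilistic combination of prosodic and lexical information is linear interpolation of two boundary posteriors. The sketch below assumes that form purely for illustration; the abstract does not specify the combination method, and the function name, weight, threshold, and sample probabilities are all hypothetical.

```python
def combine(p_prosody: float, p_lexical: float, weight: float = 0.5) -> float:
    """Linearly interpolate two boundary posteriors; weight is the prosody share."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * p_prosody + (1.0 - weight) * p_lexical


# Hypothetical posteriors P(boundary | cues) at three word boundaries.
prosody = [0.10, 0.80, 0.30]
lexical = [0.20, 0.60, 0.60]

combined = [combine(p, l, weight=0.5) for p, l in zip(prosody, lexical)]

# Mark a boundary wherever the combined posterior exceeds an assumed threshold.
boundaries = [i for i, p in enumerate(combined) if p > 0.5]
print(boundaries)  # [1]
```

In practice the interpolation weight would be tuned on held-out data, which is one way the task and corpus dependence of cue usage could show up.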
Recurrent Neural Network Language Generation for Dialogue Systems
Language is the principal medium for ideas, while dialogue is the most natural and effective way for humans to interact with and access information from machines. Natural language generation (NLG) is a critical component of spoken dialogue and it has a significant impact on usability and perceived quality. Many commonly used NLG systems employ rules and heuristics, which tend to generate inflexible and stylised responses without the natural variation of human language. However, the frequent repetition of identical output forms can quickly make dialogue become tedious for most real-world users. Additionally, these rules and heuristics are not scalable and hence not trivially extensible to other domains or languages. A statistical approach to language generation can learn language decisions directly from data without relying on hand-coded rules or heuristics, which brings scalability and flexibility to NLG. Statistical models also provide an opportunity to learn in-domain human colloquialisms and cross-domain model adaptations.
A robust and quasi-supervised NLG model is proposed in this thesis. The model leverages a Recurrent Neural Network (RNN)-based surface realiser and a gating mechanism applied to input semantics. The model is motivated by the Long-Short Term Memory (LSTM) network. The RNN-based surface realiser and gating mechanism use a neural network to learn end-to-end language generation decisions from input dialogue act and sentence pairs; it also integrates sentence planning and surface realisation into a single optimisation problem. The single optimisation not only bypasses the costly intermediate linguistic annotations but also generates more natural and human-like responses. Furthermore, a domain adaptation study shows that the proposed model can be readily adapted and extended to new dialogue domains via a proposed recipe.
Continuing the success of end-to-end learning, the second part of the thesis speculates on building an end-to-end dialogue system by framing it as a conditional generation problem. The proposed model encapsulates a belief tracker with a minimal state representation and a generator that takes the dialogue context to produce responses. These features suggest comprehension and fast learning. The proposed model is capable of understanding requests and accomplishing tasks after training on only a few hundred human-human dialogues. A complementary Wizard-of-Oz data collection method is also introduced to facilitate the collection of human-human conversations from online workers. The results demonstrate that the proposed model can talk to human judges naturally, without any difficulty, for a sample application domain. In addition, the results also suggest that the introduction of a stochastic latent variable can help the system model intrinsic variation in communicative intention much better.
Tsung-Hsien Wen's Ph.D. is supported by Toshiba Research Europe Ltd, Cambridge Research Laborator
Reinforcement Learning for Generative AI: A Survey
Deep Generative AI has been a long-standing essential topic in the machine
learning community, which can impact a number of application areas like text
generation and computer vision. The major paradigm to train a generative model
is maximum likelihood estimation, which pushes the learner to capture and
approximate the target data distribution by decreasing the divergence between
the model distribution and the target distribution. This formulation
successfully establishes the objective of generative tasks, while it is
incapable of satisfying all the requirements that a user might expect from a
generative model. Reinforcement learning, serving as a competitive option to
inject new training signals by creating new objectives that exploit novel
signals, has demonstrated its power and flexibility to incorporate human
inductive bias from multiple angles, such as adversarial learning,
hand-designed rules and learned reward model to build a performant model.
Thereby, reinforcement learning has become a trending research field and has
stretched the limits of generative AI in both model design and application. It
is reasonable to summarize the advances of recent years in a
comprehensive review. Although surveys of individual application areas have
appeared recently, this survey aims to provide a high-level review that spans a
range of application areas. We provide a rigorous taxonomy of this area and
sufficient coverage of various models and applications. Notably, we also
survey the fast-developing large language model area. We conclude this survey
by outlining potential directions that might address the limits of current
models and expand the frontiers of generative AI.
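The link this abstract draws between maximum likelihood estimation and decreasing a divergence can be made explicit. The derivation below assumes the divergence is the forward Kullback-Leibler divergence from the data distribution to the model; the symbols $p_{\text{data}}$, $p_\theta$ and $\theta$ are notation introduced here, not taken from the survey.

```latex
\begin{align}
D_{\mathrm{KL}}\left(p_{\text{data}} \,\|\, p_\theta\right)
  &= \mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_{\text{data}}(x)\right]
   - \mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_\theta(x)\right], \\
\arg\min_{\theta} D_{\mathrm{KL}}\left(p_{\text{data}} \,\|\, p_\theta\right)
  &= \arg\max_{\theta} \mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_\theta(x)\right].
\end{align}
```

The first expectation does not depend on $\theta$, so minimizing the divergence and maximizing the expected log-likelihood select the same parameters under this assumption.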
Automatic recognition of multiparty human interactions using dynamic Bayesian networks
Relating statistical machine learning approaches to the automatic analysis of multiparty
communicative events, such as meetings, is an ambitious research area. We
have investigated automatic meeting segmentation both in terms of “Meeting Actions”
and “Dialogue Acts”. Dialogue acts model the discourse structure at a fine-grained
level, highlighting individual speaker intentions. Group meeting actions describe
the same process at a coarse level, highlighting interactions between different
meeting participants and showing overall group intentions.
A framework based on probabilistic graphical models such as dynamic Bayesian
networks (DBNs) has been investigated for both tasks. Our first set of experiments
is concerned with the segmentation and structuring of meetings (recorded using
multiple cameras and microphones) into sequences of group meeting actions such
as monologue, discussion and presentation. We outline four families of multimodal
features based on speaker turns, lexical transcription, prosody, and visual motion
that are extracted from the raw audio and video recordings. We relate these low-level
multimodal features to complex group behaviours, proposing a multistream modelling
framework based on dynamic Bayesian networks. Later experiments are
concerned with the automatic recognition of Dialogue Acts (DAs) in multiparty
conversational speech. We present a joint generative approach based on a switching
DBN for DA recognition in which segmentation and classification of DAs are
carried out in parallel. This approach models a set of features, related to lexical
content and prosody, and incorporates a weighted interpolated factored language
model. In conjunction with this joint generative model, we have also investigated
the use of a discriminative approach, based on conditional random fields, to perform
a reclassification of the segmented DAs.
The DBN-based approach yielded significant improvements when applied to both
the meeting action and the dialogue act recognition tasks. On both tasks, the DBN
framework provided an effective factorisation of the state space and a flexible infrastructure
able to integrate a heterogeneous set of resources such as continuous
and discrete multimodal features, and statistical language models. Although our
experiments have principally targeted multiparty meetings, the features, models,
and methodologies developed in this thesis can be employed for a wide range
of applications. Moreover, both group meeting actions and DAs offer valuable insights
about the current conversational context, providing useful cues and features
for several related research areas such as speaker addressing and focus of attention
modelling, automatic speech recognition and understanding, topic and decision detection
- …