Task-Oriented Dialog Systems that Consider Multiple Appropriate Responses under the Same Context
Conversations have an intrinsic one-to-many property, which means that
multiple responses can be appropriate for the same dialog context. In
task-oriented dialogs, this property leads to different valid dialog policies
towards task completion. However, none of the existing task-oriented dialog
generation approaches takes this property into account. We propose a
Multi-Action Data Augmentation (MADA) framework to utilize the one-to-many
property to generate diverse appropriate dialog responses. Specifically, we
first use dialog states to summarize the dialog history, and then discover all
possible mappings from every dialog state to its different valid system
actions. During dialog system training, we enable the current dialog state to
map to all valid system actions discovered in the previous process to create
additional state-action pairs. By incorporating these additional pairs, the
dialog policy learns a balanced action distribution, which further guides the
dialog model to generate diverse responses. Experimental results show that the
proposed framework consistently improves dialog policy diversity, and results
in improved response diversity and appropriateness. Our model obtains
state-of-the-art results on MultiWOZ.
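The core MADA idea above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the state and action strings, the function name, and the example values are all invented for clarity. It collects every valid system action observed for each dialog state across the corpus, then expands the training set so each state is paired with all of its valid actions.

```python
from collections import defaultdict

def augment_state_action_pairs(dialogs):
    """MADA-style augmentation sketch.

    dialogs: iterable of (state, action) pairs from the corpus.
    Returns one training pair per (state, valid action), so a state
    observed with several valid actions yields several pairs.
    """
    state_to_actions = defaultdict(set)
    for state, action in dialogs:
        state_to_actions[state].add(action)
    return [(state, action)
            for state, actions in state_to_actions.items()
            for action in sorted(actions)]

# Illustrative corpus: the same state appears with two different valid
# actions, and one duplicate pair collapses in the set.
pairs = [("hotel?price=cheap", "request(area)"),
         ("hotel?price=cheap", "inform(choice)"),
         ("taxi?dest=cambridge", "request(leave_at)"),
         ("hotel?price=cheap", "request(area)")]
print(augment_state_action_pairs(pairs))
```

Training on the expanded pairs exposes the policy to every valid action for a state, which is what flattens the action distribution toward the balanced one described above.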
Causal-aware Safe Policy Improvement for Task-oriented dialogue
The recent success of reinforcement learning (RL) in solving complex tasks
is most often attributed to its capacity to explore and exploit the environment
in which it is trained. Sample efficiency is usually not an issue, since cheap
simulators are available to sample data on-policy. Task-oriented dialogues, on
the other hand, are usually learnt from offline data collected through human
demonstrations, and collecting diverse demonstrations and annotating them is
expensive. Unfortunately, RL methods trained on off-policy data are prone to
issues of bias and generalization, which are further exacerbated by
stochasticity in human responses and the non-Markovian belief state of a
dialogue management system. To this end, we propose a batch RL framework for
task-oriented dialogue policy learning: Causal-Aware Safe Policy Improvement
(CASPI). This method provides guarantees on the dialogue policy's performance
and also learns to shape rewards according to the intentions behind human
responses rather than just mimicking demonstration data; coupled with batch
RL, this improves the overall sample efficiency of the framework. We
demonstrate the effectiveness of this framework on the dialogue-context-to-text
generation and end-to-end dialogue tasks of the MultiWOZ 2.0 dataset. The
proposed method outperforms the current state of the art in both cases. In the
end-to-end case, our method trained on only 10% of the data outperforms the
current state of the art on three of four evaluation metrics.
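The "safe policy improvement" guarantee mentioned above can be illustrated with a toy acceptance test. This is not the authors' CASPI implementation, and the function name, values, and margin are all invented: it only shows the general criterion used by safe batch RL methods, namely accepting a candidate policy only when its off-policy value estimate beats the behavior (demonstration) policy by a confidence margin.

```python
def is_safe_improvement(v_candidate, v_behavior, confidence_margin):
    """Accept a candidate policy only if its estimated value exceeds the
    behavior policy's value by at least the given confidence margin.
    All three arguments are scalar value estimates from offline data."""
    return v_candidate - v_behavior >= confidence_margin

# Hypothetical numbers: candidate clears a 0.05 margin over the
# behavior policy in the first case but not the second.
print(is_safe_improvement(0.72, 0.65, 0.05))  # → True
print(is_safe_improvement(0.68, 0.65, 0.05))  # → False
```

The margin encodes estimation uncertainty: the tighter the off-policy value estimate, the smaller the margin can be while still guaranteeing no degradation relative to the demonstrations.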
DORA: Toward Policy Optimization for Task-oriented Dialogue System with Efficient Context
Recently, reinforcement learning (RL) has been applied to task-oriented
dialogue systems by using latent actions to solve shortcomings of supervised
learning (SL). In this paper, we propose a multi-domain task-oriented dialogue
system, called Dialogue System with Optimizing a Recurrent Action Policy using
Efficient Context (DORA), that is trained with SL and subsequently optimized
with RL using a recurrent dialogue policy. This dialogue policy recurrently
generates explicit system actions as both a word-level and a high-level policy.
As a result, DORA is clearly optimized during both the SL and RL steps by using
an explicit system action policy that considers an efficient context instead of
the entire dialogue history. The system actions are both interpretable and
controllable, whereas latent actions are not. DORA improved the success rate by
6.6 points on MultiWOZ 2.0 and by 10.9 points on MultiWOZ 2.1.
Comment: 23 pages, 9 figures, submitted to the Computer Speech and Language
journal