Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative
Dialog policies, which determine a system's action based on the current state
at each dialog turn, are crucial to the success of the dialog. In recent years,
reinforcement learning (RL) has emerged as a promising option for dialog policy
learning (DPL). In RL-based DPL, dialog policies are updated according to
rewards. The manual construction of fine-grained rewards, such as
state-action-based ones, to effectively guide the dialog policy is challenging
in multi-domain task-oriented dialog scenarios with numerous state-action pair
combinations. One way to estimate rewards from collected data is to train the
reward estimator and dialog policy simultaneously using adversarial learning
(AL). Although this method has demonstrated superior performance
experimentally, it is fraught with the inherent problems of AL, such as mode
collapse. This paper first identifies the role of AL in DPL through detailed
analyses of the objective functions of the dialog policy and the reward estimator.
Next, based on these analyses, we propose a method that eliminates AL from
reward estimation and DPL while retaining its advantages. We evaluate our
method using MultiWOZ, a multi-domain task-oriented dialog corpus.
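To make the adversarial setup this abstract critiques concrete, here is a minimal GAIL-style sketch in PyTorch of training a reward estimator as a discriminator over state-action pairs. The class name, dimensions, and loss wiring are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of adversarial reward estimation for DPL (GAIL-style).
# Names, shapes, and wiring are assumptions for illustration only.
import torch
import torch.nn as nn

class RewardEstimator(nn.Module):
    """Discriminator over (state, action) pairs; its score doubles as a reward."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

    def reward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # The policy is rewarded when the discriminator mistakes its
        # (state, action) pairs for expert pairs.
        return torch.log(torch.sigmoid(self(state, action)) + 1e-8)

def discriminator_loss(d: RewardEstimator,
                       expert_s, expert_a, policy_s, policy_a) -> torch.Tensor:
    # Expert pairs are labeled 1, policy-generated pairs 0; the reward
    # estimator and the dialog policy are updated in alternation, which is
    # exactly the adversarial min-max loop that is prone to mode collapse.
    bce = nn.BCEWithLogitsLoss()
    e, p = d(expert_s, expert_a), d(policy_s, policy_a)
    return bce(e, torch.ones_like(e)) + bce(p, torch.zeros_like(p))
```

The paper's proposal removes this alternating min-max loop from reward estimation and DPL while keeping the benefits of a learned reward signal.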
Rethinking Supervised Learning and Reinforcement Learning in Task-Oriented Dialogue Systems
Dialogue policy learning for task-oriented dialogue systems has enjoyed great
progress recently mostly through employing reinforcement learning methods.
However, these approaches have become very sophisticated, and it is time to
re-evaluate them: are we really making progress by developing dialogue agents
based only on reinforcement learning? We demonstrate how (1) traditional
supervised learning together with (2) a simulator-free adversarial learning method can be
used to achieve performance comparable to state-of-the-art RL-based methods.
First, we introduce a simple dialogue action decoder to predict the appropriate
actions. Then, the traditional multi-label classification solution for dialogue
policy learning is extended by adding dense layers to improve the dialogue
agent performance. Finally, we employ the Gumbel-Softmax estimator to
alternatively train the dialogue agent and the dialogue reward model without
using reinforcement learning. Based on extensive experiments, we conclude
that the proposed methods achieve more stable and higher performance with
less effort, avoiding, for example, the domain knowledge required to design a
user simulator and the intractable parameter tuning of reinforcement
learning. Our main goal is not to beat reinforcement learning with supervised
learning, but to demonstrate the value of rethinking the roles of
reinforcement learning and supervised learning in optimizing task-oriented
dialogue systems.
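To make the Gumbel-Softmax step concrete, below is a hedged PyTorch sketch of how a differentiable action sample lets gradients from a reward model flow back into the action decoder, replacing the RL policy-gradient update. The function, tensor shapes, and the single-categorical simplification of the action space are assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' code): training a dialogue action
# decoder against a learned reward model via the Gumbel-Softmax estimator.
import torch
import torch.nn.functional as F

def policy_loss(action_logits: torch.Tensor,
                state: torch.Tensor,
                reward_model: torch.nn.Module,
                tau: float = 1.0) -> torch.Tensor:
    # hard=True yields a one-hot action in the forward pass while the
    # backward pass uses the soft relaxation (straight-through estimator),
    # so the sampled action stays differentiable.
    action = F.gumbel_softmax(action_logits, tau=tau, hard=True)
    # Because the sample is differentiable, the reward model's score can be
    # maximized directly by gradient descent, with no policy gradients.
    score = reward_model(torch.cat([state, action], dim=-1))
    return -score.mean()
```

Alternating this update with a discriminator-style update of the reward model gives the simulator-free training loop the abstract describes.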
Optimizing Interactive Systems via Data-Driven Objectives
Effective optimization is essential for real-world interactive systems to
provide a satisfactory user experience in response to changing user behavior.
However, it is often challenging to find an objective to optimize for
interactive systems (e.g., policy learning in task-oriented dialog systems).
Generally, such objectives are manually crafted and rarely capture complex user
needs in an accurate manner. We propose an approach that infers the objective
directly from observed user interactions. These inferences can be made
regardless of prior knowledge and across different types of user behavior. We
introduce Interactive System Optimizer (ISO), a novel algorithm that uses these
inferred objectives for optimization. Our main contribution is a new general
principled approach to optimizing interactive systems using data-driven
objectives. We demonstrate the high effectiveness of ISO over several
simulations.
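The abstract does not spell out the inference mechanism, but one loose reading, sketched below, is to fit a state-scoring function from observed transitions so that states users actually move toward rank above the states they leave. This is a hypothetical illustration, not the ISO algorithm itself.

```python
# Loose, hypothetical sketch of inferring a data-driven objective from
# observed user interactions; this is not the ISO algorithm itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InferredObjective(nn.Module):
    """Scores interaction states; the score acts as the learned objective."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.score(state)

def ranking_loss(obj: InferredObjective,
                 s_t: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
    # Observed successor states should score higher than their predecessors,
    # on the assumption that users steer interactions toward what they want.
    return F.softplus(obj(s_t) - obj(s_next)).mean()
```

The interactive system would then be optimized against the learned score in place of a manually crafted objective.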