Switch-based Active Deep Dyna-Q: Efficient Adaptive Planning for Task-Completion Dialogue Policy Learning
Training task-completion dialogue agents with reinforcement learning usually
requires a large number of real user experiences. The Dyna-Q algorithm extends
Q-learning by integrating a world model, and thus can effectively boost
training efficiency using simulated experiences generated by the world model.
The effectiveness of Dyna-Q, however, depends on the quality of the world model,
or, implicitly, on the pre-specified ratio of real to simulated experiences used
for Q-learning. To address this, we extend the recently proposed Deep Dyna-Q (DDQ)
framework by integrating a switcher that automatically determines whether to
use a real or a simulated experience for Q-learning. Furthermore, we explore the
use of active learning for improving sample efficiency, by encouraging the
world model to generate simulated experiences in the state-action space where
the agent has not (fully) explored. Our results show that, by combining the
switcher and active learning, the new framework, named Switch-based Active Deep
Dyna-Q (Switch-DDQ), leads to significant improvements over DDQ and Q-learning
baselines in both simulation and human evaluations.
Comment: 8 pages, 9 figures, AAAI 2019
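As a rough illustration of the switching idea, here is a minimal tabular Dyna-Q sketch on a hypothetical toy chain MDP: every real transition trains both the Q-function and a world model, and a switcher decides whether a simulated transition is trustworthy enough to learn from. The environment, the error threshold, and the error-based switcher heuristic are all illustrative assumptions; the paper itself uses deep Q-networks and a learned switcher for task-completion dialogue.

import random
from collections import defaultdict

N_STATES, N_ACTIONS = 6, 2           # toy chain MDP: walk right to the goal
GAMMA, ALPHA = 0.95, 0.1

def real_env_step(s, a):
    """Action 1 moves right, action 0 moves left; reward 1 at the last state."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

Q = defaultdict(float)               # tabular Q[(s, a)]
model = {}                           # learned world model: (s, a) -> (s', r)
model_err = defaultdict(lambda: 1.0) # running prediction error per (s, a)

def q_update(s, a, r, s2):
    target = r + GAMMA * max(Q[(s2, b)] for b in range(N_ACTIONS))
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

for episode in range(200):
    s = 0
    for _ in range(20):
        a = (random.randrange(N_ACTIONS) if random.random() < 0.1
             else max(range(N_ACTIONS), key=lambda b: Q[(s, b)]))
        s2, r = real_env_step(s, a)
        if (s, a) in model:          # track how well the model predicted this step
            ps2, pr = model[(s, a)]
            model_err[(s, a)] = (0.9 * model_err[(s, a)]
                                 + 0.1 * (float(ps2 != s2) + abs(pr - r)))
        model[(s, a)] = (s2, r)      # deterministic one-sample world model
        q_update(s, a, r, s2)        # always learn from the real experience
        # Switcher: also learn from a simulated experience, but only where
        # the world model has proven reliable.
        sim_s, sim_a = random.choice(list(model))
        if model_err[(sim_s, sim_a)] < 0.05:
            sim_s2, sim_r = model[(sim_s, sim_a)]
            q_update(sim_s, sim_a, sim_r, sim_s2)
        s = s2

Raising the error threshold makes the agent lean harder on simulated experience; choosing that balance adaptively, rather than via a pre-specified ratio, is exactly what Switch-DDQ automates.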
Deep Reinforcement Learning for Dialogue Generation
Recent neural models of dialogue generation offer great promise for
generating responses for conversational agents, but tend to be shortsighted,
predicting utterances one at a time while ignoring their influence on future
outcomes. Modeling the future direction of a dialogue is crucial to generating
coherent, interesting dialogues, a need which led traditional NLP models of
dialogue to draw on reinforcement learning. In this paper, we show how to
integrate these goals, applying deep reinforcement learning to model future
reward in chatbot dialogue. The model simulates dialogues between two virtual
agents, using policy gradient methods to reward sequences that display three
useful conversational properties: informativity (non-repetitive turns),
coherence, and ease of answering (related to forward-looking function). We
evaluate our model on diversity and dialogue length, as well as with human
judges, showing that the proposed algorithm generates more interactive responses
and fosters more sustained conversations in dialogue simulation. This work marks
a first step towards learning a neural conversational model based on the
long-term success of dialogues.
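The policy-gradient recipe described above can be sketched as a one-step REINFORCE update whose reward is a weighted sum of three scores. Everything below is a stand-in under stated assumptions: the toy vocabulary and the hand-coded reward terms are hypothetical, whereas the paper derives its informativity, coherence, and ease-of-answering scores from learned sequence models over full utterances.

import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8                        # toy "utterance" space: 8 token ids
theta = np.zeros(VOCAB)          # logits of a one-step categorical policy

def sample_utterance(theta):
    p = np.exp(theta - theta.max())
    p /= p.sum()
    return rng.choice(VOCAB, p=p), p

def reward(u, prev_u):
    info = 1.0 if u != prev_u else -1.0            # penalize repetitive turns
    coherence = 0.5 if abs(u - prev_u) <= 2 else 0.0
    ease = 0.5 if u % 2 == 0 else 0.0              # "easy to answer" stand-in
    return 0.5 * info + 0.25 * coherence + 0.25 * ease

prev_u, lr, baseline = 0, 0.1, 0.0
for step in range(500):
    u, p = sample_utterance(theta)
    r = reward(u, prev_u)
    baseline = 0.95 * baseline + 0.05 * r          # variance-reducing baseline
    grad_logp = -p
    grad_logp[u] += 1.0                            # d/dtheta of log pi(u | theta)
    theta += lr * (r - baseline) * grad_logp       # REINFORCE update
    prev_u = u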
Goal-oriented Dialogue Policy Learning from Failures
Reinforcement learning methods have been used for learning dialogue policies.
However, learning an effective dialogue policy frequently requires
prohibitively many conversations. This is partly because rewards in dialogue are
sparse and successful dialogues are very rare in the early learning phase.
Hindsight experience replay (HER) enables learning from failures, but vanilla
HER is inapplicable to dialogue learning because the goals of a dialogue are
implicit. In this work, we develop two more complex HER methods that offer
different trade-offs between complexity and performance and, for the first time,
enable HER-based dialogue policy learning. Experiments using a realistic user
simulator show that our HER methods learn faster than existing experience replay
methods (as applied to deep Q-networks).
Adversarial Learning for Neural Dialogue Generation
In this paper, drawing intuition from the Turing test, we propose using
adversarial training for open-domain dialogue generation: the system is trained
to produce sequences that are indistinguishable from human-generated dialogue
utterances. We cast the task as a reinforcement learning (RL) problem in which
we jointly train two systems: a generative model that produces response
sequences, and a discriminator, analogous to the human evaluator in the Turing
test, that distinguishes between human-generated and machine-generated
dialogues. The outputs of the discriminator are then used as rewards for the
generative model, pushing the system to generate dialogues that closely resemble
human dialogues.
In addition to adversarial training, we describe a model for adversarial
evaluation that uses success in fooling an adversary as a dialogue evaluation
metric, while avoiding a number of potential pitfalls. Experimental results on
several metrics, including adversarial evaluation, demonstrate that the
adversarially trained system generates higher-quality responses than previous
baselines.
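The reward flow described above, in which the discriminator's belief that a response is human-generated becomes the generator's reinforcement signal, can be sketched with two toy logistic models. The feature vectors, learning rates, and two-response action space are hypothetical stand-ins for the paper's seq2seq generator and dialogue discriminator.

import numpy as np

rng = np.random.default_rng(1)
d_w = rng.normal(size=4)             # discriminator weights
g_theta = np.zeros(2)                # generator logits over two candidate responses
features = {0: np.array([1.0, 0.0, 1.0, 0.0]),   # a machine-sounding response
            1: np.array([0.0, 1.0, 0.0, 1.0])}   # a human-sounding response

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(200):
    # Generator samples a response from its current policy.
    p = np.exp(g_theta - g_theta.max())
    p /= p.sum()
    u = rng.choice(2, p=p)
    # The discriminator's probability that the sample is human-generated
    # is used directly as the REINFORCE reward for the generator.
    r = sigmoid(d_w @ features[u])
    grad_logp = -p
    grad_logp[u] += 1.0
    g_theta += 0.1 * r * grad_logp
    # Discriminator takes one logistic-regression step on a (real, fake) pair:
    # push the human response toward label 1 and the sample toward label 0.
    d_w += 0.05 * ((1.0 - sigmoid(d_w @ features[1])) * features[1]
                   - sigmoid(d_w @ features[u]) * features[u])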