Switch-based Active Deep Dyna-Q: Efficient Adaptive Planning for Task-Completion Dialogue Policy Learning
Training task-completion dialogue agents with reinforcement learning usually
requires a large number of real user experiences. The Dyna-Q algorithm extends
Q-learning by integrating a world model, and thus can effectively boost
training efficiency using simulated experiences generated by the world model.
The effectiveness of Dyna-Q, however, depends on the quality of the world model,
or, implicitly, on the pre-specified ratio of real to simulated experiences used
for Q-learning. To address this, we extend the recently proposed Deep Dyna-Q (DDQ)
framework by integrating a switcher that automatically determines whether to
use a real or simulated experience for Q-learning. Furthermore, we explore the
use of active learning for improving sample efficiency, by encouraging the
world model to generate simulated experiences in the state-action space where
the agent has not (fully) explored. Our results show that by combining the
switcher and active learning, the new framework, named Switch-based Active Deep
Dyna-Q (Switch-DDQ), leads to significant improvements over DDQ and Q-learning
baselines in both simulation and human evaluations.
Comment: 8 pages, 9 figures, AAAI 2019
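The switch-plus-planning loop is easy to picture on a toy problem. Below is a minimal, self-contained Python sketch of the idea on a small chain MDP: a switcher decides per step whether the Q-update consumes a real or a simulated transition, and planning is biased toward under-explored state-action pairs. The visit-count quality proxy, the inverse-count sampling, and all names (world_model, visit_count, q_update) are illustrative assumptions of ours, not the authors' Switch-DDQ implementation.

```python
import random
from collections import defaultdict

GOAL = 5                      # chain MDP with states 0..5; reward only at the goal
ALPHA, GAMMA, EPS = 0.5, 0.95, 0.2

Q = defaultdict(float)        # tabular Q-values keyed by (state, action)
world_model = {}              # learned model: (s, a) -> (s', r)
visit_count = defaultdict(int)

def real_step(s, a):
    """Ground-truth environment: action 1 moves right, action 0 moves left."""
    s2 = max(0, min(GOAL, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0)

def act(s):
    if random.random() < EPS or Q[(s, 0)] == Q[(s, 1)]:
        return random.choice([0, 1])
    return max([0, 1], key=lambda a: Q[(s, a)])

def q_update(s, a, r, s2):
    target = r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)])
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

for episode in range(300):
    s = 0
    while s != GOAL:
        a = act(s)
        s2, r = real_step(s, a)
        world_model[(s, a)] = (s2, r)       # refine the model on real data
        visit_count[(s, a)] += 1
        # Switcher: a crude visit-count proxy for world-model quality decides
        # whether this Q-update uses a real or a simulated experience.
        if visit_count[(s, a)] >= 3:
            # Active-learning bias: sample a planning pair inversely weighted
            # by visit count, favouring under-explored regions.
            keys = list(world_model)
            w = [1.0 / (1 + visit_count[k]) for k in keys]
            ps, pa = random.choices(keys, weights=w)[0]
            ps2, pr = world_model[(ps, pa)]
            q_update(ps, pa, pr, ps2)       # simulated (planning) update
        else:
            q_update(s, a, r, s2)           # real-experience update
        s = s2

print("greedy actions:",
      [max([0, 1], key=lambda a: Q[(s, a)]) for s in range(GOAL)])
```

In the full framework the switcher is itself a learned classifier over experience quality rather than a visit-count threshold; the threshold here only stands in for that decision.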
BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems
We present a new algorithm that significantly improves the efficiency of
exploration for deep Q-learning agents in dialogue systems. Our agents explore
via Thompson sampling, drawing Monte Carlo samples from a Bayes-by-Backprop
neural network. Our algorithm learns much faster than common exploration
strategies such as ε-greedy, Boltzmann, bootstrapping, and
intrinsic-reward-based ones. Additionally, we show that spiking the replay
buffer with experiences from just a few successful episodes can make Q-learning
feasible when it might otherwise fail.
Comment: 13 pages, 9 figures
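As a rough illustration of the exploration mechanism described above, here is a minimal Python sketch of Thompson sampling from a Bayes-by-Backprop-style posterior over a linear Q-head: one Monte Carlo weight sample is drawn per decision, and the agent acts greedily under that sample. The linear head, the dimensions, and the initialisation are our own simplifying assumptions, not the BBQN architecture or reference code.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 8, 4   # toy sizes, chosen only for illustration

# Variational parameters of a linear Q-head: W ~ N(mu, softplus(rho)^2).
mu  = rng.normal(0.0, 0.1, size=(STATE_DIM, N_ACTIONS))
rho = np.full((STATE_DIM, N_ACTIONS), -3.0)   # small initial uncertainty

def sample_weights():
    sigma = np.log1p(np.exp(rho))             # softplus keeps sigma > 0
    return mu + sigma * rng.standard_normal(mu.shape)

def thompson_act(state):
    # One Monte Carlo sample of the Q-function per decision, greedy under it;
    # posterior uncertainty in (mu, rho) is what drives the exploration.
    W = sample_weights()
    return int(np.argmax(state @ W))

state = rng.standard_normal(STATE_DIM)
print("sampled actions:", [thompson_act(state) for _ in range(10)])
```

In training, mu and rho would be updated by backpropagation on the variational objective, and "spiking" the replay buffer simply means pre-loading it with transitions from a few successful episodes before regular learning begins.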
Goal-oriented Dialogue Policy Learning from Failures
Reinforcement learning methods have been used for learning dialogue policies.
However, learning an effective dialogue policy frequently requires
prohibitively many conversations. This is partly because of the sparse rewards
in dialogues and the very few successful dialogues in the early learning phase.
Hindsight experience replay (HER) enables learning from failures, but the
vanilla HER is inapplicable to dialogue learning because dialogue goals are implicit. In
this work, we develop two complex HER methods providing different trade-offs
between complexity and performance and, for the first time, enable HER-based
dialogue policy learning. Experiments using a realistic user simulator show
that our HER methods perform better than existing experience replay methods (as
applied to deep Q-networks) in learning rate.
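For readers unfamiliar with HER, the Python sketch below shows the vanilla relabeling step that the paper builds on, applied to a toy goal-reaching task: a failed trajectory is stored a second time with a state it actually reached substituted as the goal, turning a zero-reward failure into useful training signal. The transition format and buffer are our illustrative assumptions; the paper's dialogue-specific HER variants, which must handle implicit goals, are not reproduced here.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=10_000)

def store_with_hindsight(trajectory, goal):
    """trajectory: list of (state, action, next_state) tuples."""
    # Original (failed) episode: sparse reward w.r.t. the intended goal.
    for s, a, s2 in trajectory:
        replay_buffer.append((s, a, float(s2 == goal), s2, goal))
    # Hindsight episode: pretend the final achieved state was the goal,
    # so the last transition now carries a success reward.
    achieved = trajectory[-1][2]
    for s, a, s2 in trajectory:
        replay_buffer.append((s, a, float(s2 == achieved), s2, achieved))

# Toy usage: a short walk that failed to reach the intended goal state 9.
traj = [(i, random.choice([0, 1]), i + 1) for i in range(4)]
store_with_hindsight(traj, goal=9)
print(len(replay_buffer), "transitions stored (half relabeled in hindsight)")
```

The dialogue setting breaks the step marked "pretend the final achieved state was the goal", since a conversation's goal is not directly observable from its final state; the paper's two HER variants are different ways of recovering a usable hindsight goal.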