On the Effectiveness of Offline RL for Dialogue Response Generation
A common training technique for language models is teacher forcing (TF). TF
attempts to match human language exactly, even though identical meanings can be
expressed in different ways. This motivates the use of sequence-level objectives
for dialogue response generation. In this paper, we study the efficacy of
various offline reinforcement learning (RL) methods to maximize such
objectives. We present a comprehensive evaluation across multiple datasets,
models, and metrics. Offline RL shows a clear performance improvement over
teacher forcing while not inducing training instability or sacrificing
practical training budgets.
Comment: Accepted at ICML 2023. 18 pages, 12 figures. Code available at
https://github.com/asappresearch/dialogue-offline-r
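The contrast this abstract draws, teacher forcing penalizing any deviation from the exact human wording while a sequence-level objective can reward a whole response, can be sketched as follows. This is a minimal illustration, not the paper's method; the probabilities, responses, and the exact-match reward are assumptions for the example.

```python
import math

def teacher_forcing_nll(token_probs):
    """Teacher forcing objective: negative log-likelihood of the reference
    tokens, so every token must match the human phrasing exactly."""
    return -sum(math.log(p) for p in token_probs)

def sequence_level_reward(candidate, acceptable):
    """A sequence-level objective scores the whole response, so any
    acceptable paraphrase earns full credit (toy exact-match reward)."""
    return 1.0 if candidate in acceptable else 0.0

# Two phrasings with identical meaning:
acceptable = {"sure, i can help", "of course, happy to help"}

# The paraphrase gets full sequence-level reward even though, token by
# token, it diverges from the first reference immediately.
reward = sequence_level_reward("of course, happy to help", acceptable)  # 1.0

# Teacher forcing instead sums per-token penalties against one reference.
nll = teacher_forcing_nll([0.9, 0.8, 0.95])
```

An offline RL method can maximize a reward like the one above over logged dialogues, which is the kind of sequence-level objective the paper evaluates.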
Causal-aware Safe Policy Improvement for Task-oriented dialogue
The recent success of reinforcement learning (RL) in solving complex tasks
is most often attributed to its capacity to explore and exploit an environment
in which it has been trained. Sample efficiency is usually not an issue, since
cheap simulators are available to sample data on-policy. On the other hand,
task-oriented dialogues are usually learned from offline data collected using
human demonstrations. Collecting diverse demonstrations and annotating them is
expensive. Unfortunately, RL methods trained on off-policy data are
prone to issues of bias and generalization, which are further exacerbated by
stochasticity in human responses and the non-Markovian belief state of a dialogue
management system. To this end, we propose a batch RL framework for
task-oriented dialogue policy learning: causal-aware safe policy improvement
(CASPI). This method provides guarantees on the dialogue policy's performance and also
learns to shape rewards according to intentions behind human responses, rather
than just mimicking demonstration data; this, coupled with batch RL, improves
the overall sample efficiency of the framework. We demonstrate the effectiveness of
this framework on the dialogue-context-to-text generation and end-to-end dialogue
tasks of the MultiWOZ 2.0 dataset. The proposed method outperforms the current
state of the art on these metrics in both cases. In the end-to-end case, our
method, trained on only 10\% of the data, outperformed the current state of the
art on three out of four evaluation metrics.
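The safe-policy-improvement idea the abstract builds on can be sketched generically: estimate a candidate policy's value from logged off-policy data via importance sampling, and accept it only if a conservative estimate beats the behavior policy. This is an illustrative sketch of that generic idea under toy assumptions, not CASPI's actual algorithm; the logged data, policies, and fixed slack term are all invented for the example.

```python
def is_estimates(logged, pi_new):
    """Per-episode importance-sampling estimates of pi_new's return from
    logged data: each episode is (list of (action, behavior_prob), return)."""
    estimates = []
    for steps, ret in logged:
        weight = 1.0
        for action, behavior_prob in steps:
            weight *= pi_new.get(action, 0.0) / behavior_prob
        estimates.append(weight * ret)
    return estimates

def safe_improvement(logged, pi_new, behavior_value, slack=0.1):
    """Accept the candidate policy only if a conservative (slack-adjusted)
    estimate of its value exceeds the behavior policy's value."""
    estimates = is_estimates(logged, pi_new)
    mean = sum(estimates) / len(estimates)
    return mean - slack > behavior_value, mean

# Toy logged dialogues: behavior policy chose "confirm" or "ask" with prob 0.5;
# "confirm" was followed by success (return 1.0), "ask" by failure (0.0).
logged = [([("confirm", 0.5)], 1.0), ([("ask", 0.5)], 0.0),
          ([("confirm", 0.5)], 1.0), ([("ask", 0.5)], 0.0)]
pi_new = {"confirm": 0.9, "ask": 0.1}  # candidate favors the rewarded action
accept, estimate = safe_improvement(logged, pi_new, behavior_value=0.5)
```

A real safe-policy-improvement method replaces the fixed slack with a high-confidence lower bound; CASPI additionally learns a shaped reward from the intentions behind human responses rather than using a hand-specified return.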