On the Effectiveness of Offline RL for Dialogue Response Generation
A common training technique for language models is teacher forcing (TF). TF
attempts to match human language exactly, even though identical meanings can be
expressed in different ways. This motivates the use of sequence-level objectives
for dialogue response generation. In this paper, we study the efficacy of
various offline reinforcement learning (RL) methods to maximize such
objectives. We present a comprehensive evaluation across multiple datasets,
models, and metrics. Offline RL shows a clear performance improvement over
teacher forcing while not inducing training instability or sacrificing
practical training budgets.
Comment: Accepted at ICML 2023. 18 pages, 12 figures. Code available at
https://github.com/asappresearch/dialogue-offline-r
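The contrast this abstract draws, teacher forcing penalizing any deviation from the exact human wording while a sequence-level objective can reward a whole response, can be sketched as follows. This is a minimal illustration, not the paper's method; the probabilities, responses, and the exact-match reward are assumptions for the example.

```python
import math

def teacher_forcing_nll(token_probs):
    """Teacher forcing objective: negative log-likelihood of the reference
    tokens, so every token must match the human phrasing exactly."""
    return -sum(math.log(p) for p in token_probs)

def sequence_level_reward(candidate, acceptable):
    """A sequence-level objective scores the whole response, so any
    acceptable paraphrase earns full credit (toy exact-match reward)."""
    return 1.0 if candidate in acceptable else 0.0

# Two phrasings with identical meaning:
acceptable = {"sure, i can help", "of course, happy to help"}

# The paraphrase gets full sequence-level reward even though, token by
# token, it diverges from the first reference immediately.
reward = sequence_level_reward("of course, happy to help", acceptable)  # 1.0

# Teacher forcing instead sums per-token penalties against one reference.
nll = teacher_forcing_nll([0.9, 0.8, 0.95])
```

An offline RL method can maximize a reward like the one above over logged dialogues, which is the kind of sequence-level objective the paper evaluates.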
Causal-aware Safe Policy Improvement for Task-oriented dialogue
The recent success of reinforcement learning (RL) in solving complex tasks
is most often attributed to its capacity to explore and exploit an environment
in which it has been trained. Sample efficiency is usually not an issue, since
cheap simulators are available to sample data on-policy. On the other hand,
task-oriented dialogues are usually learned from offline data collected using
human demonstrations. Collecting diverse demonstrations and annotating them is
expensive. Unfortunately, RL methods trained on off-policy data are
prone to issues of bias and generalization, which are further exacerbated by
stochasticity in human responses and the non-Markovian belief state of a dialogue
management system. To this end, we propose a batch RL framework for
task-oriented dialogue policy learning: causal-aware safe policy improvement
(CASPI). This method provides guarantees on the dialogue policy's performance and also
learns to shape rewards according to intentions behind human responses, rather
than just mimicking demonstration data; this, coupled with batch RL, improves
the overall sample efficiency of the framework. We demonstrate the effectiveness of
this framework on the dialogue-context-to-text generation and end-to-end dialogue
tasks of the MultiWOZ 2.0 dataset. The proposed method outperforms the current
state of the art on these metrics in both cases. In the end-to-end case, our
method, trained on only 10\% of the data, outperformed the current state of the
art on three out of four evaluation metrics.
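The safe-policy-improvement idea the abstract builds on can be sketched generically: estimate a candidate policy's value from logged off-policy data via importance sampling, and accept it only if a conservative estimate beats the behavior policy. This is an illustrative sketch of that generic idea under toy assumptions, not CASPI's actual algorithm; the logged data, policies, and fixed slack term are all invented for the example.

```python
def is_estimates(logged, pi_new):
    """Per-episode importance-sampling estimates of pi_new's return from
    logged data: each episode is (list of (action, behavior_prob), return)."""
    estimates = []
    for steps, ret in logged:
        weight = 1.0
        for action, behavior_prob in steps:
            weight *= pi_new.get(action, 0.0) / behavior_prob
        estimates.append(weight * ret)
    return estimates

def safe_improvement(logged, pi_new, behavior_value, slack=0.1):
    """Accept the candidate policy only if a conservative (slack-adjusted)
    estimate of its value exceeds the behavior policy's value."""
    estimates = is_estimates(logged, pi_new)
    mean = sum(estimates) / len(estimates)
    return mean - slack > behavior_value, mean

# Toy logged dialogues: behavior policy chose "confirm" or "ask" with prob 0.5;
# "confirm" was followed by success (return 1.0), "ask" by failure (0.0).
logged = [([("confirm", 0.5)], 1.0), ([("ask", 0.5)], 0.0),
          ([("confirm", 0.5)], 1.0), ([("ask", 0.5)], 0.0)]
pi_new = {"confirm": 0.9, "ask": 0.1}  # candidate favors the rewarded action
accept, estimate = safe_improvement(logged, pi_new, behavior_value=0.5)
```

A real safe-policy-improvement method replaces the fixed slack with a high-confidence lower bound; CASPI additionally learns a shaped reward from the intentions behind human responses rather than using a hand-specified return.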