Approximating Human Evaluation of Social Chatbots with Prompting
Once powerful conversational models became available to a wide
audience, users started actively engaging in social interactions with this
technology. Such unprecedented interaction experiences may pose considerable
social and psychological risks to the users unless the technology is properly
controlled. This creates an urgent need for scalable and robust evaluation
metrics for conversational chatbots. Existing automatic evaluation metrics
usually focus on objective quality measures and disregard subjective
perceptions of social dimensions. Moreover, most of these approaches operate on
pre-produced dialogs from available benchmark corpora, which implies human
involvement in preparing the evaluation material and thus impedes the
scalability of the metrics. To address this limitation, we propose to use
emerging large language models (LLMs) from the GPT family and describe a new
framework for conducting dialog system evaluation with prompting. With this
framework, we are able to fully automate the evaluation pipeline and reach a
strong correlation with human judgement (up to Pearson r=0.95 at the system
level). The underlying concept is to collect synthetic chat logs of the
evaluated bots with an LLM in the other-play setting, where the LLM is
carefully conditioned to follow a specific scenario. We further explore
different prompting approaches to produce evaluation scores with the same LLM.
The best-performing prompts, containing few-shot demonstrations and
instructions, show outstanding performance on the tested dataset and
demonstrate the ability to generalize to other dialog corpora.
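As a rough illustration of what a system-level correlation such as the Pearson r reported in this abstract measures, the following Python sketch averages per-dialog scores for each bot and then correlates the human averages with the automatic ones. The function names, the dictionary layout, and the sample scores are hypothetical and are not taken from the paper; this is a minimal sketch under those assumptions, not the authors' evaluation code.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def system_level_scores(dialog_scores):
    """Average the per-dialog scores of each system (bot)."""
    return {name: mean(scores) for name, scores in dialog_scores.items()}


def system_level_correlation(human, automatic):
    """Correlate per-system averages of human and automatic scores.

    Both arguments map a system name to a list of per-dialog scores
    (an assumed layout). Only systems present in both are compared.
    """
    systems = sorted(human.keys() & automatic.keys())
    h = system_level_scores(human)
    a = system_level_scores(automatic)
    return pearson_r([h[s] for s in systems], [a[s] for s in systems])


# Hypothetical scores for three bots, two dialogs each.
human = {"bot_a": [1.0, 2.0], "bot_b": [3.0, 4.0], "bot_c": [5.0, 6.0]}
automatic = {"bot_a": [2.0, 2.0], "bot_b": [4.0, 4.0], "bot_c": [6.0, 6.0]}
r = system_level_correlation(human, automatic)
```

Averaging before correlating is what distinguishes a system-level figure from a dialog-level one: noise in individual dialog ratings is smoothed out, so system-level correlations tend to be higher than correlations computed over individual dialogs.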
Hierarchical Reinforcement Learning for Open-Domain Dialog
Open-domain dialog generation is a challenging problem: maximum likelihood
training can lead to repetitive outputs, models have difficulty tracking
long-term conversational goals, and training on standard movie or online
datasets may lead to the generation of inappropriate, biased, or offensive
text. Reinforcement Learning (RL) is a powerful framework that could
potentially address these issues, for example by allowing a dialog model to
optimize for reducing toxicity and repetitiveness. However, previous approaches
which apply RL to open-domain dialog generation do so at the word level, making
it difficult for the model to learn proper credit assignment for long-term
conversational rewards. In this paper, we propose a novel approach to
hierarchical reinforcement learning, VHRL, which uses policy gradients to tune
the utterance-level embedding of a variational sequence model. This
hierarchical approach provides greater flexibility for learning long-term,
conversational rewards. We use self-play and RL to optimize for a set of
human-centered conversation metrics, and show that our approach provides
significant improvements -- in terms of both human evaluation and automatic
metrics -- over state-of-the-art dialog models, including Transformers.
A Neural Network Approach to Context-Sensitive Generation of Conversational Responses
We present a novel response generation system that can be trained end to end
on large quantities of unstructured Twitter conversations. A neural network
architecture is used to address sparsity issues that arise when integrating
contextual information into classic statistical models, allowing the system to
take into account previous dialog utterances. Our dynamic-context generative
models show consistent gains over both context-sensitive and
non-context-sensitive Machine Translation and Information Retrieval baselines.
Comment: A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell,
J.-Y. Nie, J. Gao, B. Dolan. 2015. A Neural Network Approach to
Context-Sensitive Generation of Conversational Responses. In Proc. of
NAACL-HLT. Pages 196-20