43 research outputs found
Training Millions of Personalized Dialogue Agents
Current dialogue systems are not very engaging for users, especially when
trained end-to-end without relying on proactive reengaging scripted strategies.
Zhang et al. (2018) showed that the engagement level of end-to-end dialogue
models increases when conditioning them on text personas providing some
personalized back-story to the model. However, the dataset used in Zhang et al.
(2018) is synthetic and of limited size as it contains around 1k different
personas. In this paper we introduce a new dataset providing 5 million personas
and 700 million persona-based dialogues. Our experiments show that, at this
scale, training using personas still improves the performance of end-to-end
systems. In addition, we show that other tasks benefit from the wide coverage
of our dataset by fine-tuning our model on the data from Zhang et al. (2018)
and achieving state-of-the-art results.
Comment: EMNLP 2018
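The conditioning idea described above can be sketched as simple input construction: persona sentences are prepended to the dialogue history so the encoder sees the back-story as extra context. The separator tokens below are illustrative assumptions, not the paper's exact scheme.

```python
def build_model_input(persona, history, persona_sep=" [P] ", history_sep=" [H] "):
    # Prepend the persona back-story to the dialogue history so a
    # downstream encoder can condition on both; separators are made up
    # for illustration.
    return persona_sep.join(persona) + history_sep + history_sep.join(history)

persona = ["i love hiking.", "i have two dogs."]
history = ["hi, any plans this weekend?"]
inp = build_model_input(persona, history)
```

A retrieval or generative model would then encode `inp` as a single sequence, which is how persona conditioning stays architecture-agnostic.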
Image Chat: Engaging Grounded Conversations
To achieve the long-term goal of machines being able to engage humans in
conversation, our models should captivate the interest of their speaking
partners. Communication grounded in images, whereby a dialogue is conducted
based on a given photo, is a setup naturally appealing to humans (Hu et al.,
2014). In this work we study large-scale architectures and datasets for this
goal. We test a set of neural architectures using state-of-the-art image and
text representations, considering various ways to fuse the components. To test
such models, we collect a dataset of grounded human-human conversations, where
speakers are asked to play roles given a provided emotional mood or style, as
the use of such traits is also a key factor in engagingness (Guo et al., 2019).
Our dataset, Image-Chat, consists of 202k dialogues over 202k images using 215
possible style traits. Automatic metrics and human evaluations of engagingness
show the efficacy of our approach; in particular, we obtain state-of-the-art
performance on the existing IGC task, and our best performing model is almost
on par with humans on the Image-Chat test set (preferred 47.7% of the time).
Comment: ACL 2020
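The fusion step the abstract mentions, combining image and text representations before scoring candidate responses, can be sketched with pre-computed feature vectors. Concatenate-then-project is just one of the fusion variants such papers compare; all dimensions and the random initialization here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_candidates(img_feat, text_feat, proj, cand_feats):
    # Concatenate image and dialogue/style features, project them into the
    # response-embedding space, then dot-product score each candidate.
    fused = np.concatenate([img_feat, text_feat]) @ proj   # (d,)
    return cand_feats @ fused                              # (n_candidates,)

d = 8
img_feat = rng.standard_normal(d)        # e.g. from an image encoder
text_feat = rng.standard_normal(d)       # e.g. from a text/style encoder
proj = rng.standard_normal((2 * d, d))   # learned fusion projection
cand_feats = rng.standard_normal((5, d)) # candidate response embeddings

scores = score_candidates(img_feat, text_feat, proj, cand_feats)
best = int(np.argmax(scores))
```

At inference the candidate embeddings can be pre-computed, which is what makes this retrieval setup fast.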
Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
The use of deep pre-trained bidirectional transformers has led to remarkable
progress in a number of applications (Devlin et al., 2018). For tasks that make
pairwise comparisons between sequences, matching a given input with a
corresponding label, two approaches are common: Cross-encoders performing full
self-attention over the pair and Bi-encoders encoding the pair separately. The
former often performs better, but is too slow for practical use. In this work,
we develop a new transformer architecture, the Poly-encoder, that learns global
rather than token level self-attention features. We perform a detailed
comparison of all three approaches, including what pre-training and fine-tuning
strategies work best. We show our models achieve state-of-the-art results on
three existing tasks; that Poly-encoders are faster than Cross-encoders and
more accurate than Bi-encoders; and that the best results are obtained by
pre-training on large datasets similar to the downstream tasks.
Comment: ICLR 2020
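The "global rather than token level" idea can be sketched in a few lines: a small set of learned code vectors attends over the context token embeddings to produce a handful of global features, and the candidate embedding then attends over those features before the final dot product. Shapes and initializations below are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def poly_encoder_score(ctx_tokens, codes, cand_emb):
    # 1. m learned "codes" attend over context token embeddings,
    #    producing m global context features instead of T per-token ones.
    attn = softmax(codes @ ctx_tokens.T, axis=-1)   # (m, T)
    global_feats = attn @ ctx_tokens                # (m, d)
    # 2. The candidate embedding attends over the m global features.
    w = softmax(cand_emb @ global_feats.T)          # (m,)
    ctx_vec = w @ global_feats                      # (d,)
    # 3. Final score is a dot product, as in a Bi-encoder.
    return float(ctx_vec @ cand_emb)

rng = np.random.default_rng(0)
T, d, m = 6, 8, 3                    # tokens, dim, number of codes
ctx_tokens = rng.standard_normal((T, d))
codes = rng.standard_normal((m, d))
cand_emb = rng.standard_normal(d)
s = poly_encoder_score(ctx_tokens, codes, cand_emb)
```

Because only step 2 depends on the candidate, candidate embeddings can still be cached offline, which is the source of the speed advantage over Cross-encoders.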
Filtering before Iteratively Referring for Knowledge-Grounded Response Selection in Retrieval-Based Chatbots
The challenges of building knowledge-grounded retrieval-based chatbots lie in
how to ground a conversation on its background knowledge and how to match
response candidates with both context and knowledge simultaneously. This paper
proposes a method named Filtering before Iteratively REferring (FIRE) for this
task. In this method, a context filter and a knowledge filter are first built,
which derive knowledge-aware context representations and context-aware
knowledge representations respectively by global and bidirectional attention.
In addition, entries irrelevant to the conversation are discarded by the
knowledge filter. Iterative referring is then performed between
context and response representations as well as between knowledge and response
representations, in order to collect deep matching features for scoring
response candidates. Experimental results show that FIRE outperforms previous
methods by margins larger than 2.8% and 4.1% on the PERSONA-CHAT dataset with
original and revised personas respectively, and margins larger than 3.1% on the
CMU_DoG dataset in terms of top-1 accuracy. We also show that FIRE is more
interpretable by visualizing the knowledge grounding process.
Comment: Accepted by Findings of EMNLP 2020
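The filtering stage, assigning each knowledge entry a relevance weight against the conversation and dropping low-weight entries before matching, can be sketched as follows. The mean-pooling of the context and the fixed threshold are illustrative simplifications, not FIRE's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def filter_knowledge(ctx_vecs, knowledge_vecs, threshold=0.15):
    # One relevance weight per knowledge entry, via attention of the
    # entries against a pooled context vector; entries below the threshold
    # are dropped before the (not shown) iterative referring stage.
    ctx = ctx_vecs.mean(axis=0)
    rel = softmax(knowledge_vecs @ ctx)
    keep = rel >= threshold
    return knowledge_vecs[keep], rel

rng = np.random.default_rng(1)
ctx_vecs = rng.standard_normal((4, 6))        # context utterance vectors
knowledge_vecs = rng.standard_normal((10, 6)) # knowledge entry vectors
kept, rel = filter_knowledge(ctx_vecs, knowledge_vecs)
```

Filtering first shrinks the set of entries that the more expensive referring stage has to compare against each response candidate.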
Personalized Query Rewriting in Conversational AI Agents
Spoken language understanding (SLU) systems in conversational AI agents often
experience errors in the form of misrecognitions by automatic speech
recognition (ASR) or semantic gaps in natural language understanding (NLU).
These errors easily translate to user frustration, particularly in
recurrent events, e.g., regularly toggling an appliance or calling a frequent
contact. In this work, we propose a query rewriting approach that leverages
users' historically successful interactions as a form of memory. We present a
neural retrieval model and a pointer-generator network with hierarchical
attention and show that they perform significantly better at the query
rewriting task with the aforementioned user memories than without. We also
highlight how our approach with the proposed models leverages the structural
and semantic diversity in ASR's output towards recovering users' intents.
Comment: 5 pages, 3 figures
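The retrieval side of this idea, matching a possibly misrecognized query against the user's memory of historically successful queries, can be sketched with a trivial lexical similarity. The paper uses a learned neural retrieval model; the Jaccard measure and threshold here are stand-in assumptions.

```python
def jaccard(a, b):
    # Token-set Jaccard similarity; a crude stand-in for a learned
    # neural similarity function.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def rewrite_from_memory(asr_query, memory, min_sim=0.3):
    # Retrieve the user's most similar historically successful query;
    # fall back to the raw ASR output when nothing is close enough.
    best = max(memory, key=lambda m: jaccard(asr_query, m))
    return best if jaccard(asr_query, best) >= min_sim else asr_query

memory = ["turn on the kitchen light", "call mom", "play jazz radio"]
out = rewrite_from_memory("turn on the kitten light", memory)
```

Here the misrecognition "kitten light" is repaired by retrieving the user's past successful query "turn on the kitchen light".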
Enriching Conversation Context in Retrieval-based Chatbots
Work on retrieval-based chatbots, like most sequence pair matching tasks, can
be divided into Cross-encoders that perform word matching over the pair, and
Bi-encoders that encode the pair separately. The former has better performance;
however, since candidate responses cannot be encoded offline, it is also much
slower. Lately, multi-layer transformer architectures pre-trained as language
models have been used to great effect on a variety of natural language
processing and information retrieval tasks. Recent work has shown that these
language models can be used in text-matching scenarios to create Bi-encoders
that perform almost as well as Cross-encoders while having a much faster
inference speed. In this paper, we expand upon this work by developing a
sequence matching architecture that utilizes the entire training set as a
makeshift knowledge-base during inference. We perform detailed experiments
demonstrating that this architecture can be used to further improve Bi-encoder
performance while still maintaining a relatively high inference speed.
Comment: 8 pages, 1 figure, 3 tables
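The "training set as a makeshift knowledge-base" idea can be sketched as retrieval-augmented scoring: find the nearest training context to the current one and mix its pre-computed response embedding into the context vector before scoring candidates. The mixing rule and `alpha` below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def enriched_score(ctx_vec, cand_vecs, train_ctx, train_resp, alpha=0.5):
    # Retrieve the nearest training context by dot product, then blend
    # its (pre-computed) response embedding into the current context
    # vector before scoring the candidates.
    nn = int(np.argmax(train_ctx @ ctx_vec))
    enriched = (1 - alpha) * ctx_vec + alpha * train_resp[nn]
    return cand_vecs @ enriched

rng = np.random.default_rng(2)
d = 6
ctx_vec = rng.standard_normal(d)
cand_vecs = rng.standard_normal((4, d))    # candidate responses
train_ctx = rng.standard_normal((20, d))   # cached training contexts
train_resp = rng.standard_normal((20, d))  # cached training responses
scores = enriched_score(ctx_vec, cand_vecs, train_ctx, train_resp)
```

Since all training-set vectors are cached offline, the extra cost at inference is a single nearest-neighbor lookup, preserving the Bi-encoder's speed advantage.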
The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents
We introduce dodecaDialogue: a set of 12 tasks that measures if a
conversational agent can communicate engagingly with personality and empathy,
ask questions, answer questions by utilizing knowledge resources, discuss
topics and situations, and perceive and converse about images. By multi-tasking
on such a broad large-scale set of data, we hope to both move towards and
measure progress in producing a single unified agent that can perceive, reason
and converse with humans in an open-domain setting. We show that such
multi-tasking improves over a BERT pre-trained baseline, largely due to
multi-tasking with very large dialogue datasets in a similar domain, and that
the multi-tasking in general provides gains to both text and image-based tasks
using several metrics in both the fine-tune and task transfer settings. We
obtain state-of-the-art results on many of the tasks, providing a strong
baseline for this challenge.
Comment: ACL 2020
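Multi-tasking over 12 datasets requires deciding how often to draw a batch from each task. One common scheme, sampling in proportion to dataset size, can be sketched as below; the dataset names match the tasks mentioned in these abstracts, but the sizes are illustrative, and the paper also compares other weightings.

```python
import random

def sample_task(datasets, rng):
    # Draw a task with probability proportional to its dataset size --
    # one common multi-task weighting among several the paper explores.
    names = sorted(datasets)
    sizes = [len(datasets[n]) for n in names]
    return rng.choices(names, weights=sizes, k=1)[0]

# Illustrative sizes only, not the real dataset statistics.
datasets = {
    "persona_chat": range(131_000),
    "wizard_of_wikipedia": range(74_000),
    "image_chat": range(202_000),
}
rng = random.Random(0)
draws = [sample_task(datasets, rng) for _ in range(1000)]
```

Size-proportional sampling keeps very large dialogue datasets dominant during training, which the abstract identifies as the main driver of the multi-tasking gains.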
Wizard of Wikipedia: Knowledge-Powered Conversational Agents
In open-domain dialogue, intelligent agents should exhibit the use of
knowledge; however, there are few convincing demonstrations of this to date. The
most popular sequence to sequence models typically "generate and hope" generic
utterances that can be memorized in the weights of the model when mapping from
input utterance(s) to output, rather than employing recalled knowledge as
context. Use of knowledge has so far proved difficult, in part because of the
lack of a supervised learning benchmark task which exhibits knowledgeable open
dialogue with clear grounding. To that end we collect and release a large
dataset with conversations directly grounded with knowledge retrieved from
Wikipedia. We then design architectures capable of retrieving knowledge,
reading and conditioning on it, and finally generating natural responses. Our
best performing dialogue models are able to conduct knowledgeable discussions
on open-domain topics as evaluated by automatic metrics and human evaluations,
while our new benchmark allows for measuring further improvements in this
important research direction.
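The knowledge-selection step, choosing which retrieved Wikipedia sentence to condition the response on, can be sketched as a dot product between the dialogue context and candidate knowledge embeddings. The toy vectors below are hand-crafted for illustration; the generation step that conditions on the chosen sentence is out of scope here.

```python
import numpy as np

def select_knowledge(ctx_vec, knowledge_vecs, sentences):
    # Pick the retrieved knowledge sentence whose embedding best matches
    # the dialogue context; a generator would then read and condition on
    # it when producing the response.
    idx = int(np.argmax(knowledge_vecs @ ctx_vec))
    return sentences[idx]

sentences = [
    "Cats are small carnivorous mammals.",
    "The guitar is a fretted musical instrument.",
]
knowledge_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy embeddings
ctx_vec = np.array([0.2, 0.9])                       # context about music
chosen = select_knowledge(ctx_vec, knowledge_vecs, sentences)
```

Grounding responses in an explicitly selected sentence is what makes the knowledge use supervisable and visualizable, in contrast to "generate and hope" models.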
Distilling Knowledge for Fast Retrieval-based Chat-bots
Response retrieval is a subset of neural ranking in which a model selects a
suitable response from a set of candidates given a conversation history.
Retrieval-based chat-bots are typically employed in information seeking
conversational systems such as customer support agents. In order to make
pairwise comparisons between a conversation history and a candidate response,
two approaches are common: cross-encoders performing full self-attention over
the pair and bi-encoders encoding the pair separately. The former gives better
prediction quality but is too slow for practical use. In this paper, we propose
a new cross-encoder architecture and transfer knowledge from this model to a
bi-encoder model using distillation. This effectively boosts bi-encoder
performance at no cost during inference time. We perform a detailed analysis of
this approach on three response retrieval datasets.
Comment: Accepted for publication in the 43rd International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR '20)
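The distillation step, transferring the cross-encoder teacher's scores over candidate responses to the bi-encoder student, can be sketched as a KL-divergence loss over softened candidate distributions. The temperature value is a common distillation choice, not necessarily the paper's exact setting.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the cross-encoder teacher's softened
    # distribution over candidates and the bi-encoder student's.
    p = softmax(teacher_logits / temperature)
    q = softmax(student_logits / temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 1.0, 0.5, 0.2])  # cross-encoder candidate scores
student = np.array([2.0, 2.0, 1.0, 0.5])  # bi-encoder candidate scores
loss = distill_loss(student, teacher)
```

The teacher is only needed at training time, so the student bi-encoder gains accuracy at no additional inference cost, as the abstract notes.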
Time to Take Emoji Seriously: They Vastly Improve Casual Conversational Models
Graphical emoji are ubiquitous in modern-day online conversations: a single
thumbs-up emoji alone can signify agreement, without any words. We
argue that the current state-of-the-art systems are ill-equipped to correctly
interpret these emoji, especially in a conversational context. However, in a
casual context, the benefits might be high: a better understanding of users'
utterances and more natural, emoji-rich responses.
With this in mind, we modify BERT to fully support emoji, both from the
Unicode Standard and custom emoji. This modified BERT is then trained on a
corpus of question-answer (QA) tuples with a high number of emoji, where we
are able to increase the 1-of-100 accuracy from 12.7% for the current
state-of-the-art to 17.8% for our model with emoji support.
Comment: Accepted at Benelearn 2019
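The vocabulary side of "fully supporting emoji" can be sketched generically: add emoji (Unicode or custom) as first-class tokens and grow the embedding matrix with freshly initialized rows. The paper's actual BERT modification is more involved; everything below is an illustrative sketch.

```python
import numpy as np

def extend_vocab(vocab, emb, new_tokens, rng=None):
    # Register each new emoji token and append a small random embedding
    # row for it; existing rows are kept untouched.
    rng = rng or np.random.default_rng(0)
    vocab = dict(vocab)
    rows = [emb]
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            rows.append(rng.standard_normal((1, emb.shape[1])) * 0.02)
    return vocab, np.concatenate(rows, axis=0)

vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2}
emb = np.zeros((3, 8))  # toy 8-dimensional embedding table
vocab2, emb2 = extend_vocab(vocab, emb, ["👍", "🙂"])
```

Without such an extension, a subword tokenizer typically maps emoji to `[UNK]` or byte fragments, which is one reason standard models misread emoji-heavy conversations.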