Learning to Speak and Act in a Fantasy Text Adventure Game
We introduce a large-scale crowdsourced text adventure game as a research
platform for studying grounded dialogue. In it, agents can perceive, emote, and
act whilst conducting dialogue with other agents. Models and humans can both
act as characters within the game. We describe the results of training
state-of-the-art generative and retrieval models in this setting. We show that
in addition to using past dialogue, these models are able to effectively use
the state of the underlying world to condition their predictions. In
particular, we show that grounding on the details of the local environment,
including location descriptions, the objects (and their affordances), and the
characters (and their previous actions) present within it, allows for better
predictions of agent behavior and dialogue. We analyze the ingredients
necessary for successful grounding in this setting, and how each of these
factors relates to agents that can talk and act successfully.
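As an illustration of what such conditioning can look like in practice, here is a minimal Python sketch that flattens a game state (location description, objects with their affordances, characters, past actions, and dialogue history) into a single conditioning string. The GameState structure and the field markers are assumptions made for illustration, not the paper's actual encoding.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GameState:
    location_name: str
    location_description: str
    objects: List[str]            # objects present, affordances folded into the text
    characters: List[str]         # other characters present in the location
    past_actions: List[str]       # actions observed so far in the episode
    dialogue_history: List[str]   # alternating utterances

def build_context(state: GameState, max_history: int = 8) -> str:
    """Flatten world state and dialogue history into one conditioning string."""
    parts = [
        f"_setting_name {state.location_name}",
        f"_setting_desc {state.location_description}",
        f"_objects {', '.join(state.objects)}",
        f"_characters {', '.join(state.characters)}",
    ]
    parts += [f"_action {a}" for a in state.past_actions[-max_history:]]
    parts += [f"_say {u}" for u in state.dialogue_history[-max_history:]]
    return "\n".join(parts)

state = GameState(
    location_name="Dungeon cell",
    location_description="A damp stone cell lit by a single torch.",
    objects=["torch (can be carried)", "iron key (can be picked up)"],
    characters=["guard"],
    past_actions=["guard drops the iron key"],
    dialogue_history=["Guard: Stay quiet in there."],
)
print(build_context(state))  # this string would condition a generative or retrieval model
```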
From Knowledge Augmentation to Multi-tasking: Towards Human-like Dialogue Systems
The goal of building dialogue agents that can converse with humans naturally
has been a long-standing dream of researchers since the early days of
artificial intelligence. The well-known Turing Test proposed to judge the
ultimate validity of an artificial intelligence agent on the
indistinguishability of its dialogues from humans'. It should come as no
surprise that human-level dialogue systems are very challenging to build. But,
while early efforts on rule-based systems found limited success, the emergence
of deep learning has enabled great advances on this topic.
In this thesis, we focus on methods that address the numerous issues
underlying the gap between artificial conversational agents and human-level
interlocutors. These methods were proposed and experimented with in ways
inspired by general state-of-the-art AI methodologies, but they also target
the characteristics specific to dialogue systems.
Comment: PhD thesis
Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues
Compared to traditional visual question answering, video-grounded dialogues
require additional reasoning over dialogue context to answer questions in a
multi-turn setting. Previous approaches to video-grounded dialogues mostly use
dialogue context as a simple text input without modelling the inherent
information flows at the turn level. In this paper, we propose a novel
framework of Reasoning Paths in Dialogue Context (PDC). The PDC model discovers
information flows among dialogue turns through a semantic graph constructed
from the lexical components of each question and answer. The model then learns
to predict reasoning paths over this semantic graph. Our path prediction model
predicts a path from the current turn through past dialogue turns that contain
additional visual cues to answer the current question. Our reasoning model
sequentially processes both visual and textual information through this
reasoning path and the propagated features are used to generate the answer. Our
experimental results demonstrate the effectiveness of our method and provide
additional insights on how models use semantic dependencies in a dialogue
context to retrieve visual cues.
Comment: Accepted at ICLR (International Conference on Learning Representations) 2021
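To make the idea concrete, the following toy Python sketch builds a graph over dialogue turns from lexical overlap and greedily walks a path from the current turn back through related past turns. The overlap heuristic and the greedy walk are illustrative stand-ins for the learned path-prediction model described in the abstract.

```python
from typing import Dict, List, Set

def tokenize(text: str) -> Set[str]:
    return {w.strip(".,?!").lower() for w in text.split()}

def build_turn_graph(turns: List[str], min_overlap: int = 1) -> Dict[int, List[int]]:
    """Add an edge from a turn to an earlier turn if their token sets overlap enough."""
    toks = [tokenize(t) for t in turns]
    graph: Dict[int, List[int]] = {i: [] for i in range(len(turns))}
    for i in range(len(turns)):
        for j in range(i):
            if len(toks[i] & toks[j]) >= min_overlap:
                graph[i].append(j)
    return graph

def greedy_reasoning_path(turns: List[str], graph: Dict[int, List[int]],
                          start: int, max_len: int = 3) -> List[int]:
    """Walk from the current turn toward past turns with the strongest overlap."""
    path, cur = [start], start
    for _ in range(max_len - 1):
        cands = [j for j in graph[cur] if j not in path]
        if not cands:
            break
        cur = max(cands, key=lambda j: len(tokenize(turns[cur]) & tokenize(turns[j])))
        path.append(cur)
    return path  # indices of turns whose visual cues would be attended to

turns = [
    "What is the man holding?",                  # turn 0
    "He is holding a red umbrella.",             # turn 1
    "Where does he go next?",                    # turn 2
    "Does he open the umbrella when it rains?",  # turn 3 (current)
]
graph = build_turn_graph(turns)
print(greedy_reasoning_path(turns, graph, start=3))  # e.g. [3, 1, 0]
```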
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
Video-grounded dialogue understanding is a challenging problem that requires
a machine to perceive, parse, and reason over situated semantics extracted from
weakly aligned video and dialogue. Most existing benchmarks treat both
modalities as if the task were frame-independent visual understanding,
neglecting the intrinsic attributes of multimodal dialogues, such as scene and
topic transitions. In this paper, we present Video-grounded Scene&Topic AwaRe
dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding
dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for
video-grounded dialogue understanding: scene segmentation and topic
segmentation, and one benchmark for video-grounded dialogue generation.
Comprehensive experiments are performed on these benchmarks to demonstrate the
importance of multimodal information and segments in video-grounded dialogue
understanding and generation.
Comment: To appear at ACL 2023
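As a rough illustration of the segmentation benchmarks, the Python sketch below casts scene or topic segmentation as boundary prediction between consecutive turns. The lexical similarity function and the threshold are trivial placeholders for the multimodal encoders a real model would use.

```python
from typing import List

def jaccard(a: str, b: str) -> float:
    """Lexical cohesion between two turns (placeholder similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def segment(turns: List[str], threshold: float = 0.1) -> List[int]:
    """Return indices i where a new segment starts at turns[i]."""
    boundaries = [0]
    for i in range(1, len(turns)):
        if jaccard(turns[i - 1], turns[i]) < threshold:  # low cohesion => boundary
            boundaries.append(i)
    return boundaries

turns = [
    "Did you see the game last night?",
    "Yes, the game went to overtime!",
    "By the way, are we still on for dinner?",
]
print(segment(turns))  # e.g. [0, 2]: the dinner turn opens a new topic segment
```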