Revisiting EmbodiedQA: A Simple Baseline and Beyond
In Embodied Question Answering (EmbodiedQA), an agent interacts with an environment to gather the information needed to answer user questions. Existing works have laid a solid foundation towards solving this interesting problem, but the current performance, especially in navigation, suggests that EmbodiedQA might be too challenging for contemporary approaches. In this paper, we empirically study this problem and introduce 1) a simple yet effective baseline that achieves promising performance; 2) an easier and more practical setting for EmbodiedQA in which an agent has a chance to adapt the trained model to a new environment before it actually answers users' questions. In this new setting, we randomly place a few objects in the new environments and upgrade the agent policy with a distillation network to retain the generalization ability of the trained model. On the EmbodiedQA v1 benchmark, under the standard setting, our simple baseline achieves results very competitive with the state of the art; in the new setting, this small change in setup yields a notable gain in navigation.
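The adaptation step described above can be pictured as a standard knowledge-distillation objective between the pre-trained policy (teacher) and the policy being adapted to the new environment (student). The PyTorch sketch below is our own illustration of that idea, not the paper's code; the function name, tensor shapes, and temperature value are all assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened action distributions of the
    pre-trained (teacher) policy and the adapting (student) policy.

    Keeping the student close to the teacher is what would retain the
    generalization ability of the originally trained model."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

In a setup like this, the distillation term would be added to the task loss computed on the randomly placed objects, so the policy can specialize to the new environment without forgetting what it learned during training.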
Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study
Neural generative models have become increasingly popular for building
conversational agents. They offer flexibility, can be easily adapted to new
domains, and require minimal domain engineering. A common criticism of these
systems is that they seldom understand or use the available dialog history
effectively. In this paper, we take an empirical approach to understanding how
these models use the available dialog history by studying the sensitivity of
the models to artificially introduced unnatural changes or perturbations to
their context at test time. We experiment with 10 different types of
perturbations on 4 multi-turn dialog datasets and find that commonly used
neural dialog architectures like recurrent and transformer-based seq2seq models
are rarely sensitive to most perturbations such as missing or reordering
utterances, shuffling words, etc. We also open-source our code, which we
believe will serve as a useful diagnostic tool for evaluating dialog systems
in the future.
Comment: To appear at ACL 2019 (oral; nominated for best paper)
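To make the perturbation study concrete, the utterance-level and word-level changes described above can be implemented in a few lines. The sketch below assumes a dialog history is a list of utterance strings; the function names are ours, not those of the released code.

```python
import random

def shuffle_utterances(history):
    """Utterance-level perturbation: randomly reorder the turns."""
    history = list(history)
    random.shuffle(history)
    return history

def drop_utterance(history):
    """Utterance-level perturbation: remove one turn at random."""
    i = random.randrange(len(history))
    return history[:i] + history[i + 1:]

def shuffle_words(history):
    """Word-level perturbation: shuffle the words within every utterance."""
    perturbed = []
    for utterance in history:
        words = utterance.split()
        random.shuffle(words)
        perturbed.append(" ".join(words))
    return perturbed

# A model that truly uses the history should degrade noticeably under any of
# these test-time perturbations; the paper finds that common seq2seq dialog
# models often barely change.
history = ["hi , how are you ?", "fine thanks", "what do you do ?"]
print(shuffle_words(history))
```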
VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering
Embodied Question Answering (EQA) is a recently proposed task, where an agent
is placed in a rich 3D environment and must act based solely on its egocentric
input to answer a given question. The desired outcome is that the agent learns
to combine capabilities such as scene understanding, navigation and language
understanding in order to perform complex reasoning in the visual world.
However, initial advancements combining standard vision and language methods
with imitation and reinforcement learning algorithms have shown EQA might be
too complex and challenging for these techniques. In order to investigate the
feasibility of EQA-type tasks, we build the VideoNavQA dataset that contains
pairs of questions and videos generated in the House3D environment. The goal of
this dataset is to assess question-answering performance from nearly-ideal
navigation paths, while considering a much more complete variety of questions
than current instantiations of the EQA task. We investigate several models,
adapted from popular VQA methods, on this new benchmark. This establishes an
initial understanding of how well VQA-style methods can perform within this
novel EQA paradigm.
CC is funded by the DREAM CDT and was supported by Mila during the time in
Montréal. EB is funded by IVADO. We also thank the University of Cambridge
Research Computing Services for providing HPC cluster resources.
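For context, a minimal VQA-style baseline adapted to video input, in the spirit of the models evaluated on this benchmark, could encode the question with an LSTM, pool per-frame visual features over time, and classify the answer. The PyTorch sketch below is a hedged illustration; the layer sizes, fusion scheme, and class name are our assumptions, not the benchmark's reference models.

```python
import torch
import torch.nn as nn

class VideoQABaseline(nn.Module):
    """Question LSTM + mean-pooled frame features + answer classifier."""

    def __init__(self, vocab_size, num_answers,
                 embed_dim=300, hidden_dim=512, frame_feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.question_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.frame_proj = nn.Linear(frame_feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_tokens, frame_features):
        # question_tokens: (batch, num_words)
        # frame_features:  (batch, num_frames, frame_feat_dim), e.g. CNN features
        _, (q, _) = self.question_lstm(self.embed(question_tokens))
        q = q.squeeze(0)                                 # (batch, hidden_dim)
        v = self.frame_proj(frame_features).mean(dim=1)  # temporal mean pooling
        # Element-wise fusion of question and video representations.
        return self.classifier(q * v)

model = VideoQABaseline(vocab_size=1000, num_answers=70)
logits = model(torch.randint(0, 1000, (2, 8)), torch.randn(2, 16, 2048))
```

Because the dataset provides nearly-ideal navigation paths as videos, a model of this form isolates the question-answering capability from the navigation problem, which is exactly the comparison the benchmark is designed to enable.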