Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Given an input video, its associated audio, and a brief caption, the
audio-visual scene-aware dialog (AVSD) task requires an agent to engage in a
question-answer dialog with a human about the audio-visual content. This task
thus poses a challenging multi-modal representation learning and reasoning
scenario, advances in which could influence several human-machine interaction
applications. To solve this task, we introduce a
semantics-controlled multi-modal shuffled Transformer reasoning framework,
consisting of a sequence of Transformer modules, each taking a modality as
input and producing representations conditioned on the input question. Our
proposed Transformer variant uses a shuffling scheme on its multi-head
outputs, which yields better regularization. To encode fine-grained visual
information, we present a novel dynamic scene graph representation learning
pipeline that consists of an intra-frame reasoning layer producing
spatio-semantic graph representations for every frame, and an inter-frame
aggregation module capturing temporal cues. Our entire pipeline is trained
end-to-end. We present experiments on the benchmark AVSD dataset, both on
answer generation and selection tasks. Our results demonstrate state-of-the-art
performance on all evaluation metrics.

Comment: Accepted at AAAI 2021
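
The head-shuffling idea in the abstract above can be illustrated with a short
sketch. This is a minimal, illustrative reading of "shuffling multi-head
outputs" as a training-time regularizer on top of standard self-attention; the
class name and the exact placement of the shuffle are assumptions, not the
authors' implementation.

```python
import torch
import torch.nn as nn


class ShuffledMultiHeadAttention(nn.Module):
    """Self-attention whose per-head outputs are randomly permuted during
    training before the output projection (illustrative sketch only)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t: torch.Tensor) -> torch.Tensor:
            # (B, T, d_model) -> (B, n_heads, T, d_head)
            return t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v  # (B, n_heads, T, d_head)
        if self.training:
            # Shuffle the head outputs so the output projection cannot rely
            # on a fixed head ordering (acts as a regularizer).
            heads = heads[:, torch.randperm(self.n_heads, device=x.device)]
        return self.proj(heads.transpose(1, 2).reshape(B, T, -1))
```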
Visually Grounding Instruction for History-Dependent Manipulation
This paper emphasizes the importance of a robot's ability to refer to its task
history when it executes a series of pick-and-place manipulations by following
text instructions given one by one. The advantage of referring to the
manipulation history is twofold: (1) instructions that omit details or use
co-referential expressions can be interpreted, and (2) visual information about
objects occluded by previous manipulations can be inferred. For this challenge,
we introduce the task of history-dependent manipulation, in which a series of
text instructions must be visually grounded to the proper manipulations
depending on the task history. We also present a corresponding dataset and a
deep-neural-network-based methodology, and show that our network trained on a
synthetic dataset can be applied to the real world by transferring real images
into the synthetic style with CycleGAN.

Comment: 8 pages, 6 figures
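
The sim-to-real step described above, in which real camera images are
translated into the synthetic style before grounding, can be sketched as
follows. This is a minimal illustration assuming a CycleGAN-style generator is
already trained; RealToSyntheticGenerator, ground_in_real_world, and the
(image, instruction) signature of the grounding network are hypothetical
placeholders, not the paper's code.

```python
import torch
import torch.nn as nn


class RealToSyntheticGenerator(nn.Module):
    """Stand-in for a CycleGAN generator mapping real images to the
    synthetic rendering style of the training data (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=7, padding=3), nn.Tanh(),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.net(img)


def ground_in_real_world(real_img: torch.Tensor,
                         instruction_emb: torch.Tensor,
                         generator: RealToSyntheticGenerator,
                         grounding_net: nn.Module) -> torch.Tensor:
    # 1) Translate the real image into the synthetic domain.
    with torch.no_grad():
        syn_img = generator(real_img)
    # 2) Run the grounding network that was trained only on synthetic data.
    return grounding_net(syn_img, instruction_emb)
```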