Image-Question-Answer Synergistic Network for Visual Dialog
The image, the question (combined with the history for de-referencing), and the
corresponding answer are three vital components of visual dialog. Classical
visual dialog systems integrate the image, question, and history to search for
or generate the best-matched answer, and thus largely ignore the role of the
answer. In this paper, we devise a novel image-question-answer synergistic
network that values the role of the answer in precise visual dialog. We extend
the traditional one-stage solution to a two-stage solution. In the first stage,
candidate answers are coarsely scored according to their relevance to the image
and question pair. Afterward, in the second stage, answers with a high
probability of being correct are re-ranked by synergizing them with the image
and question. On the Visual Dialog v1.0 dataset, the proposed synergistic
network boosts the discriminative visual dialog model to a new state of the art
of 57.88% normalized discounted cumulative gain. A generative visual dialog
model equipped with the proposed technique also shows promising improvements.
Comment: Accepted by CVPR 2019
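
A minimal sketch of the two-stage idea above, assuming PyTorch and pre-extracted image, question, and candidate-answer features; module and variable names are illustrative, not the authors' released code.

import torch
import torch.nn as nn

class CoarseScorer(nn.Module):
    # Stage 1: score every candidate answer against the fused image+question context.
    def __init__(self, dim=512):
        super().__init__()
        self.ctx_proj = nn.Linear(2 * dim, dim)   # fuse image and question features
        self.ans_proj = nn.Linear(dim, dim)

    def forward(self, img_feat, ques_feat, ans_feats):
        # img_feat, ques_feat: (B, dim); ans_feats: (B, num_candidates, dim)
        ctx = torch.tanh(self.ctx_proj(torch.cat([img_feat, ques_feat], dim=-1)))
        return torch.einsum("bd,bnd->bn", ctx, self.ans_proj(ans_feats))

class SynergisticReRanker(nn.Module):
    # Stage 2: jointly encode (image, question, answer) triples for the shortlisted answers.
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_feat, ques_feat, topn_ans_feats):
        B, N, D = topn_ans_feats.shape
        ctx = torch.cat([img_feat, ques_feat], dim=-1).unsqueeze(1).expand(B, N, 2 * D)
        return self.mlp(torch.cat([ctx, topn_ans_feats], dim=-1)).squeeze(-1)

# Usage: coarse-score 100 candidates, keep the top 10, then re-rank only those.
B, C, D, N = 2, 100, 512, 10
img, ques, answers = torch.randn(B, D), torch.randn(B, D), torch.randn(B, C, D)
coarse = CoarseScorer(D)(img, ques, answers)                          # (B, 100) coarse scores
top_scores, top_idx = coarse.topk(N, dim=-1)
top_answers = torch.gather(answers, 1, top_idx.unsqueeze(-1).expand(B, N, D))
fine = SynergisticReRanker(D)(img, ques, top_answers)                 # (B, 10) refined scores
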
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Visual dialog is a challenging vision-language task, where a dialog agent
needs to answer a series of questions through reasoning on the image content
and dialog history. Prior work has mostly focused on various attention
mechanisms to model such intricate interactions. By contrast, in this work, we
propose VD-BERT, a simple yet effective framework of unified vision-dialog
Transformer that leverages the pretrained BERT language models for Visual
Dialog tasks. The model is unified in that (1) it captures all the interactions
between the image and the multi-turn dialog using a single-stream Transformer
encoder, and (2) it supports both answer ranking and answer generation
seamlessly through the same architecture. More crucially, we adapt BERT for the
effective fusion of vision and dialog contents via visually grounded training.
Without the need for pretraining on external vision-language data, our model
yields a new state of the art, achieving the top position in both single-model
and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog
leaderboard. Our code and pretrained models are released at
https://github.com/salesforce/VD-BERT.
Comment: EMNLP 2020 (14 pages)
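
A rough illustration of the single-stream design described above, assuming PyTorch: projected image-region features and dialog tokens are concatenated into one sequence, encoded by a single Transformer, and a candidate answer is scored from the [CLS] position. Sizes, layer counts, and names here are assumptions; the actual VD-BERT starts from pretrained BERT weights.

import torch
import torch.nn as nn

class SingleStreamVisionDialogEncoder(nn.Module):
    def __init__(self, vocab_size=30522, img_feat_dim=2048, dim=768, layers=4, heads=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(img_feat_dim, dim)      # project region features to text dim
        self.seg_emb = nn.Embedding(2, dim)               # 0 = image segment, 1 = text segment
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.rank_head = nn.Linear(dim, 1)                # answer-ranking head

    def forward(self, img_regions, dialog_token_ids):
        # img_regions: (B, R, img_feat_dim); dialog_token_ids: (B, T) holding
        # [CLS] caption [SEP] Q1 A1 [SEP] ... Qt [SEP] candidate answer [SEP]
        img = self.img_proj(img_regions)
        txt = self.tok_emb(dialog_token_ids)
        seg = torch.cat([torch.zeros_like(img[..., 0]), torch.ones_like(txt[..., 0])], dim=1).long()
        x = torch.cat([img, txt], dim=1) + self.seg_emb(seg)
        h = self.encoder(x)                                # full image-dialog cross-attention
        return self.rank_head(h[:, img.size(1)]).squeeze(-1)  # score read at the [CLS] position

model = SingleStreamVisionDialogEncoder()
score = model(torch.randn(2, 36, 2048), torch.randint(0, 30522, (2, 64)))   # (2,) ranking scores
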
Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition
In an open-world setting, it is inevitable that an intelligent agent (e.g., a
robot) will encounter visual objects, attributes or relationships it does not
recognize. In this work, we develop an agent empowered with visual curiosity,
i.e. the ability to ask questions to an Oracle (e.g., human) about the contents
in images (e.g., What is the object on the left side of the red cube?) and
build a visual recognition model based on the answers received (e.g., Cylinder).
In order to do this, the agent must (1) understand what it recognizes and what
it does not, (2) formulate a valid, unambiguous and informative language query
(a question) to ask the Oracle, (3) derive the parameters of visual classifiers
from the Oracle response and (4) leverage the updated visual classifiers to ask
further clarifying questions. Specifically, we propose a novel framework and
formulate the learning of visual curiosity as a reinforcement learning problem.
In this framework, all components of our agent, namely the visual recognition
module (to see), the question generation policy (to ask), the answer digestion
module (to understand), and the graph memory module (to memorize), are learned entirely
end-to-end to maximize the reward derived from the scene graph obtained by the
agent as a consequence of the dialog with the Oracle. Importantly, the question
generation policy is disentangled from the visual recognition system and
specifics of the environment. Consequently, we demonstrate a sort of double
generalization. Our question generation policy generalizes to new environments
and a new pair of eyes, i.e., a new visual system. Trained on a synthetic
dataset, our results show that our agent learns new visual concepts
significantly faster than several heuristic baselines, even when tested on
synthetic environments with novel objects, as well as in a realistic
environment.
Comment: 18 pages, 10 figures, Oral Presentation at the Conference on Robot Learning (CoRL) 2018
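
The interaction loop above can be caricatured with a toy example: the agent repeatedly picks a scene-graph slot it does not yet recognize, queries an oracle, digests the answer, and is rewarded by how much of the scene graph it now labels correctly. Everything here (ToyOracle, reward_from_scene_graph, the random question policy) is a hypothetical stand-in for the learned components, not the paper's environment.

import random

class ToyOracle:
    # Stands in for the human oracle: maps an unambiguous question slot to a label.
    def __init__(self, scene):
        self.scene = scene
    def answer(self, question_slot):
        return self.scene.get(question_slot, "unknown")

def reward_from_scene_graph(predicted, ground_truth):
    # Reward = fraction of scene-graph slots the agent now labels correctly.
    correct = sum(predicted.get(k) == v for k, v in ground_truth.items())
    return correct / len(ground_truth)

scene = {"left_of_red_cube": "cylinder", "behind_sphere": "cube"}
oracle, knowledge = ToyOracle(scene), {}
for step in range(len(scene)):
    unknown = [slot for slot in scene if slot not in knowledge]   # what the agent cannot recognize yet
    slot = random.choice(unknown)          # a learned policy would pick the most informative slot
    knowledge[slot] = oracle.answer(slot)  # answer digestion: update the visual classifiers
    print(step, reward_from_scene_graph(knowledge, scene))
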
History for Visual Dialog: Do we really need it?
Visual Dialog involves "understanding" the dialog history (what has been
discussed previously) and the current question (what is asked), in addition to
grounding information in the image, to generate the correct response. In this
paper, we show that co-attention models which explicitly encode dialog history
outperform models that don't, achieving state-of-the-art performance (72% NDCG
on the val set). However, we also expose shortcomings of the crowd-sourced
dataset collection procedure by showing that history is in fact required only
for a small fraction of the data and that the current evaluation metric
encourages generic replies. To that end, we propose a challenging subset
(VisDialConv) of the VisDial val set and provide a benchmark of 63% NDCG.
Comment: ACL 2020
Iterative Context-Aware Graph Inference for Visual Dialog
Visual dialog is a challenging task that requires the comprehension of the
semantic dependencies among implicit visual and textual contexts. This task can
be cast as relation inference in a graphical model with sparse contexts and an
unknown graph structure (relation descriptor), and modeling the underlying
context-aware relation inference is critical. To this end, we propose a novel
Context-Aware Graph (CAG) neural network. Each node in the graph corresponds to
a joint semantic feature, including both object-based (visual) and
history-related (textual) context representations. The graph structure
(relations in dialog) is iteratively updated using an adaptive top-K message-passing
mechanism. Specifically, in every message-passing step, each node
selects the K most relevant nodes and receives messages only from them.
Then, after the update, we impose graph attention on all the nodes to get the
final graph embedding and infer the answer. In CAG, each node has dynamic
relations in the graph (different related neighbor nodes), and only the
most relevant nodes contribute to the context-aware relational graph
inference. Experimental results on the VisDial v0.9 and v1.0 datasets show that
CAG outperforms comparative methods. Visualization results further validate the
interpretability of our method.
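
A minimal sketch of one adaptive top-K message-passing step as described above, assuming PyTorch: each node scores its relevance to all other nodes, keeps only its K most relevant neighbors, and aggregates their messages. The tensor names and the dot-product relevance function are assumptions for illustration.

import torch
import torch.nn.functional as F

def topk_message_passing(node_feats, k=3):
    # node_feats: (N, D) joint object-based (visual) and history-related (textual) node features.
    N, D = node_feats.shape
    relevance = node_feats @ node_feats.t() / D ** 0.5           # (N, N) pairwise relevance
    relevance.fill_diagonal_(float("-inf"))                      # no self-messages
    top_vals, top_idx = relevance.topk(k, dim=-1)                # dynamic neighbors per node
    weights = F.softmax(top_vals, dim=-1)                        # attention over the kept edges
    neighbor_msgs = node_feats[top_idx]                          # (N, k, D)
    messages = (weights.unsqueeze(-1) * neighbor_msgs).sum(1)    # aggregate only the top-k messages
    return node_feats + messages                                 # residual node update

nodes = torch.randn(10, 512)       # e.g., 10 object/history nodes
for _ in range(3):                 # iterative refinement over several message-passing steps
    nodes = topk_message_passing(nodes, k=3)
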
SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space
In this work, we formulate a visual dialog as an information flow in which
each piece of information is encoded with the joint visual-linguistic
representation of a single dialog round. Based on this formulation, we consider
the visual dialog task as a sequence problem consisting of ordered
visual-linguistic vectors. For featurization, we use a Dense Symmetric
Co-Attention network as a lightweight vision-language joint-representation
generator to fuse multimodal features (i.e., image and text), yielding better
computation and data efficiencies. For inference, we propose two Sequential
Dialog Networks (SeqDialN): the first uses LSTM for information propagation
(IP) and the second uses a modified Transformer for multi-step reasoning (MR).
Our architecture separates the complexity of multimodal feature fusion from
that of inference, which allows a simpler design of the inference engine. The
IP-based SeqDialN is our baseline with a simple 2-layer LSTM design that achieves
decent performance. The MR-based SeqDialN, on the other hand, recurrently refines
the semantic question/history representations through the self-attention stack
of Transformer and produces promising results on the visual dialog task. On
VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves
62.54% NDCG and 48.63% MRR; our ensemble generative SeqDialN achieves 63.78%
NDCG and 49.98% MRR, setting a new state of the art among generative visual
dialog models. We fine-tune the discriminative SeqDialN with dense annotations and boost
the performance up to 72.41% NDCG and 55.11% MRR. In this work, we discuss the
extensive experiments we have conducted to demonstrate the effectiveness of our
model components. We also provide visualization for the reasoning process from
the relevant conversation rounds and discuss our fine-tuning methods. Our code
is available at https://github.com/xiaoxiaoheimei/SeqDialN
Comment: 18 pages, 4 figures, 5 tables
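
The IP variant can be pictured with a short sketch, assuming PyTorch: each dialog round becomes one fused visual-linguistic vector, and a 2-layer LSTM propagates information across the ordered rounds. The fusion step below is stubbed with a single linear layer standing in for the Dense Symmetric Co-Attention module; shapes and names are illustrative.

import torch
import torch.nn as nn

class SeqDialIP(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=512, dim=512, num_candidates=100):
        super().__init__()
        self.fuse = nn.Linear(img_dim + txt_dim, dim)   # stand-in for the co-attention fusion
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.score = nn.Linear(dim, num_candidates)     # rank the candidate answers per round

    def forward(self, img_feat, round_txt_feats):
        # img_feat: (B, img_dim); round_txt_feats: (B, rounds, txt_dim), one vector per dialog round
        B, R, _ = round_txt_feats.shape
        img = img_feat.unsqueeze(1).expand(B, R, -1)
        rounds = torch.tanh(self.fuse(torch.cat([img, round_txt_feats], dim=-1)))
        out, _ = self.lstm(rounds)                       # information propagation across rounds
        return self.score(out)                           # (B, rounds, num_candidates)

scores = SeqDialIP()(torch.randn(2, 2048), torch.randn(2, 10, 512))
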
Learning to Respond with Your Favorite Stickers: A Framework of Unifying Multi-Modality and User Preference in Multi-Turn Dialog
Stickers with vivid and engaging expressions are becoming increasingly popular
in online messaging apps, and some works are dedicated to automatically
selecting a sticker response by matching the sticker image with previous
utterances. However, existing methods usually focus on measuring the matching
degree between the dialog context and the sticker image, which ignores the
user's preference for using stickers. Hence, in this paper, we propose to
recommend an appropriate sticker to the user based on the multi-turn dialog
context and the user's sticker usage history. Two main challenges are
confronted in this task. One is to model the user's sticker preference based on
their previous sticker selections. The other is to jointly fuse the user
preference and the matching between the dialog context and the candidate
sticker into the final prediction. To tackle these challenges, we propose a \emph{Preference Enhanced
Sticker Response Selector} (PESRS) model. Specifically, PESRS first employs a
convolution-based sticker image encoder and a self-attention-based multi-turn
dialog encoder to obtain the representations of stickers and utterances. Next,
a deep interaction network is proposed to conduct deep matching between the
sticker and each utterance. Then, we model the user preference by taking the
recently selected stickers as input and use a key-value memory network to
store the preference representation. PESRS then learns the short-term and
long-term dependencies between all interaction results with a fusion network
and dynamically fuses the user preference representation into the final sticker
selection prediction. Extensive experiments conducted on a large-scale
real-world dialog dataset show that our model achieves the state-of-the-art
performance for all commonly-used metrics. Experiments also verify the
effectiveness of each component of PESRS.
Comment: Accepted by TOIS. arXiv admin note: substantial text overlap with
arXiv:2003.0467
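
A simplified sketch of the preference-memory idea described above, assuming PyTorch: recently selected stickers populate a key-value memory, the current dialog context reads a preference vector from it, and that vector is fused with a context-sticker matching signal into a single score. Names and shapes are assumptions, not the released PESRS code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceMemoryScorer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.key_proj = nn.Linear(dim, dim)
        self.val_proj = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, 1)

    def forward(self, dialog_ctx, candidate_sticker, history_stickers):
        # dialog_ctx, candidate_sticker: (B, dim); history_stickers: (B, M, dim)
        keys = self.key_proj(history_stickers)                        # memory keys
        vals = self.val_proj(history_stickers)                        # memory values
        attn = F.softmax(torch.einsum("bd,bmd->bm", dialog_ctx, keys), dim=-1)
        preference = torch.einsum("bm,bmd->bd", attn, vals)           # user-preference read-out
        match = dialog_ctx * candidate_sticker                        # context-sticker matching
        pref_match = preference * candidate_sticker                   # preference-sticker matching
        return self.fuse(torch.cat([pref_match, match], dim=-1)).squeeze(-1)

score = PreferenceMemoryScorer()(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 5, 256))
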
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
The interest in Artificial Intelligence (AI) and its applications has seen
unprecedented growth in the last few years. This success can be partly
attributed to the advancements made in the sub-fields of AI such as Machine
Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). The
largest share of this growth in these fields has been made possible by deep
learning, a sub-area of machine learning that uses the principles of artificial
neural networks. This has created significant interest in the integration of
vision and language, with tasks designed so that they naturally lend themselves
to deep learning methods. In this survey, we focus on ten prominent tasks that
integrate language and vision by discussing their problem formulations,
methods, existing datasets, and evaluation measures, and by comparing the
results obtained with the corresponding state-of-the-art methods. Our efforts go
beyond earlier surveys which are either task-specific or concentrate only on
one type of visual content, i.e., image or video. Furthermore, we also provide
some potential future directions in this field of research with an anticipation
that this survey brings in innovative thoughts and ideas to address the
existing challenges and build new applications.
Comment: Accepted at the Journal of Artificial Intelligence Research (JAIR)
Learning Compositional Representation for Few-shot Visual Question Answering
Current methods of Visual Question Answering perform well on answers with
ample training data but have limited accuracy on novel ones with few examples.
Humans, however, can quickly adapt to these new categories from just a few
glimpses, as they learn to organize previously seen concepts to figure out the
novel class, an ability that is hardly explored by deep learning methods.
Therefore, in this paper, we propose to extract attributes from the answers
with sufficient data and later compose them to constrain the learning of the
few-shot ones. We generate a few-shot VQA dataset with a
variety of answers and their attributes without any human effort. With this
dataset, we build our attribute network to disentangle the attributes by
learning their features from parts of the image instead of the whole one.
Experimental results on the VQA v2.0 validation dataset demonstrate the
effectiveness of our proposed attribute network and the constraint between
answers and their corresponding attributes, as well as the ability of our
method to handle answers with few training examples.
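
One way to picture the attribute constraint described above is the small sketch below, assuming PyTorch: a few-shot answer representation is pulled toward the composition (here, simply the mean) of its attribute embeddings. The exact composition operator and loss used in the paper may differ; this is an assumed formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeCompositionLoss(nn.Module):
    def __init__(self, num_attributes=50, dim=256):
        super().__init__()
        self.attr_emb = nn.Embedding(num_attributes, dim)     # attribute features learned from data-rich answers

    def forward(self, answer_feat, attribute_ids):
        # answer_feat: (B, dim) few-shot answer representation
        # attribute_ids: (B, A) indices of the attributes composing each answer
        composed = self.attr_emb(attribute_ids).mean(dim=1)   # compose attributes into a prototype
        return F.mse_loss(answer_feat, composed)              # constrain the few-shot answer toward it

loss = AttributeCompositionLoss()(torch.randn(8, 256), torch.randint(0, 50, (8, 3)))
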
Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and the Environment
Tohoku University Doctor of Information Sciences thesis