Dual Attention Networks for Visual Reference Resolution in Visual Dialog
Visual dialog (VisDial) is a task that requires an AI agent to answer a
series of questions grounded in an image. Unlike visual question answering
(VQA), the questions must capture the temporal context of a dialog history
and exploit visually grounded information. The problem of visual reference
resolution embodies these challenges: the agent must resolve ambiguous
references in a given question and locate those references in the given
image. In this paper, we propose Dual Attention Networks
(DAN) for visual reference resolution. DAN consists of two attention
networks, REFER and FIND. The REFER module learns latent relationships
between a given question and the dialog history via a self-attention
mechanism. The FIND module takes image features and the reference-aware
representations produced by REFER as input, and performs visual grounding
via a bottom-up attention mechanism. We qualitatively and quantitatively
evaluate our model on the VisDial v1.0 and v0.9 datasets, showing that DAN
outperforms the previous state-of-the-art model by a significant margin.

Comment: EMNLP 2019
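As a rough illustration of the two modules, the sketch below implements toy versions of self-attention over the question and dialog history (REFER) and attention over image region features (FIND). The function names, vector representations, and scoring are hypothetical simplifications for illustration, not the paper's implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refer(question, history):
    """Toy self-attention: the question vector attends over itself and
    the history utterance vectors (hypothetical REFER simplification)."""
    seq = [question] + history
    d = len(question)
    weights = softmax([dot(question, k) / math.sqrt(d) for k in seq])
    # Reference-aware representation: attention-weighted sum of utterances
    return [sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)]

def find(regions, ref_vec):
    """Attend over image region features conditioned on the
    reference-aware vector (hypothetical FIND sketch)."""
    d = len(ref_vec)
    weights = softmax([dot(r, ref_vec) / math.sqrt(d) for r in regions])
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(d)]
```

Chaining `find(regions, refer(question, history))` mirrors the paper's pipeline: the grounded visual vector is a convex combination of region features, weighted by their relevance to the disambiguated question.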
DeepStory: Video Story QA by Deep Embedded Memory Networks
Question-answering (QA) on video contents is a significant challenge for
achieving human-level intelligence as it involves both vision and language in
real-world settings. Here we demonstrate the possibility of an AI agent
performing video story QA by learning from a large amount of cartoon videos. We
develop a video-story learning model, i.e. Deep Embedded Memory Networks
(DEMN), to reconstruct stories from a joint scene-dialogue video stream using a
latent embedding space of observed data. The video stories are stored in a
long-term memory component. For a given question, an LSTM-based attention model
uses the long-term memory to recall the best question-story-answer triplet by
focusing on specific words containing key information. We trained the DEMN on a
novel QA dataset of children's cartoon video series, Pororo. The dataset
contains 16,066 scene-dialogue pairs of 20.5-hour videos, 27,328 fine-grained
sentences for scene description, and 8,913 story-related QA pairs. Our
experimental results show that the DEMN outperforms other QA models, mainly
due to (1) reconstructing video stories in a combined scene-dialogue form
that utilizes the latent embedding and (2) the attention mechanism. DEMN
also achieved state-of-the-art results on the MovieQA benchmark.

Comment: 7 pages, accepted for IJCAI 2017
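The recall step can be illustrated with a toy version: stories are stored in long-term memory, and the best match for a question is retrieved by embedding similarity. The bag-of-words embedding and cosine scoring below are crude stand-ins for DEMN's learned latent space and LSTM-based attention, used here only for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Bag-of-words vector standing in for DEMN's learned latent
    embedding space (hypothetical simplification)."""
    cleaned = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return Counter(cleaned.split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

def recall(memory, question):
    """Recall the best-matching stored story for a question,
    a crude stand-in for the LSTM-based attention model."""
    q = embed(question)
    return max(memory, key=lambda story: cosine(q, embed(story)))
```

In the actual model, the stored representations come from a joint scene-dialogue embedding and the scoring attends to the specific words carrying key information, rather than treating all overlapping words equally.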
Simulating Problem Difficulty in Arithmetic Cognition Through Dynamic Connectionist Models
The present study aims to investigate similarities between how humans and
connectionist models experience difficulty in arithmetic problems. Problem
difficulty was operationalized by the number of carries involved in solving a
given problem. Problem difficulty was measured in humans by response time, and
in models by computational steps. The present study found that both humans and
connectionist models experience difficulty similarly when solving binary
addition and subtraction. Specifically, both agents found difficulty to be
strictly increasing with respect to the number of carries. Another notable
similarity is that problem difficulty increases more steeply in subtraction
than in addition, for both humans and connectionist models. Further
investigation of two model hyperparameters, confidence threshold and hidden
dimension, shows that higher confidence thresholds cause the model to take
more computational steps to arrive at the correct answer. Likewise, larger
hidden dimensions cause the model to take more computational steps to
correctly answer arithmetic problems, although the effect of hidden
dimension is negligible.

Comment: 7 pages; 15 figures; 5 tables; published in the proceedings of the
17th International Conference on Cognitive Modelling (ICCM 2019)
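The carry-based difficulty measure can be made concrete with a short helper that counts the carries produced when adding two numbers in binary; the function name is ours, not the paper's.

```python
def carry_count(a, b):
    """Count the carries generated when adding non-negative integers
    a and b in binary -- the paper's operationalization of problem
    difficulty for addition."""
    carries, carry = 0, 0
    while a or b or carry:
        s = (a & 1) + (b & 1) + carry  # sum of current bits plus carry-in
        carry = s >> 1                 # carry-out into the next position
        carries += carry
        a >>= 1
        b >>= 1
    return carries
```

Under this measure, 1 + 1 (binary 1 + 1) involves one carry, while 5 + 2 (binary 101 + 010) involves none, matching the intuition that aligned set bits make a problem harder.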