Dual Attention Networks for Visual Reference Resolution in Visual Dialog
Visual dialog (VisDial) is a task which requires an AI agent to answer a
series of questions grounded in an image. Unlike in visual question answering
(VQA), the series of questions should be able to capture a temporal context
from a dialog history and exploit visually-grounded information. A problem
called visual reference resolution involves these challenges, requiring the
agent to resolve ambiguous references in a given question and find the
references in a given image. In this paper, we propose Dual Attention Networks
(DAN) for visual reference resolution. DAN consists of two kinds of attention
networks, REFER and FIND. Specifically, the REFER module learns latent
relationships between a given question and a dialog history by employing a
self-attention mechanism. The FIND module takes image features and reference-aware
representations (i.e., the output of the REFER module) as input, and performs
visual grounding via a bottom-up attention mechanism. We qualitatively and
quantitatively evaluate our model on VisDial v1.0 and v0.9 datasets, showing
that DAN outperforms the previous state-of-the-art model by a significant
margin.
Comment: EMNLP 201
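The idea of relating a question to the dialog history through attention can be illustrated with a minimal sketch. This is not the DAN implementation: it assumes a single head of plain scaled dot-product attention, toy two-dimensional vectors, and a hypothetical function name `attend`. The abstract does not specify any of these details.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention (illustrative assumption).

    query:  question vector, list of floats of length d
    keys:   one vector per dialog-history turn, each of length d
    values: one vector per dialog-history turn
    Returns the attention weights and the weighted sum of the values.
    """
    d = len(query)
    # Similarity of the question to each history turn, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Numerically stable softmax over the history turns.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: history values mixed according to the weights.
    context = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, context


# Toy example: the question vector is closest to the first history turn.
question = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend(question, history, history)
```

In this toy case the first history turn receives the larger weight, which is the behaviour one would want when a question refers back to an earlier turn.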
Cross-Modal Alignment Learning of Vision-Language Conceptual Systems
Human infants learn the names of objects and develop their own conceptual
systems without explicit supervision. In this study, we propose methods for
learning aligned vision-language conceptual systems inspired by infants' word
learning mechanisms. The proposed model learns the associations of visual
objects and words online and gradually constructs cross-modal relational graph
networks. We also propose an aligned cross-modal representation
learning method that learns semantic representations of visual objects and
words in a self-supervised manner based on the cross-modal relational graph
networks. It allows entities of different modalities with conceptually the same
meaning to have similar semantic representation vectors. We quantitatively and
qualitatively evaluate our method, including object-to-word mapping and
zero-shot learning tasks, showing that the proposed model significantly
outperforms the baselines and that each conceptual system is topologically
aligned.
Comment: 19 pages, 4 figures
Simulating Problem Difficulty in Arithmetic Cognition Through Dynamic Connectionist Models
The present study aims to investigate similarities between how humans and
connectionist models experience difficulty in arithmetic problems. Problem
difficulty was operationalized by the number of carries involved in solving a
given problem. Problem difficulty was measured in humans by response time, and
in models by computational steps. The present study found that both humans and
connectionist models experience difficulty similarly when solving binary
addition and subtraction. Specifically, both agents found difficulty to be
strictly increasing with respect to the number of carries. Another notable
similarity is that problem difficulty increases more steeply in subtraction
than in addition, for both humans and connectionist models. Further
investigation of two model hyperparameters, confidence threshold and hidden
dimension, shows that higher confidence thresholds cause the model to take more
computational steps to arrive at the correct answer. Likewise, larger hidden
dimensions cause the model to take more computational steps to correctly answer
arithmetic problems; however, the effect of hidden dimension is negligible.
Comment: 7 pages; 15 figures; 5 tables; Published in the proceedings of the
17th International Conference on Cognitive Modelling (ICCM 2019)
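The difficulty measure used in the abstract above, the number of carries in a binary problem, can be made concrete with a short sketch. The function names `count_carries` and `count_borrows` are hypothetical, and treating borrows as the subtraction counterpart of carries is an assumption. The abstract does not say how the authors counted them.

```python
def count_carries(a: int, b: int) -> int:
    """Count carry operations when adding two non-negative integers in binary."""
    carries = 0
    carry = 0
    while a or b:
        total = (a & 1) + (b & 1) + carry  # sum of the current bit column
        carry = total >> 1                 # 1 if the column overflows
        carries += carry
        a >>= 1
        b >>= 1
    return carries


def count_borrows(a: int, b: int) -> int:
    """Count borrow operations when computing a - b in binary (requires a >= b >= 0)."""
    borrows = 0
    borrow = 0
    while a or b:
        diff = (a & 1) - (b & 1) - borrow  # difference in the current bit column
        borrow = 1 if diff < 0 else 0      # borrow from the next column if negative
        borrows += borrow
        a >>= 1
        b >>= 1
    return borrows


# 0b11 + 0b01 = 0b100 needs two carries; 0b10 + 0b01 needs none.
easy = count_carries(0b10, 0b01)
hard = count_carries(0b11, 0b01)
```

Under this operationalization, problems can be grouped by carry count and compared against response times or model step counts.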