Dual Attention Networks for Visual Reference Resolution in Visual Dialog
Visual dialog (VisDial) is a task that requires an AI agent to answer a
series of questions grounded in an image. Unlike visual question answering
(VQA), the agent must capture the temporal context of a dialog history and
exploit visually grounded information across the question series. The problem
of visual reference resolution embodies these challenges: the agent must
resolve ambiguous references in a given question and locate those references
in the image. In this paper, we propose Dual Attention Networks (DAN) for
visual reference resolution. DAN consists of two kinds of attention networks,
REFER and FIND. Specifically, the REFER module learns latent relationships
between a given question and the dialog history through a self-attention
mechanism. The FIND module takes image features and the reference-aware
representations (i.e., the output of the REFER module) as input and performs
visual grounding via a bottom-up attention mechanism. We qualitatively and
quantitatively evaluate our model on the VisDial v1.0 and v0.9 datasets,
showing that DAN outperforms the previous state-of-the-art model by a significant
margin.

Comment: EMNLP 2019
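To make the two-module structure concrete, here is a minimal PyTorch sketch of the REFER/FIND pipeline described in the abstract. Every dimension, layer choice, and identifier other than REFER and FIND is an illustrative assumption, not the authors' exact architecture.

```python
# Sketch of the DAN two-module idea: REFER relates the question to the
# dialog history; FIND grounds the result in bottom-up image region features.
# All sizes and layer choices are assumptions for illustration.
import torch
import torch.nn as nn


class ReferModule(nn.Module):
    """Attends from the question to dialog-history embeddings,
    producing a reference-aware question representation."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # question: (B, 1, D); history: (B, T, D)
        attended, _ = self.attn(question, history, history)
        return self.norm(question + attended)  # reference-aware representation


class FindModule(nn.Module):
    """Soft attention over bottom-up image region features,
    queried by the reference-aware representation."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, query: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # query: (B, 1, D); regions: (B, R, D) bottom-up region features
        logits = self.score(torch.tanh(self.proj(regions) + query))  # (B, R, 1)
        weights = torch.softmax(logits, dim=1)
        return (weights * regions).sum(dim=1)  # grounded vector, (B, D)


if __name__ == "__main__":
    B, T, R, D = 2, 10, 36, 512
    refer, find = ReferModule(D), FindModule(D)
    q, h, v = torch.randn(B, 1, D), torch.randn(B, T, D), torch.randn(B, R, D)
    grounded = find(refer(q, h), v)
    print(grounded.shape)  # torch.Size([2, 512])
```

The key design point the abstract emphasizes is the ordering: visual grounding happens only after the question has been disambiguated against the history, so FIND consumes REFER's output rather than the raw question.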
Tracking Gaze and Visual Focus of Attention of People Involved in Social Interaction
The visual focus of attention (VFOA) has been recognized as a prominent
conversational cue. We are interested in estimating and tracking the VFOAs
associated with multi-party social interactions. We note that in this type of
situation the participants look either at each other or at an object of
interest, so their eyes are not always visible. Consequently, neither gaze
nor VFOA estimation can rely on eye detection and tracking. We propose a
method that exploits the correlation between eye gaze and head movements. Both
VFOA and gaze are modeled as latent variables in a Bayesian switching
state-space model. The proposed formulation leads to a tractable learning
procedure and to an efficient algorithm that simultaneously tracks gaze and
visual focus. The method is tested and benchmarked using two publicly available
datasets that contain typical multi-party human-robot and human-human
interactions.

Comment: 15 pages, 8 figures, 6 tables
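To illustrate the model class the abstract describes, below is a minimal generative sketch of a switching state-space model in this spirit: a discrete VFOA variable switches the dynamics of a continuous latent gaze state, and head pose is the noisy observation that partially follows gaze. All parameters and dimensions are illustrative assumptions; the paper's actual learning and tracking procedure is not reproduced here.

```python
# Generative sketch of a switching state-space model: the discrete VFOA
# target selects which dynamics drive the latent gaze state, and the
# observed head pose lags behind gaze. Parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

K = 3                      # VFOA targets (e.g., two people, one object)
T = 50                     # time steps
targets = rng.uniform(-1.0, 1.0, size=(K, 2))    # 2-D gaze direction per target
trans = np.full((K, K), 0.05) + np.eye(K) * 0.85  # sticky VFOA transitions
trans /= trans.sum(axis=1, keepdims=True)

alpha = 0.6                # head pose follows gaze only partially
q_std, r_std = 0.05, 0.1   # process and observation noise

vfoa = np.zeros(T, dtype=int)   # discrete switching variable (latent)
gaze = np.zeros((T, 2))         # continuous gaze state (latent)
head = np.zeros((T, 2))         # observed head pose

for t in range(1, T):
    # VFOA evolves as a Markov chain over targets.
    vfoa[t] = rng.choice(K, p=trans[vfoa[t - 1]])
    # Gaze is pulled toward the current target (switching linear dynamics).
    gaze[t] = gaze[t - 1] + 0.5 * (targets[vfoa[t]] - gaze[t - 1]) \
        + rng.normal(0.0, q_std, size=2)
    # Head pose is a noisy, lagging follower of gaze: the gaze/head
    # correlation that the method exploits when the eyes are not visible.
    head[t] = alpha * gaze[t] + (1 - alpha) * head[t - 1] \
        + rng.normal(0.0, r_std, size=2)

print(vfoa[:10], head[-1])
```

Inference in such a model, recovering `vfoa` and `gaze` jointly from the observed `head` sequence, is what the abstract refers to as the tractable learning and tracking procedure; a switching Kalman filter or variational approximation would be a standard choice for that step.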