3,206 research outputs found
Recurrent Multimodal Interaction for Referring Image Segmentation
In this paper we are interested in the problem of image segmentation given
natural language descriptions, i.e. referring expressions. Existing works
tackle this problem by first modeling images and sentences independently and
then segment images by combining these two types of representations. We argue
that learning word-to-image interaction is more native in the sense of jointly
modeling two modalities for the image segmentation task, and we propose
convolutional multimodal LSTM to encode the sequential interactions between
individual words, visual information, and spatial information. We show that our
proposed model outperforms the baseline model on benchmark datasets. In
addition, we analyze the intermediate output of the proposed multimodal LSTM
approach and empirically explain how this approach enforces a more effective
word-to-image interaction.Comment: To appear in ICCV 2017. See http://www.cs.jhu.edu/~cxliu/ for code
and supplementary materia
Symbol Emergence in Robotics: A Survey
Humans can learn the use of language through physical interaction with their
environment and semiotic communication with other people. It is very important
to obtain a computational understanding of how humans can form a symbol system
and obtain semiotic skills through their autonomous mental development.
Recently, many studies have been conducted on the construction of robotic
systems and machine-learning methods that can learn the use of language through
embodied multimodal interaction with their environment and other systems.
Understanding human social interactions and developing a robot that can
smoothly communicate with human users in the long term, requires an
understanding of the dynamics of symbol systems and is crucially important. The
embodied cognition and social interaction of participants gradually change a
symbol system in a constructive manner. In this paper, we introduce a field of
research called symbol emergence in robotics (SER). SER is a constructive
approach towards an emergent symbol system. The emergent symbol system is
socially self-organized through both semiotic communications and physical
interactions with autonomous cognitive developmental agents, i.e., humans and
developmental robots. Specifically, we describe some state-of-art research
topics concerning SER, e.g., multimodal categorization, word discovery, and a
double articulation analysis, that enable a robot to obtain words and their
embodied meanings from raw sensory--motor information, including visual
information, haptic information, auditory information, and acoustic speech
signals, in a totally unsupervised manner. Finally, we suggest future
directions of research in SER.Comment: submitted to Advanced Robotic
Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR) i.e. to localize a target
object in a visual scene coming with a language description. Humans perceive
the world more as continued video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion features. Humans also gaze at the object when they issue a referring
expression. Existing works for OR mostly focus on static images only, which
fall short in providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30, 000 objects over 5, 000 stereo video sequences annotated for their
descriptions and gaze. We further propose a novel network model for OR in
videos, by integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context. Our method
outperforms previousOR methods. For dataset and code, please refer
https://people.ee.ethz.ch/~arunv/ORGaze.html.Comment: Accepted to CVPR 2018, 10 pages, 6 figure
Linguistic Structure Guided Context Modeling for Referring Image Segmentation
Referring image segmentation aims to predict the foreground mask of the
object referred by a natural language sentence. Multimodal context of the
sentence is crucial to distinguish the referent from the background. Existing
methods either insufficiently or redundantly model the multimodal context. To
tackle this problem, we propose a "gather-propagate-distribute" scheme to model
multimodal context by cross-modal interaction and implement this scheme as a
novel Linguistic Structure guided Context Modeling (LSCM) module. Our LSCM
module builds a Dependency Parsing Tree suppressed Word Graph (DPT-WG) which
guides all the words to include valid multimodal context of the sentence while
excluding disturbing ones through three steps over the multimodal feature,
i.e., gathering, constrained propagation and distributing. Extensive
experiments on four benchmarks demonstrate that our method outperforms all the
previous state-of-the-arts.Comment: Accepted by ECCV 2020. Code is available at
https://github.com/spyflying/LSCM-Refse
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Modeling textual or visual information with vector representations trained
from large language or visual datasets has been successfully explored in recent
years. However, tasks such as visual question answering require combining these
vector representations with each other. Approaches to multimodal pooling
include element-wise product or sum, as well as concatenation of the visual and
textual representations. We hypothesize that these methods are not as
expressive as an outer product of the visual and textual vectors. As the outer
product is typically infeasible due to its high dimensionality, we instead
propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and
expressively combine multimodal features. We extensively evaluate MCB on the
visual question answering and grounding tasks. We consistently show the benefit
of MCB over ablations without MCB. For visual question answering, we present an
architecture which uses MCB twice, once for predicting attention over spatial
features and again to combine the attended representation with the question
representation. This model outperforms the state-of-the-art on the Visual7W
dataset and the VQA challenge.Comment: Accepted to EMNLP 201
- …