The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns
visual concepts, words, and semantic parsing of sentences without explicit
supervision on any of them; instead, our model learns by simply looking at
images and reading paired questions and answers. Our model builds an
object-based scene representation and translates sentences into executable,
symbolic programs. To bridge the learning of the two modules, we use a
neuro-symbolic reasoning module that executes these programs on the latent
scene representation. Analogous to human concept learning, the perception
module learns visual concepts from the language description of the object
being referred to. Meanwhile, the learned visual concepts facilitate learning
new words and parsing new sentences. We use curriculum learning to guide the
search over the large compositional space of images and language. Extensive
experiments demonstrate the accuracy and efficiency of our model on learning
visual concepts, word representations, and semantic parsing of sentences.
Further, our method allows easy generalization to new object attributes,
compositions, language concepts, scenes and questions, and even new program
domains. It also empowers applications including visual question answering and
bidirectional image-text retrieval.
Comment: ICLR 2019 (Oral). Project page: http://nscl.csail.mit.edu
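The core mechanism the abstract describes — executing a symbolic program over an object-based scene representation — can be sketched minimally as follows. This is an illustrative assumption, not the NS-CL implementation: the real model scores concepts probabilistically over learned embeddings, whereas the toy scene below uses hard-coded attribute dicts.

```python
# Illustrative sketch only: a symbolic program run op-by-op over an
# object-based scene representation (NS-CL itself uses learned concept
# embeddings and differentiable execution, not attribute dicts).

# Toy scene: each detected object is an attribute dict.
scene = [
    {"color": "red", "shape": "cube"},
    {"color": "blue", "shape": "sphere"},
    {"color": "red", "shape": "sphere"},
]

def execute(program, objects):
    """Run a program (a list of op tuples) over the scene."""
    state = objects
    for op, *args in program:
        if op == "filter":      # keep objects whose attr matches value
            attr, value = args
            state = [o for o in state if o[attr] == value]
        elif op == "count":     # reduce the object set to an integer
            state = len(state)
        elif op == "query":     # read an attribute off the remaining object
            (attr,) = args
            state = state[0][attr]
    return state

# "What shape is the blue object?" -> filter(color=blue); query(shape)
answer = execute([("filter", "color", "blue"), ("query", "shape")], scene)
```

The same executor answers counting questions by ending the program with a `count` op instead of `query`.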
Embodied Question Answering
We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where
an agent is spawned at a random location in a 3D environment and asked a
question ("What color is the car?"). In order to answer, the agent must first
intelligently navigate to explore the environment, gather information through
first-person (egocentric) vision, and then answer the question ("orange").
This challenging task requires a range of AI skills -- active perception,
language understanding, goal-driven navigation, commonsense reasoning, and
grounding of language into actions. In this work, we develop the environments,
end-to-end-trained reinforcement learning agents, and evaluation protocols for
EmbodiedQA.
Comment: 20 pages, 13 figures, Webpage: https://embodiedqa.org
Learning by Asking Questions
We introduce an interactive learning framework for the development and
testing of intelligent visual systems, called learning-by-asking (LBA). We
explore LBA in the context of the Visual Question Answering (VQA) task. LBA differs
from standard VQA training in that most questions are not observed during
training time, and the learner must ask questions it wants answers to. Thus,
LBA more closely mimics natural learning and has the potential to be more
data-efficient than the traditional VQA setting. We present a model that
performs LBA on the CLEVR dataset, and show that it automatically discovers an
easy-to-hard curriculum when learning interactively from an oracle. Our LBA
generated data consistently matches or outperforms the CLEVR train data and is
more sample-efficient. We also show that our model asks questions that
generalize to state-of-the-art VQA models and to novel test-time distributions.
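Stripped to its skeleton, the interactive setting reduces to a loop: the learner chooses the question it most wants answered, an oracle answers, and the (question, answer) pair becomes training data. The sketch below is an assumption for illustration — the canned oracle and the least-practiced heuristic stand in for the paper's learned question generator and informativeness criterion.

```python
# Toy learning-by-asking loop (illustrative only; not the paper's model,
# which generates CLEVR-style questions conditioned on images).
from collections import Counter

QUESTION_TYPES = ["count", "exist", "compare"]

def oracle(q_type):
    # Stand-in oracle: a canned answer per question type.
    return {"count": 3, "exist": True, "compare": "yes"}[q_type]

def choose_question(asked):
    # Curiosity heuristic: ask the least-practiced question type.
    return min(QUESTION_TYPES, key=lambda t: asked[t])

def lba_loop(n_rounds):
    asked, data = Counter(), []
    for _ in range(n_rounds):
        q = choose_question(asked)   # learner decides what to ask
        a = oracle(q)                # oracle supplies the answer
        asked[q] += 1
        data.append((q, a))          # the pair becomes training data
    return asked

counts = lba_loop(9)
```

With this heuristic the learner spreads its questions evenly over the three types, a crude analogue of the easy-to-hard curriculum the paper reports emerging automatically.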
TallyQA: Answering Complex Counting Questions
Most counting questions in visual question answering (VQA) datasets are
simple and require no more than object detection. Here, we study algorithms for
complex counting questions that involve relationships between objects,
attribute identification, reasoning, and more. To do this, we created TallyQA,
the world's largest dataset for open-ended counting. We propose a new algorithm
for counting that uses relation networks with region proposals. Our method lets
relation networks be efficiently used with high-resolution imagery. It yields
state-of-the-art results compared to baseline and recent systems on both
TallyQA and the HowMany-QA benchmark.
Comment: To appear in AAAI 2019. To download the dataset, go to
http://www.manojacharya.com/
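A relation network scores every pair of region-proposal features and aggregates the pairwise scores into an answer. The toy version below shows only the shape of that computation; the dot-product pair function `g` and identity readout `f` are assumptions for illustration, standing in for the paper's learned networks (which also condition on the question).

```python
# Relation-network-style aggregation over region proposals
# (illustrative sketch; g and f are learned MLPs in practice).
import numpy as np

def relation_network(proposals, g, f):
    # Score every ordered pair of proposal features with g,
    # sum the pairwise scores, then map the aggregate through f.
    pair_sum = sum(g(a, b) for a in proposals for b in proposals)
    return f(pair_sum)

# Toy proposals: two features of one "class", one of another.
proposals = [np.array([1.0, 0.0]),
             np.array([1.0, 0.0]),
             np.array([0.0, 1.0])]

def g(a, b):
    return float(a @ b)  # 1.0 when the two proposals match, else 0.0

result = relation_network(proposals, g, lambda s: s)  # identity readout
```

Because `g` acts only on pairs, the cost is quadratic in the number of proposals, which is why using region proposals (rather than dense grid cells) matters for high-resolution imagery.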
SIMCO: SIMilarity-based object COunting
We present SIMCO, the first agnostic multi-class object counting approach.
SIMCO starts by detecting foreground objects through a novel Mask RCNN-based
architecture trained beforehand (just once) on a brand-new synthetic 2D shape
dataset, InShape; the idea is to highlight every object resembling a primitive
2D shape (circle, square, rectangle, etc.). Each object detected is described
by a low-dimensional embedding, obtained from a novel similarity-based head
branch; the latter implements a triplet loss, encouraging similar objects
(same 2D shape, color, and scale) to map close together. Subsequently, SIMCO uses this
embedding for clustering, so that different types of objects can emerge and be
counted, making SIMCO the very first multi-class unsupervised counter.
Experiments show that SIMCO provides state-of-the-art scores on counting
benchmarks and that it can also help in many challenging image understanding
tasks.
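The two ingredients named above — a triplet loss on object embeddings and clustering of those embeddings so that classes emerge and can be counted — can be sketched as follows. The greedy centroid clustering and all constants are assumptions for illustration, not SIMCO's actual pipeline (which clusters embeddings produced by the trained similarity head).

```python
# Illustrative sketch: triplet loss on embeddings, then clustering
# to count objects per emergent class (not SIMCO's implementation).
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull same-class embeddings together, push different ones apart.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def count_by_clustering(embeddings, threshold=0.5):
    # Greedy clustering: join the first cluster whose centroid is
    # within `threshold`, otherwise start a new cluster.
    clusters = []
    for e in embeddings:
        for c in clusters:
            if np.linalg.norm(e - np.mean(c, axis=0)) < threshold:
                c.append(e)
                break
        else:
            clusters.append([e])
    # One count per emergent cluster (i.e., per object class).
    return {i: len(c) for i, c in enumerate(clusters)}

# Toy embeddings: two tight groups -> two classes of two objects each.
embs = [np.array([0.0, 0.0]), np.array([0.1, 0.0]),
        np.array([2.0, 2.0]), np.array([2.1, 2.0])]
counts = count_by_clustering(embs)
```

A well-trained triplet embedding makes the clustering step trivial: if the loss is zero on all triplets, same-class points are already at least `margin` closer to each other than to any other class.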