A system that learns to recognize 3-D objects
A system that learns to recognize 3-D objects from single and
multiple views is presented. It consists of three parts: a simulator
of 3-D figures, a learner, and a recognizer.
The 3-D figure simulator generates and plots line drawings of
certain 3-D objects. A series of transformations leads to a number of
2-D images of a 3-D object, which are considered as different views
and are the basic input to the next two parts.
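One way to picture the simulator's "series of transformations" is to rotate an object's 3-D vertices and then project them onto an image plane, so that each rotation angle yields a different 2-D view. The sketch below assumes rotation about the vertical axis and an orthographic projection; the abstract does not say which transformations or projection the simulator uses. It computes only projected vertex positions, not the plotted line drawing, and the function names are hypothetical.

```python
import math


def rotate_y(v, theta):
    """Rotate a 3-D point (x, y, z) by theta radians about the vertical (y) axis."""
    x, y, z = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z, y, -s * x + c * z)


def project(v):
    """Orthographic projection onto the x-y image plane: drop the depth coordinate."""
    x, y, _ = v
    return (x, y)


def views(vertices, angles):
    """Return one 2-D view (a list of projected vertices) per rotation angle."""
    return [[project(rotate_y(v, a)) for v in vertices] for a in angles]


# Vertices of a unit cube; edges are not needed to compute vertex positions.
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
for i, view in enumerate(views(cube, [0.0, math.pi / 4])):
    print(i, [(round(px, 2), round(py, 2)) for px, py in view])
```

Drawing the view would then connect the projected vertices along the object's known edges.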
The learner works in three stages using the method of learning
from examples. In the first stage an elementary-concept learner learns
the basic entities that make up a line drawing. In the second stage a
multiple-view learner learns the definitions of 3-D objects that are to
be recognized from multiple views. In the third stage a single-view
learner learns how to recognize the same objects from single views.
The recognizer is presented with line drawings representing 3-D
scenes. A single-view recognizer segments the input into faces of
possible 3-D objects, and attempts to match the segmented scene with a
set of single-view definitions of 3-D objects. The result of the
recognition may include several alternative answers, corresponding to
different 3-D objects. A unique answer can be obtained by making
assumptions about hidden elements (e.g., faces) of an object and using a
multiple-view recognizer. Both single-view and multiple-view recognition
are based on the structural relations of the elements that make up a
3-D object. Some analytical elements (e.g., angles) of the objects are
also calculated in order to determine point containment and convexity.
The system performs well on polyhedra with triangular and
quadrilateral faces. A discussion of the system's performance and
suggestions for further development are given at the end.
The simulator and the part of the recognizer that makes the
analytical calculations are written in C. The learner and the rest
of the recognizer are written in PROLOG.
Unsupervised Discovery of Parts, Structure, and Dynamics
Humans easily recognize object parts and their hierarchical structure by
watching how they move; they can then predict how each part moves in the
future. In this paper, we propose a novel formulation that simultaneously
learns a hierarchical, disentangled object representation and a dynamics model
for object parts from unlabeled videos. Our Parts, Structure, and Dynamics
(PSD) model learns to, first, recognize the object parts via a layered image
representation; second, predict hierarchy via a structural descriptor that
composes low-level concepts into a hierarchical structure; and third, model the
system dynamics by predicting the future. Experiments on multiple real and
synthetic datasets demonstrate that our PSD model works well on all three
tasks: segmenting object parts, building their hierarchical structure, and
capturing their motion distributions.
Comment: ICLR 2019. The first two authors contributed equally to this work.
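The abstract says the structural descriptor composes low-level concepts into a hierarchy. A simple way to illustrate why a part hierarchy helps with dynamics is to let each part's observed motion equal its own local motion plus the motion of all its ancestors, so an arm moves along with the torso it is attached to. The sketch below assumes exactly that additive rule and a parent-index encoding of the tree; `global_motion`, the parts, and the numbers are hypothetical, and this is not the PSD model's learned descriptor.

```python
def global_motion(local, parent):
    """Accumulate each part's local displacement with those of its ancestors.

    local:  list of (dx, dy) displacements, one per part
    parent: parent[i] is the index of part i's parent, or None for a root;
            the parent links must form a tree (no cycles)
    """
    n = len(local)
    memo = [None] * n

    def resolve(i):
        if memo[i] is None:
            dx, dy = local[i]
            if parent[i] is not None:
                px, py = resolve(parent[i])  # parent's motion is inherited
                dx, dy = dx + px, dy + py
            memo[i] = (dx, dy)
        return memo[i]

    return [resolve(i) for i in range(n)]


# Hypothetical parts: a torso (root), an arm on the torso, a hand on the arm.
local = [(2.0, 0.0), (0.0, 1.0), (0.5, 0.0)]
parent = [None, 0, 1]
print(global_motion(local, parent))  # [(2.0, 0.0), (2.0, 1.0), (2.5, 1.0)]
```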
Active Object Localization in Visual Situations
We describe a method for performing active localization of objects in
instances of visual situations. A visual situation is an abstract
concept---e.g., "a boxing match", "a birthday party", "walking the dog",
"waiting for a bus"---whose image instantiations are linked more by their
common spatial and semantic structure than by low-level visual similarity. Our
system combines given and learned knowledge of the structure of a particular
situation, and adapts that knowledge to a new situation instance as it actively
searches for objects. More specifically, the system learns a set of probability
distributions describing spatial and other relationships among relevant
objects. The system uses those distributions to iteratively sample object
proposals on a test image, but also continually uses information from those
object proposals to adaptively modify the distributions based on what the
system has detected. We test our approach's ability to efficiently localize
objects, using a situation-specific image dataset created by our group. We
compare the results with several baselines and variations on our method, and
demonstrate the strong benefit of using situation knowledge and active
context-driven localization. Finally, we contrast our method with several other
approaches that use context as well as active search for object localization in
images.
Comment: 14 pages.
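The loop described in this abstract, which samples proposals from learned distributions and then updates those distributions from what has been detected, can be sketched with a single 2-D Gaussian over one object's center. The update rule here is a cross-entropy-method step (refit the Gaussian to the best-scoring samples). That rule is a stand-in chosen for illustration; the abstract does not specify how the system modifies its distributions. The scoring function, the prior, and `localize` are hypothetical.

```python
import random
import statistics


def localize(score, mean, std, iters=20, samples=50, elite=10, seed=0):
    """Iteratively sample candidate centers from a 2-D Gaussian and refit the
    Gaussian to the highest-scoring candidates (a cross-entropy-style update)."""
    rng = random.Random(seed)
    best = None
    for _ in range(iters):
        cands = [(rng.gauss(mean[0], std[0]), rng.gauss(mean[1], std[1]))
                 for _ in range(samples)]
        cands.sort(key=score, reverse=True)
        top = cands[:elite]
        if best is None or score(top[0]) > score(best):
            best = top[0]
        # Adapt the proposal distribution to what scored well this round.
        mean = (statistics.fmean(c[0] for c in top),
                statistics.fmean(c[1] for c in top))
        std = (max(statistics.pstdev([c[0] for c in top]), 1e-3),
               max(statistics.pstdev([c[1] for c in top]), 1e-3))
    return best


def score(c):
    """Hypothetical detector score: highest when the proposal sits at (30, 40)."""
    return -((c[0] - 30.0) ** 2 + (c[1] - 40.0) ** 2)


# Prior placed relative to an already-detected object (an assumption for illustration).
print(localize(score, mean=(20.0, 35.0), std=(15.0, 15.0)))
```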
Semantic Image Retrieval via Active Grounding of Visual Situations
We describe a novel architecture for semantic image retrieval---in
particular, retrieval of instances of visual situations. Visual situations are
concepts such as "a boxing match," "walking the dog," "a crowd waiting for a
bus," or "a game of ping-pong," whose instantiations in images are linked more
by their common spatial and semantic structure than by low-level visual
similarity. Given a query situation description, our architecture---called
Situate---learns models capturing the visual features of expected objects as
well as the expected spatial configuration of relationships among objects. Given a
new image, Situate uses these models in an attempt to ground (i.e., to create a
bounding box locating) each expected component of the situation in the image
via an active search procedure. Situate uses the resulting grounding to compute
a score indicating the degree to which the new image is judged to contain an
instance of the situation. Such scores can be used to rank images in a
collection as part of a retrieval system. In the preliminary study described
here, we demonstrate the promise of this system by comparing Situate's
performance with that of two baseline methods, as well as with a related
semantic image-retrieval system based on "scene graphs."
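Situate turns its grounding of each expected component into a single score that can be used to rank a collection. The abstract does not say how per-component results are combined, so this sketch assumes each component has a grounding confidence in [0, 1] and multiplies them, which means an ungrounded component drives the image's score to zero. The component names, confidences, and the functions `situation_score` and `rank` are hypothetical.

```python
from math import prod


def situation_score(groundings, expected):
    """Combine per-component grounding confidences into one image score.

    groundings: dict mapping component name -> confidence in [0, 1]
    expected:   component names the query situation calls for
    A component with no grounding contributes 0.
    """
    return prod(groundings.get(name, 0.0) for name in expected)


def rank(images, expected):
    """Return image ids sorted from most to least likely to show the situation."""
    return sorted(images, key=lambda i: situation_score(images[i], expected),
                  reverse=True)


expected = ["dog", "dog-walker", "leash"]
images = {
    "img1": {"dog": 0.9, "dog-walker": 0.8, "leash": 0.7},
    "img2": {"dog": 0.95, "dog-walker": 0.9},  # no leash grounded
    "img3": {"dog": 0.6, "dog-walker": 0.7, "leash": 0.5},
}
print(rank(images, expected))  # ['img1', 'img3', 'img2']
```

A product is strict about missing components; averaging the confidences instead would rank partial matches such as `img2` higher.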
- …