Text to 3D Scene Generation with Rich Lexical Grounding
The ability to map descriptions of scenes to 3D geometric representations has
many applications in areas such as art, education, and robotics. However, prior
work on the text to 3D scene generation task has used manually specified object
categories and language that identifies them. We introduce a dataset of 3D
scenes annotated with natural language descriptions and learn from this data
how to ground textual descriptions to physical objects. Our method successfully
grounds a variety of lexical terms to concrete referents, and we show
quantitatively that our method improves 3D scene generation over previous work
using purely rule-based methods. We evaluate the fidelity and plausibility of
3D scenes generated with our grounding approach through human judgments. To
ease evaluation on this task, we also introduce an automated metric that
strongly correlates with human judgments.
Comment: 10 pages, 7 figures, 3 tables. To appear in ACL-IJCNLP 201
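The grounding step this abstract describes can be caricatured as a co-occurrence model: count how often each word in a scene description appears alongside each object category in annotated scenes, then ground a word to its most associated category. A minimal sketch, assuming toy data and a simple counting scheme; the function names and scenes below are invented for illustration and are not the paper's actual learned model.

```python
from collections import defaultdict

# Hypothetical annotated scenes: (description tokens, object categories
# present in the 3D scene). Invented stand-ins for the paper's dataset.
annotated_scenes = [
    (["a", "red", "mug", "on", "a", "desk"], ["mug", "desk"]),
    (["a", "mug", "next", "to", "a", "laptop"], ["mug", "laptop"]),
    (["a", "laptop", "on", "a", "desk"], ["laptop", "desk"]),
]

def learn_grounding(scenes):
    """Count how often each word co-occurs with each object category."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, categories in scenes:
        for word in tokens:
            for cat in categories:
                counts[word][cat] += 1
    return counts

def ground(word, counts):
    """Ground a lexical term to its most associated category, or None."""
    cats = counts.get(word)
    if not cats:
        return None
    return max(cats, key=cats.get)

counts = learn_grounding(annotated_scenes)
```

Under this toy model, `ground("mug", counts)` resolves to the `mug` category, while an unseen word like `"zebra"` has no referent and returns `None`.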
Active Object Localization in Visual Situations
We describe a method for performing active localization of objects in
instances of visual situations. A visual situation is an abstract
concept---e.g., "a boxing match", "a birthday party", "walking the dog",
"waiting for a bus"---whose image instantiations are linked more by their
common spatial and semantic structure than by low-level visual similarity. Our
system combines given and learned knowledge of the structure of a particular
situation, and adapts that knowledge to a new situation instance as it actively
searches for objects. More specifically, the system learns a set of probability
distributions describing spatial and other relationships among relevant
objects. The system uses those distributions to iteratively sample object
proposals on a test image, but also continually uses information from those
object proposals to adaptively modify the distributions based on what the
system has detected. We test our approach's ability to efficiently localize
objects, using a situation-specific image dataset created by our group. We
compare the results with several baselines and variations on our method, and
demonstrate the strong benefit of using situation knowledge and active
context-driven localization. Finally, we contrast our method with several other
approaches that use context as well as active search for object localization in
images.
Comment: 14 pages
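The active, context-driven search loop described above can be sketched as a 1-D caricature: scan coarse proposals until one object is detected, then adapt the proposal distribution for a related object using a learned spatial offset. The grid sizes, offset value, and tolerance below are invented for illustration; the paper learns full spatial and semantic distributions over real images.

```python
# Toy 1-D caricature of situation-driven active localization.
LEARNED_OFFSET = 10  # assumption: "object B sits ~10 units right of object A"

def localize(true_a, true_b, tolerance=2):
    """Coarsely scan for A, then focus proposals for B near A + offset."""
    proposals = 0
    a_loc = None
    for x in range(0, 101, 5):              # coarse situation-level prior
        proposals += 1
        if abs(x - true_a) <= tolerance:
            a_loc = x                       # A "detected"
            break
    if a_loc is None:
        return None, proposals
    center = a_loc + LEARNED_OFFSET         # adapt the distribution for B
    for dx in range(-3, 4):                 # fine grid near the adapted mean
        proposals += 1
        if abs(center + dx - true_b) <= tolerance:
            return center + dx, proposals
    return None, proposals

b_loc, n_proposals = localize(true_a=48, true_b=58)
```

The point of the sketch is the proposal count: conditioning B's search on A's detected location finds B in a dozen proposals, where an uninformed fine-grained scan of the whole range would need far more.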
Sim2Real View Invariant Visual Servoing by Recurrent Control
Humans are remarkably proficient at controlling their limbs and tools from a
wide range of viewpoints and angles, even in the presence of optical
distortions. In robotics, this ability is referred to as visual servoing:
moving a tool or end-point to a desired location using primarily visual
feedback. In this paper, we study how viewpoint-invariant visual servoing
skills can be learned automatically in a robotic manipulation scenario. To this
end, we train a deep recurrent controller that can automatically determine
which actions move the end-point of a robotic arm to a desired object. The
problem that must be solved by this controller is fundamentally ambiguous:
under severe variation in viewpoint, it may be impossible to determine the
actions in a single feedforward operation. Instead, our visual servoing system
must use its memory of past movements to understand how the actions affect the
robot motion from the current viewpoint, correcting mistakes and gradually
moving closer to the target. This ability is in stark contrast to most visual
servoing methods, which either assume known dynamics or require a calibration
phase. We show how we can learn this recurrent controller using simulated data
and a reinforcement learning objective. We then describe how the resulting
model can be transferred to a real-world robot by disentangling perception from
control and only adapting the visual layers. The adapted model can servo to
previously unseen objects from novel viewpoints on a real-world Kuka IIWA
robotic arm. For supplementary videos, see:
https://fsadeghi.github.io/Sim2RealViewInvariantServo
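Why the controller must be recurrent can be shown with a toy 1-D servoing problem: under a mirrored viewpoint the sign of the action-to-motion mapping is unknown, so no feedforward policy can pick the right action, but a policy with memory can infer the sign from how its past actions changed the observed distance. Everything below is an illustrative caricature, not the paper's deep recurrent controller.

```python
# Toy sketch: the hidden "mirror" sign flips how actions move the end-point,
# standing in for an unknown viewpoint. The recurrent state is a single
# believed action sign, corrected whenever a move increases the distance.

def recurrent_servo(start, target, mirror, steps=20):
    """mirror is +1 or -1: the hidden viewpoint-dependent action sign."""
    pos = start
    believed_sign = 1.0                    # recurrent state: estimated sign
    prev_dist = abs(target - pos)
    for _ in range(steps):
        # Act greedily under the current belief about the viewpoint.
        action = believed_sign * (1.0 if target > pos else -1.0)
        pos += mirror * action             # true dynamics apply hidden sign
        dist = abs(target - pos)
        if dist > prev_dist:
            believed_sign = -believed_sign # memory-based correction
        prev_dist = dist
        if dist == 0:
            break
    return pos
```

With `mirror=-1` the first move goes the wrong way; the controller notices the distance grew, flips its belief, and still converges to the target, which a memoryless version of the same policy never would.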
Ambient Sound Provides Supervision for Visual Learning
The sound of crashing waves, the roar of fast-moving cars -- sound conveys
important information about the objects in our surroundings. In this work, we
show that ambient sounds can be used as a supervisory signal for learning
visual models. To demonstrate this, we train a convolutional neural network to
predict a statistical summary of the sound associated with a video frame. We
show that, through this process, the network learns a representation that
conveys information about objects and scenes. We evaluate this representation
on several recognition tasks, finding that its performance is comparable to
that of other state-of-the-art unsupervised learning methods. Finally, we show
through visualizations that the network learns units that are selective to
objects that are often associated with characteristic sounds.
Comment: ECCV 201
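The supervisory signal above, predicting a statistical summary of sound from a video frame, can be reduced to its simplest form: regress a 1-D "sound summary" (say, loudness) from a 1-D frame feature. The linear model, the gradient-descent trainer, and the synthetic relation y = 2x + 1 below are stand-ins for the paper's CNN and real audio statistics.

```python
# Synthetic frame features and matching sound summaries (invented data).
frame_features = [0.0, 1.0, 2.0, 3.0]     # e.g. amount of motion in frame
sound_summaries = [1.0, 3.0, 5.0, 7.0]    # corresponding loudness targets

def mse(w, b, xs, ys):
    """Mean-squared error of the linear predictor w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit(xs, ys, lr=0.01, epochs=2000):
    """Gradient descent on MSE between predicted and true summaries."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w, b = w - lr * dw, b - lr * db
    return w, b

w, b = fit(frame_features, sound_summaries)
```

No labels appear anywhere: the sound summary itself is the target, which is the sense in which ambient audio supervises the visual model.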
Learning to Infer Graphics Programs from Hand-Drawn Images
We introduce a model that learns to convert simple hand drawings into
graphics programs written in a subset of \LaTeX. The model combines techniques
from deep learning and program synthesis. We learn a convolutional neural
network that proposes plausible drawing primitives that explain an image. These
drawing primitives are like a trace of the set of primitive commands issued by
a graphics program. We learn a model that uses program synthesis techniques to
recover a graphics program from that trace. These programs have constructs like
variable bindings, iterative loops, or simple kinds of conditionals. With a
graphics program in hand, we can correct errors made by the deep network,
measure similarity between drawings by use of similar high-level geometric
structures, and extrapolate drawings. Taken together these results are a step
towards agents that induce useful, human-readable programs from perceptual
input.
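The trace-to-program step can be caricatured in a few lines: treat the trace as a list of primitive drawing commands, detect a repeated pattern (here, circles at evenly spaced positions), and rewrite it as a loop. The command names and the "DSL" string below are invented for illustration; the paper synthesizes real \LaTeX-subset programs with richer constructs.

```python
# Toy synthesis: compress an arithmetic progression of circle positions
# into a loop; otherwise emit the trace verbatim.

def synthesize(trace):
    """trace is a list of (command, x, y) tuples from the neural proposer."""
    xs = [x for cmd, x, y in trace if cmd == "circle"]
    ys = {y for cmd, x, y in trace if cmd == "circle"}
    if len(xs) >= 3 and len(ys) == 1:
        step = xs[1] - xs[0]
        if all(b - a == step for a, b in zip(xs, xs[1:])):
            y = ys.pop()
            return f"for i in range({len(xs)}): circle({xs[0]} + {step}*i, {y})"
    # No pattern found: fall back to a straight-line program.
    return "; ".join(f"{cmd}({x}, {y})" for cmd, x, y in trace)

trace = [("circle", 0, 5), ("circle", 10, 5), ("circle", 20, 5)]
program = synthesize(trace)
```

Once the loop is recovered, the benefits the abstract lists follow naturally: the loop bound generalizes (extrapolating the drawing), and a primitive the network missed can be corrected because it violates the inferred pattern.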