TossingBot: Learning to Throw Arbitrary Objects with Residual Physics
We investigate whether a robot arm can learn to pick and throw arbitrary
objects into selected boxes quickly and accurately. Throwing has the potential
to increase the physical reachability and picking speed of a robot arm.
However, precisely throwing arbitrary objects in unstructured settings presents
many challenges: from acquiring reliable pre-throw conditions (e.g. initial
pose of object in manipulator) to handling varying object-centric properties
(e.g. mass distribution, friction, shape) and dynamics (e.g. aerodynamics). In
this work, we propose an end-to-end formulation that jointly learns to infer
control parameters for grasping and throwing motion primitives from visual
observations (images of arbitrary objects in a bin) through trial and error.
Within this formulation, we investigate the synergies between grasping and
throwing (i.e., learning grasps that enable more accurate throws) and between
simulation and deep learning (i.e., using deep networks to predict residuals on
top of control parameters predicted by a physics simulator). The resulting
system, TossingBot, is able to grasp and throw arbitrary objects into boxes
located outside its maximum reach range at 500+ mean picks per hour (600+
grasps per hour with 85% throwing accuracy); and generalizes to new objects and
target locations. Videos are available at https://tossingbot.cs.princeton.edu
Comment: Summary Video: https://youtu.be/f5Zn2Up2RjQ Project webpage: https://tossingbot.cs.princeton.edu
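The residual-physics idea is concrete enough to sketch. Below is a minimal illustration, assuming a fixed 45-degree release angle and standard projectile equations; the architecture, dimensions, and names (ResidualThrower, ballistic_speed) are hypothetical, not the paper's implementation. An analytical model supplies a release-speed estimate, and a small network learns only the correction for unmodeled effects:

    import math

    import torch
    import torch.nn as nn

    G = 9.81  # gravitational acceleration (m/s^2)

    def ballistic_speed(d, h, theta=math.pi / 4):
        # Release speed for a projectile launched at angle `theta` to land
        # at horizontal distance d and height offset h, derived from
        # x = v*cos(theta)*t and y = v*sin(theta)*t - g*t^2/2.
        denom = 2.0 * math.cos(theta) ** 2 * (d * math.tan(theta) - h)
        if denom <= 0:
            raise ValueError("target unreachable at this release angle")
        return math.sqrt(G * d * d / denom)

    class ResidualThrower(nn.Module):
        # Hypothetical residual head: the network predicts a correction on
        # top of the analytical estimate, so learning only has to account
        # for unmodeled effects (grasp offset, friction, aerodynamics).

        def __init__(self, feat_dim=128):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim + 2, 64), nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, visual_feat, d, h):
            # physics prior: one analytical speed estimate per target
            v_hat = torch.tensor([ballistic_speed(float(di), float(hi))
                                  for di, hi in zip(d, h)])
            # learned residual conditioned on visual features and the target
            x = torch.cat([visual_feat, torch.stack([d, h], dim=1)], dim=1)
            delta = self.mlp(x).squeeze(1)
            return v_hat + delta  # final release speed = physics + residual

Because the residual starts near zero, the system can fall back on the physics prior early in training and let the network absorb only what the simulator gets wrong.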
Learning with Latent Language
The named concepts and compositional operators present in natural language
provide a rich source of information about the kinds of abstractions humans use
to navigate the world. Can this linguistic background knowledge improve the
generality and efficiency of learned classifiers and control policies? This
paper aims to show that using the space of natural language strings as a
parameter space is an effective way to capture natural task structure. In a
pretraining phase, we learn a language interpretation model that transforms
inputs (e.g. images) into outputs (e.g. labels) given natural language
descriptions. To learn a new concept (e.g. a classifier), we search directly in
the space of descriptions to minimize the interpreter's loss on training
examples. Crucially, our models do not require language data to learn these
concepts: language is used only in pretraining to impose structure on
subsequent learning. Results on image classification, text editing, and
reinforcement learning show that, in all settings, models with a linguistic
parameterization outperform those without.
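The abstract's recipe admits a compact sketch. The following assumes a pretrained callable interpreter(inputs, description) -> logits and a pool of candidate description strings; these names and the search-by-enumeration loop are our illustrative assumptions, not the paper's API. The point is that the "parameter" being fit is the description itself:

    import torch.nn.functional as F

    def fit_concept(interpreter, candidates, x_train, y_train):
        # Search the space of natural-language strings instead of weight
        # space: score each candidate description by the interpreter's
        # loss on the new concept's few training examples.
        best_desc, best_loss = None, float("inf")
        for desc in candidates:  # e.g. strings proposed by a language model
            logits = interpreter(x_train, desc)
            loss = F.cross_entropy(logits, y_train).item()
            if loss < best_loss:
                best_desc, best_loss = desc, loss
        return best_desc  # the chosen description *is* the learned concept

No language data is needed for the new concept: the description pool and the interpreter come from pretraining, and only the few labeled examples of the new task drive the search.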
New ideas and trends in deep multimodal content understanding: a review
The focus of this survey is on the analysis of two modalities in multimodal deep learning: image and text. Unlike classic reviews of deep learning, where monomodal image classifiers such as VGG, ResNet, and Inception are the central topics, this paper examines recent multimodal deep models and structures, including auto-encoders, generative adversarial nets, and their variants. These models go beyond simple image classifiers in that they can perform uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering) multimodal tasks. In addition, we analyze two aspects of the challenge of better content understanding in deep multimodal applications. We then introduce current ideas and trends in deep multimodal feature learning, such as feature embedding approaches and objective function design, which are crucial in overcoming the aforementioned challenges. Finally, we include several promising directions for future research.
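One common objective-function design for joint image-text feature embedding can be made concrete with a short sketch. The symmetric contrastive loss below is a generic illustrative example under our own assumptions, not a method taken from the survey: matched pairs are pulled together, mismatched pairs pushed apart.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(img_emb, txt_emb, temperature=0.07):
        # Inputs are (batch, dim) embeddings from two modality encoders;
        # the i-th image and i-th text form the only positive pair.
        img = F.normalize(img_emb, dim=1)
        txt = F.normalize(txt_emb, dim=1)
        logits = img @ txt.t() / temperature        # cosine similarities
        labels = torch.arange(img.size(0), device=img.device)
        return (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.t(), labels)) / 2

Losses of this shape underlie much of the cross-modal retrieval work the survey covers, since the same shared embedding space serves both image-to-text and text-to-image queries.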
Your "Flamingo" is My "Bird": Fine-Grained, or Not
Whether what you see in Figure 1 is a "flamingo" or a "bird" is the question
we ask in this paper. While fine-grained visual classification (FGVC) strives
to arrive at the former, for the majority of us non-experts just "bird" would
probably suffice. The real question is therefore -- how can we tailor for
different fine-grained definitions under divergent levels of expertise? For
that, we re-envisage the traditional setting of FGVC, from single-label
classification, to that of top-down traversal of a pre-defined coarse-to-fine
label hierarchy -- so that our answer becomes
"bird"-->"Phoenicopteriformes"-->"Phoenicopteridae"-->"flamingo". To approach
this new problem, we first conduct a comprehensive human study where we confirm
that most participants prefer multi-granularity labels, regardless of whether
they consider themselves experts. We then discover a key insight:
coarse-level label prediction degrades fine-grained feature learning, yet
fine-level features improve the learning of coarse-level classifiers. This
discovery enables us to design a very simple albeit surprisingly effective
solution to our new problem, where we (i) leverage level-specific
classification heads to disentangle coarse-level features from fine-grained
ones, and (ii) allow finer-grained features to participate in coarser-grained
label predictions, which in turn helps with better disentanglement. Experiments
show that our method achieves superior performance in the new FGVC setting, and
performs better than the state of the art on the traditional single-label FGVC problem
as well. Thanks to its simplicity, our method can be easily implemented on top
of any existing FGVC frameworks and is parameter-free.
Comment: Accepted as an oral paper at CVPR 2021. Code:
https://github.com/PRIS-CV/Fine-Grained-or-No
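The two design ingredients, (i) and (ii), can be sketched in a few lines. In the sketch below, the equal slice sizes, the use of detach to keep coarse losses from corrupting finer features, and all names are our assumptions; the authors' released code is the reference implementation.

    import torch
    import torch.nn as nn

    class HierarchyHeads(nn.Module):
        # Level-specific heads over disjoint feature slices (ingredient i),
        # with finer-grained slices also feeding coarser-level predictions
        # (ingredient ii). Assumes feat_dim is divisible by the number of
        # hierarchy levels, which are ordered coarse -> fine.

        def __init__(self, feat_dim, num_classes_per_level):
            super().__init__()
            levels = len(num_classes_per_level)
            self.chunk = feat_dim // levels
            self.heads = nn.ModuleList(
                # the head at level `lvl` sees its own slice plus all finer ones
                nn.Linear(self.chunk * (levels - lvl), n_cls)
                for lvl, n_cls in enumerate(num_classes_per_level)
            )

        def forward(self, feat):
            slices = feat.split(self.chunk, dim=1)   # one slice per level
            logits = []
            for lvl, head in enumerate(self.heads):
                # detach finer slices so coarse losses cannot corrupt them,
                # while fine information still aids coarse prediction
                parts = [slices[lvl]] + [s.detach() for s in slices[lvl + 1:]]
                logits.append(head(torch.cat(parts, dim=1)))
            return logits  # one logits tensor per hierarchy level

Heads are ordered coarse to fine, so on a bird hierarchy the first head would predict the order and the last the species, with each coarse head drawing on the finer slices without backpropagating into them.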