2,261 research outputs found
Synthesizing Training Data for Object Detection in Indoor Scenes
Detection of objects in cluttered indoor environments is one of the key
enabling functionalities for service robots. The best performing object
detection approaches in computer vision exploit deep Convolutional Neural
Networks (CNN) to simultaneously detect and categorize the objects of interest
in cluttered scenes. Training of such models typically requires large amounts
of annotated training data which is time consuming and costly to obtain. In
this work we explore the ability of using synthetically generated composite
images for training state-of-the-art object detectors, especially for object
instance detection. We superimpose 2D images of textured object models into
images of real environments at variety of locations and scales. Our experiments
evaluate different superimposition strategies ranging from purely image-based
blending all the way to depth and semantics informed positioning of the object
models into real scenes. We demonstrate the effectiveness of these object
detector training strategies on two publicly available datasets, the
GMU-Kitchens and the Washington RGB-D Scenes v2. As one observation, augmenting
some hand-labeled training data with synthetic examples carefully composed onto
scenes yields object detectors with comparable performance to using much more
hand-labeled data. Broadly, this work charts new opportunities for training
detectors for new objects by exploiting existing object model repositories in
either a purely automatic fashion or with only a very small number of
human-annotated examples.Comment: Added more experiments and link to project webpag
Object segmentation in depth maps with one user click and a synthetically trained fully convolutional network
With more and more household objects built on planned obsolescence and
consumed by a fast-growing population, hazardous waste recycling has become a
critical challenge. Given the large variability of household waste, current
recycling platforms mostly rely on human operators to analyze the scene,
typically composed of many object instances piled up in bulk. Helping them by
robotizing the unitary extraction is a key challenge to speed up this tedious
process. Whereas supervised deep learning has proven very efficient for such
object-level scene understanding, e.g., generic object detection and
segmentation in everyday scenes, it however requires large sets of per-pixel
labeled images, that are hardly available for numerous application contexts,
including industrial robotics. We thus propose a step towards a practical
interactive application for generating an object-oriented robotic grasp,
requiring as inputs only one depth map of the scene and one user click on the
next object to extract. More precisely, we address in this paper the middle
issue of object seg-mentation in top views of piles of bulk objects given a
pixel location, namely seed, provided interactively by a human operator. We
propose a twofold framework for generating edge-driven instance segments.
First, we repurpose a state-of-the-art fully convolutional object contour
detector for seed-based instance segmentation by introducing the notion of
edge-mask duality with a novel patch-free and contour-oriented loss function.
Second, we train one model using only synthetic scenes, instead of manually
labeled training data. Our experimental results show that considering edge-mask
duality for training an encoder-decoder network, as we suggest, outperforms a
state-of-the-art patch-based network in the present application context.Comment: This is a pre-print of an article published in Human Friendly
Robotics, 10th International Workshop, Springer Proceedings in Advanced
Robotics, vol 7. The final authenticated version is available online at:
https://doi.org/10.1007/978-3-319-89327-3\_16, Springer Proceedings in
Advanced Robotics, Siciliano Bruno, Khatib Oussama, In press, Human Friendly
Robotics, 10th International Workshop,
Hierarchy Composition GAN for High-fidelity Image Synthesis
Despite the rapid progress of generative adversarial networks (GANs) in image
synthesis in recent years, the existing image synthesis approaches work in
either geometry domain or appearance domain alone which often introduces
various synthesis artifacts. This paper presents an innovative Hierarchical
Composition GAN (HIC-GAN) that incorporates image synthesis in geometry and
appearance domains into an end-to-end trainable network and achieves superior
synthesis realism in both domains simultaneously. We design an innovative
hierarchical composition mechanism that is capable of learning realistic
composition geometry and handling occlusions while multiple foreground objects
are involved in image composition. In addition, we introduce a novel attention
mask mechanism that guides to adapt the appearance of foreground objects which
also helps to provide better training reference for learning in geometry
domain. Extensive experiments on scene text image synthesis, portrait editing
and indoor rendering tasks show that the proposed HIC-GAN achieves superior
synthesis performance qualitatively and quantitatively.Comment: 11 pages, 8 figure
DeepContext: Context-Encoding Neural Pathways for 3D Holistic Scene Understanding
While deep neural networks have led to human-level performance on computer
vision tasks, they have yet to demonstrate similar gains for holistic scene
understanding. In particular, 3D context has been shown to be an extremely
important cue for scene understanding - yet very little research has been done
on integrating context information with deep models. This paper presents an
approach to embed 3D context into the topology of a neural network trained to
perform holistic scene understanding. Given a depth image depicting a 3D scene,
our network aligns the observed scene with a predefined 3D scene template, and
then reasons about the existence and location of each object within the scene
template. In doing so, our model recognizes multiple objects in a single
forward pass of a 3D convolutional neural network, capturing both global scene
and local object information simultaneously. To create training data for this
3D network, we generate partly hallucinated depth images which are rendered by
replacing real objects with a repository of CAD models of the same object
category. Extensive experiments demonstrate the effectiveness of our algorithm
compared to the state-of-the-arts. Source code and data are available at
http://deepcontext.cs.princeton.edu.Comment: Accepted by ICCV201
Augmented Reality Meets Computer Vision : Efficient Data Generation for Urban Driving Scenes
The success of deep learning in computer vision is based on availability of
large annotated datasets. To lower the need for hand labeled images, virtually
rendered 3D worlds have recently gained popularity. Creating realistic 3D
content is challenging on its own and requires significant human effort. In
this work, we propose an alternative paradigm which combines real and synthetic
data for learning semantic instance segmentation and object detection models.
Exploiting the fact that not all aspects of the scene are equally important for
this task, we propose to augment real-world imagery with virtual objects of the
target category. Capturing real-world images at large scale is easy and cheap,
and directly provides real background appearances without the need for creating
complex 3D models of the environment. We present an efficient procedure to
augment real images with virtual objects. This allows us to create realistic
composite images which exhibit both realistic background appearance and a large
number of complex object arrangements. In contrast to modeling complete 3D
environments, our augmentation approach requires only a few user interactions
in combination with 3D shapes of the target object. Through extensive
experimentation, we conclude the right set of parameters to produce augmented
data which can maximally enhance the performance of instance segmentation
models. Further, we demonstrate the utility of our approach on training
standard deep models for semantic instance segmentation and object detection of
cars in outdoor driving scenes. We test the models trained on our augmented
data on the KITTI 2015 dataset, which we have annotated with pixel-accurate
ground truth, and on Cityscapes dataset. Our experiments demonstrate that
models trained on augmented imagery generalize better than those trained on
synthetic data or models trained on limited amount of annotated real data
- …