Plan3D: Viewpoint and Trajectory Optimization for Aerial Multi-View Stereo Reconstruction
We introduce a new method that efficiently computes a set of viewpoints and
trajectories for high-quality 3D reconstructions in outdoor environments. Our
goal is to automatically explore an unknown area, and obtain a complete 3D scan
of a region of interest (e.g., a large building). Images from a commodity RGB
camera, mounted on an autonomously navigated quadcopter, are fed into a
multi-view stereo reconstruction pipeline that produces high-quality results
but is computationally expensive. In this setting, the scanning result is
constrained by the restricted flight time of quadcopters. To respect this
constraint, we introduce a novel optimization strategy that maximizes the
information gain from sparsely sampled viewpoints while
limiting the total travel distance of the quadcopter. At the core of our method
lies a hierarchical volumetric representation that allows the algorithm to
distinguish between unknown, free, and occupied space. Furthermore, our
information gain based formulation leverages this representation to handle
occlusions in an efficient manner. In addition to the surface geometry, we
utilize the free-space information to avoid obstacles and determine
collision-free flight paths. Our tool can be used to specify the region of
interest and to plan trajectories. We demonstrate our method by obtaining a
number of compelling 3D reconstructions, and provide a thorough quantitative
evaluation showing improvement over previous state-of-the-art and regular
patterns.
Comment: 31 pages, 12 figures, 9 tables
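
The abstract leaves the optimization details to the paper; as a minimal sketch of the underlying idea of trading information gain against travel distance, the following Python example greedily selects viewpoints from a candidate set over a toy occupancy grid. The gain model (counting nearby unknown voxels), the budget, and all names are simplifying assumptions for illustration, not the authors' occlusion-aware formulation.

# Greedy viewpoint selection under a travel-distance budget (illustrative only).
import numpy as np

UNKNOWN, FREE, OCCUPIED = 0, 1, 2

def information_gain(grid, centers, viewpoint, radius=3.0):
    """Count unknown voxels within `radius` of a viewpoint (a crude stand-in
    for an occlusion-aware gain computed on a hierarchical volume)."""
    d = np.linalg.norm(centers - viewpoint, axis=1)
    return int(np.sum((grid.ravel() == UNKNOWN) & (d <= radius)))

def plan(candidates, grid, centers, budget):
    """Pick viewpoints that maximize gain per unit of travel until the budget is spent."""
    pos = np.zeros(3)                       # start at the origin
    remaining, trajectory = budget, []
    unchosen = list(range(len(candidates)))
    while unchosen:
        scores = []
        for i in unchosen:
            travel = np.linalg.norm(candidates[i] - pos)
            if travel > remaining or travel == 0.0:
                scores.append(-np.inf)      # unreachable within the budget
            else:
                scores.append(information_gain(grid, centers, candidates[i]) / travel)
        best_idx = int(np.argmax(scores))
        if not np.isfinite(scores[best_idx]):
            break                           # no affordable viewpoint left
        best = unchosen[best_idx]
        remaining -= np.linalg.norm(candidates[best] - pos)
        trajectory.append(candidates[best])
        pos = candidates[best]
        unchosen.remove(best)
    return trajectory

# Toy 8x8x8 volume (all unknown) and four candidate viewpoints.
grid = np.full((8, 8, 8), UNKNOWN)
centers = np.stack(np.meshgrid(*[np.arange(8)] * 3, indexing="ij"), -1).reshape(-1, 3).astype(float)
candidates = np.array([[0, 0, 4], [7, 7, 4], [0, 7, 4], [7, 0, 4]], float)
print(plan(candidates, grid, centers, budget=25.0))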
UnrealText: Synthesizing Realistic Scene Text Images from the Unreal World
Synthetic data has been a critical tool for training scene text detection and
recognition models. On the one hand, synthetic word images have proven to be a
successful substitute for real images in training scene text recognizers. On
the other hand, however, scene text detectors still heavily rely on a large
amount of manually annotated real-world images, which are expensive. In this
paper, we introduce UnrealText, an efficient image synthesis method that
renders realistic images via a 3D graphics engine. The 3D engine provides
realistic appearance by rendering scene and text as a whole, and allows for
better text region proposals with access to precise scene information, e.g.,
surface normals and even object meshes. Comprehensive experiments verify its
effectiveness on both scene text detection and recognition. We also generate a
multilingual version for future research into multilingual scene text detection
and recognition. Additionally, we re-annotate scene text recognition datasets
in a case-sensitive way and include punctuation marks for more comprehensive
evaluations. The code and the generated datasets are released at:
https://github.com/Jyouhou/UnrealText/ .
Comment: adding experiments with Mask-RCNN
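
To illustrate why rendering from a 3D engine makes text-region annotation essentially free, the sketch below projects the 3D corners of a text quad placed on a scene surface into the image with a pinhole camera and takes their bounding box as a ground-truth region. The intrinsics, camera pose, and quad coordinates are assumed values for the example and are not taken from the released code.

# Project 3D text-quad corners to a 2D ground-truth box (pinhole camera, illustrative).
import numpy as np

def project(points_world, R, t, K):
    """World -> camera -> pixel coordinates for an Nx3 array of points."""
    cam = R @ points_world.T + t[:, None]     # 3xN points in the camera frame
    uv = K @ cam                              # 3xN homogeneous pixel coordinates
    return (uv[:2] / uv[2]).T                 # Nx2 pixel coordinates

# Assumed intrinsics and a camera looking down the +z axis.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)

# Four corners of a text quad lying on some scene surface (assumed coordinates).
quad = np.array([[-0.4, -0.1, 3.0],
                 [ 0.4, -0.1, 3.0],
                 [ 0.4,  0.1, 3.1],
                 [-0.4,  0.1, 3.1]])

px = project(quad, R, t, K)
x0, y0 = px.min(axis=0)
x1, y1 = px.max(axis=0)
print("ground-truth box:", (x0, y0, x1, y1))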
Casting Geometric Constraints in Semantic Segmentation as Semi-Supervised Learning
We propose a simple yet effective method to learn to segment new indoor
scenes from video frames. State-of-the-art methods trained on one dataset, even
one as large as SUNRGB-D, can perform poorly when applied to images outside
that dataset because of dataset bias, a common phenomenon in computer vision.
To make semantic segmentation more useful in
practice, one can exploit geometric constraints. Our main contribution is to
show that these constraints can be cast conveniently as semi-supervised terms,
which enforce the fact that the same class should be predicted for the
projections of the same 3D location in different images. This is interesting as
we can exploit general existing techniques developed for semi-supervised
learning to efficiently incorporate the constraints. We show that this approach
can efficiently and accurately learn to segment target sequences of ScanNet and
our own target sequences using only annotations from SUNRGB-D, and geometric
relations between the video frames of target sequences.
Comment: To be presented at WACV 202
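
The core constraint, that projections of the same 3D location should receive the same class, can be written as a pairwise consistency term over matched pixels. The PyTorch sketch below penalizes disagreement between the class distributions predicted at corresponding pixels of two frames; the symmetric KL form and the variable names are assumptions for illustration rather than the exact semi-supervised term used in the paper.

# Cross-view label-consistency term between corresponding pixels (illustrative).
import torch
import torch.nn.functional as F

def consistency_loss(logits_a, logits_b, idx_a, idx_b):
    """logits_*: (C, H, W) per-frame segmentation logits.
    idx_*: (N, 2) integer pixel coordinates (row, col) of the projections of
    the same N 3D points into each frame."""
    pa = logits_a[:, idx_a[:, 0], idx_a[:, 1]].T      # (N, C) logits in frame A
    pb = logits_b[:, idx_b[:, 0], idx_b[:, 1]].T      # (N, C) logits in frame B
    log_pa, log_pb = F.log_softmax(pa, dim=1), F.log_softmax(pb, dim=1)
    # Symmetric KL divergence between the two predicted class distributions.
    kl_ab = F.kl_div(log_pb, log_pa.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_pa, log_pb.exp(), reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)

# Toy example: 13 classes, two frames, 50 matched pixel pairs.
la, lb = torch.randn(13, 64, 64), torch.randn(13, 64, 64)
ia = torch.randint(0, 64, (50, 2))
ib = torch.randint(0, 64, (50, 2))
print(consistency_loss(la, lb, ia, ib))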
Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection
A major impediment in rapidly deploying object detection models for instance
detection is the lack of large annotated datasets. For example, finding a large
labeled dataset containing instances in a particular kitchen is unlikely. Each
new environment with new instances requires expensive data collection and
annotation. In this paper, we propose a simple approach to generate large
annotated instance datasets with minimal effort. Our key insight is that
ensuring only patch-level realism provides enough training signal for current
object detector models. We automatically `cut' object instances and `paste'
them on random backgrounds. A naive way to do this produces pixel artifacts
that degrade the performance of trained models. We show how to make
detectors ignore these artifacts during training and generate data that gives
competitive performance on real data. Our method outperforms existing synthesis
approaches and when combined with real images improves relative performance by
more than 21% on benchmark datasets. In a cross-domain setting, our synthetic
data combined with just 10% real data outperforms models trained on all real
data.
Comment: To appear in ICCV 2017
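
A minimal version of the cut-and-paste step can be written as alpha blending with a feathered mask, which is one way to soften the boundary artifacts the abstract refers to. The array shapes, the Gaussian feathering, and the helper name below are assumptions for illustration, not the authors' exact blending modes.

# Paste a cut-out instance onto a background with a feathered mask (illustrative).
import numpy as np
from scipy.ndimage import gaussian_filter

def paste(background, instance, mask, top, left, sigma=2.0):
    """background: (H, W, 3), instance: (h, w, 3), mask: (h, w) in {0, 1}.
    Blurring the binary mask feathers the boundary so a detector cannot simply
    latch onto sharp pasting artifacts."""
    h, w = mask.shape
    alpha = gaussian_filter(mask.astype(float), sigma)[..., None]   # (h, w, 1)
    out = background.astype(float).copy()
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * instance + (1.0 - alpha) * region
    box = (left, top, left + w, top + h)          # the annotation comes for free
    return out.astype(np.uint8), box

bg = np.full((240, 320, 3), 127, np.uint8)
obj = np.random.randint(0, 255, (60, 80, 3), np.uint8)
msk = np.zeros((60, 80))
msk[10:50, 15:65] = 1
img, box = paste(bg, obj, msk, top=100, left=120)
print(img.shape, box)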
Learn-to-Score: Efficient 3D Scene Exploration by Predicting View Utility
Camera-equipped drones are nowadays used to explore large scenes and
reconstruct detailed 3D maps. When free space in the scene is approximately
known, an offline planner can generate optimal plans to explore the scene
efficiently. However, for exploring unknown scenes, the planner must predict
the usefulness of candidate viewpoints on the fly and maximize it.
Traditionally, this has been
achieved using handcrafted utility functions. We propose to learn a better
utility function that predicts the usefulness of future viewpoints. Our learned
utility function is based on a 3D convolutional neural network. This network
takes as input a novel volumetric scene representation that implicitly captures
previously visited viewpoints and generalizes to new scenes. We evaluate our
method on several large 3D models of urban scenes using simulated depth
cameras. We show that our method outperforms existing utility measures in terms
of reconstruction performance and is robust to sensor noise.
Comment: 16 pages, 7 figures, 5 tables
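
As a toy stand-in for a learned utility function, the PyTorch sketch below maps a multi-channel local occupancy volume around a candidate viewpoint to a scalar score with a small 3D CNN. The channel layout, grid size, and layer widths are assumptions for illustration, not the network from the paper.

# A tiny 3D CNN that scores a candidate viewpoint from a local volumetric crop (illustrative).
import torch
import torch.nn as nn

class ViewUtilityNet(nn.Module):
    def __init__(self, in_channels=2):            # e.g. occupancy + observation counts
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, 1)              # predicted utility of the viewpoint

    def forward(self, volume):                    # volume: (B, C, D, H, W)
        x = self.features(volume).flatten(1)
        return self.head(x).squeeze(1)

net = ViewUtilityNet()
crops = torch.rand(4, 2, 32, 32, 32)              # four candidate viewpoints
print(net(crops).shape)                           # torch.Size([4])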
SilhoNet: An RGB Method for 6D Object Pose Estimation
Autonomous robot manipulation involves estimating the translation and
orientation of the object to be manipulated as a 6-degree-of-freedom (6D) pose.
Methods using RGB-D data have shown great success in solving this problem.
However, there are situations where cost constraints or the working environment
may limit the use of RGB-D sensors. When limited to monocular camera data only,
the problem of object pose estimation is very challenging. In this work, we
introduce a novel method called SilhoNet that predicts 6D object pose from
monocular images. We use a Convolutional Neural Network (CNN) pipeline that
takes in Region of Interest (ROI) proposals to simultaneously predict an
intermediate silhouette representation for objects with an associated occlusion
mask and a 3D translation vector. The 3D orientation is then regressed from the
predicted silhouettes. We show that our method achieves better overall
performance on the YCB-Video dataset than two state-of-the-art networks for 6D
pose estimation from monocular image input.
Comment: 8 pages, 3 figures
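
The two-stage structure described in the abstract, predicting a silhouette from an ROI and then regressing orientation from that silhouette, can be sketched with two small PyTorch modules. The layer choices and the unit-quaternion output below are assumptions for illustration, not the published SilhoNet architecture.

# Two-stage sketch: ROI -> silhouette + occlusion mask, silhouette -> 3D orientation (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SilhouettePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.out = nn.Conv2d(32, 2, 1)            # channel 0: silhouette, channel 1: occlusion mask

    def forward(self, roi):                        # roi: (B, 3, H, W)
        return torch.sigmoid(self.out(self.encoder(roi)))

class OrientationRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4),
        )

    def forward(self, silhouette):                 # (B, 1, H, W) -> unit quaternion (B, 4)
        return F.normalize(self.net(silhouette), dim=1)

rois = torch.rand(2, 3, 64, 64)
sil = SilhouettePredictor()(rois)
quat = OrientationRegressor()(sil[:, :1])          # regress orientation from the silhouette channel
print(sil.shape, quat.shape)                        # (2, 2, 64, 64) (2, 4)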
3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction
Inspired by the recent success of methods that employ shape priors to achieve
robust 3D reconstructions, we propose a novel recurrent neural network
architecture that we call the 3D Recurrent Reconstruction Neural Network
(3D-R2N2). The network learns a mapping from images of objects to their
underlying 3D shapes from a large collection of synthetic data. Our network
takes in one or more images of an object instance from arbitrary viewpoints and
outputs a reconstruction of the object in the form of a 3D occupancy grid.
Unlike most of the previous works, our network does not require any image
annotations or object class labels for training or testing. Our extensive
experimental analysis shows that our reconstruction framework i) outperforms
the state-of-the-art methods for single view reconstruction, and ii) enables
the 3D reconstruction of objects in situations when traditional SFM/SLAM
methods fail (because of lack of texture and/or wide baseline).
Comment: Appendix can be found at http://cvgl.stanford.edu/papers/choy_16_appendix.pd
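
The recurrent single-/multi-view fusion idea can be sketched as a GRU whose hidden state is a flattened 3D feature grid, updated once per input view and decoded into occupancy probabilities. The layer sizes and encoder below are placeholders for illustration and do not reproduce the 3D-R2N2 architecture.

# Recurrent fusion of per-view features into a 3D occupancy grid (illustrative).
import torch
import torch.nn as nn

class RecurrentReconstructor(nn.Module):
    def __init__(self, feat=32, grid=4):
        super().__init__()
        self.grid = grid
        self.encoder = nn.Sequential(                        # image -> flat feature vector
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat),
        )
        self.gru = nn.GRUCell(feat, feat * grid ** 3)        # hidden state is a flattened 3D grid
        self.decoder = nn.Sequential(                         # grid features -> occupancy logits
            nn.Conv3d(feat, 16, 3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 1, 1),
        )

    def forward(self, views):                                 # views: (B, T, 3, H, W)
        B, T = views.shape[:2]
        h = views.new_zeros(B, self.gru.hidden_size)
        for t in range(T):                                    # fuse an arbitrary number of views
            h = self.gru(self.encoder(views[:, t]), h)
        vol = h.view(B, -1, self.grid, self.grid, self.grid)
        return torch.sigmoid(self.decoder(vol))               # (B, 1, g, g, g) occupancy

model = RecurrentReconstructor()
print(model(torch.rand(2, 5, 3, 64, 64)).shape)               # torch.Size([2, 1, 4, 4, 4])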
SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again
We present a novel method for detecting 3D model instances and estimating
their 6D poses from RGB data in a single shot. To this end, we extend the
popular SSD paradigm to cover the full 6D pose space and train on synthetic
model data only. Our approach competes with or surpasses current state-of-the-art
methods that leverage RGB-D data on multiple challenging datasets. Furthermore,
our method produces these results at around 10Hz, which is many times faster
than the related methods. For the sake of reproducibility, we make our trained
networks and detection code publicly available.
Comment: The first two authors contributed equally to this work
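
Covering the 6D pose space within an SSD-style detector amounts to adding classification bins for discretized viewpoints and in-plane rotations to each anchor's predictions. The PyTorch sketch below adds such heads on top of a single feature map; the bin counts, anchor count, and backbone are placeholders for illustration, not the published network.

# SSD-style heads extended with discretized viewpoint / in-plane-rotation bins (illustrative).
import torch
import torch.nn as nn

class Pose6DHead(nn.Module):
    def __init__(self, in_ch=64, anchors=4, classes=21, views=42, inplane=18):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, anchors * classes, 3, padding=1)    # object class per anchor
        self.box = nn.Conv2d(in_ch, anchors * 4, 3, padding=1)          # 2D box regression
        self.view = nn.Conv2d(in_ch, anchors * views, 3, padding=1)     # discretized viewpoint bin
        self.rot = nn.Conv2d(in_ch, anchors * inplane, 3, padding=1)    # in-plane rotation bin

    def forward(self, fmap):                        # fmap: (B, in_ch, H, W)
        return {name: head(fmap) for name, head in
                [("cls", self.cls), ("box", self.box), ("view", self.view), ("rot", self.rot)]}

head = Pose6DHead()
out = head(torch.rand(1, 64, 19, 19))
print({k: tuple(v.shape) for k, v in out.items()})
# The predicted viewpoint and in-plane bins, together with the 2D box, are then
# lifted to a full 6D pose (e.g. by recovering depth from the projected box size).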
NRMVS: Non-Rigid Multi-View Stereo
Scene reconstruction from unorganized RGB images is an important task in many
computer vision applications. Multi-view Stereo (MVS) is a common solution in
photogrammetry applications for the dense reconstruction of a static scene. The
static scene assumption, however, limits the general applicability of MVS
algorithms, as many day-to-day scenes undergo non-rigid motion, e.g., clothes,
faces, or human bodies. In this paper, we open up a new challenging direction:
dense 3D reconstruction of scenes with non-rigid changes observed from
arbitrary, sparse, and wide-baseline views. We formulate the problem as a joint
optimization of deformation and depth estimation, using deformation graphs as
the underlying representation. We propose a new sparse 3D to 2D matching
technique, together with a dense patch-match evaluation scheme to estimate
deformation and depth with photometric consistency. We show that creating a
dense 4D structure from a few RGB images with non-rigid changes is possible,
and demonstrate that our method can be used to interpolate novel deformed
scenes from various combinations of these deformation estimates derived from
the sparse views.
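
As a small illustration of the deformation-graph representation the method builds on, the sketch below warps points by blending per-node rigid transforms with distance-based skinning weights. The node placement, Gaussian weighting, and toy data are assumptions for illustration, not the paper's joint deformation-and-depth optimization.

# Embedded-deformation warp: blend per-node rigid transforms with skinning weights (illustrative).
import numpy as np

def warp(points, nodes, rotations, translations, sigma=0.5):
    """points: (N, 3), nodes: (M, 3) graph node positions,
    rotations: (M, 3, 3), translations: (M, 3)."""
    d2 = ((points[:, None, :] - nodes[None, :, :]) ** 2).sum(-1)      # (N, M) squared distances
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)                                  # normalized skinning weights
    # Each node deforms the point about itself; the per-node results are then blended.
    local = points[:, None, :] - nodes[None, :, :]                     # (N, M, 3)
    deformed = np.einsum("mij,nmj->nmi", rotations, local) + nodes + translations
    return (w[..., None] * deformed).sum(axis=1)                       # (N, 3) warped points

pts = np.random.rand(5, 3)
nodes = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
R = np.stack([np.eye(3), np.eye(3)])
t = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
print(warp(pts, nodes, R, t))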
Learning Where to Look: Data-Driven Viewpoint Set Selection for 3D Scenes
The use of rendered images, whether from completely synthetic datasets or
from 3D reconstructions, is increasingly prevalent in vision tasks. However,
little attention has been given to how the selection of viewpoints affects the
performance of rendered training sets. In this paper, we propose a data-driven
approach to view set selection. Given a set of example images, we extract
statistics describing their contents and generate a set of views matching the
distribution of those statistics. Motivated by semantic segmentation tasks, we
model the spatial distribution of each semantic object category within an image
view volume. We provide a search algorithm that generates a sampling of likely
candidate views according to the example distribution, and a set selection
algorithm that chooses a subset of the candidates that jointly cover the
example distribution. Results of experiments with these algorithms on SUNCG
indicate that they are indeed able to produce view distributions similar to an
example set from NYUDv2 according to the earth mover's distance. Furthermore,
the selected views improve performance on semantic segmentation compared to
alternative view selection algorithms.
Comment: ICCV submission, combined main paper and supplemental material
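
A simplified version of the set selection step can be written as a greedy search that picks views whose accumulated category statistics best match the example distribution. In the sketch below an L1 histogram distance stands in for the earth mover's distance used in the paper's evaluation, and the candidate statistics are synthetic.

# Greedy view-set selection matching an example category distribution (illustrative).
import numpy as np

def select_views(candidate_stats, example_hist, k):
    """candidate_stats: (V, C) per-view category pixel fractions,
    example_hist: (C,) target category distribution from example images.
    Greedily picks k views whose averaged statistics best match the target."""
    chosen, total = [], np.zeros_like(example_hist)
    for _ in range(k):
        best, best_d = None, np.inf
        for v in range(len(candidate_stats)):
            if v in chosen:
                continue
            acc = total + candidate_stats[v]
            d = np.abs(acc / acc.sum() - example_hist).sum()           # L1 distance to target
            if d < best_d:
                best, best_d = v, d
        chosen.append(best)
        total += candidate_stats[best]
    return chosen

rng = np.random.default_rng(0)
cand = rng.dirichlet(np.ones(6), size=40)          # 40 candidate views, 6 semantic categories
target = np.array([0.3, 0.2, 0.2, 0.1, 0.1, 0.1])  # e.g. statistics measured on example images
print(select_views(cand, target, k=5))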