30,996 research outputs found
Fusion Based Holistic Road Scene Understanding
This paper addresses the problem of holistic road scene understanding based
on the integration of visual and range data. To this end, we
propose an approach that jointly tackles object-level image segmentation and
semantic region labeling within a conditional random field (CRF) framework.
Specifically, we first generate semantic object hypotheses by clustering 3D
points, learning their prior appearance models, and using a deep learning
method to reason about their semantic categories. The learned priors, together
with spatial and geometric contexts, are incorporated into the CRF. With this
formulation, visual and range data are fused thoroughly, and moreover, the
coupled segmentation and semantic labeling problem can be inferred via Graph
Cuts. Our approach is validated on the challenging KITTI dataset that contains
diverse and complicated road scenarios. Both quantitative and qualitative
evaluations demonstrate its effectiveness. Comment: 14 pages, 11 figures.
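The CRF-plus-graph-cuts formulation can be illustrated with a minimal sketch: a binary (figure/ground) pairwise energy over grid sites, where the unary costs stand in for the learned appearance and geometric priors and a Potts term for spatial smoothness, minimized exactly by an s-t min-cut. The grid size, cost values, and library choice below are illustrative assumptions, not the paper's actual model or solver.

```python
# Minimal sketch: a binary pairwise MRF solved exactly with an s-t min-cut.
# Unary costs play the role of the learned priors; the Potts term plays the
# role of the spatial smoothness in the paper's CRF. All numbers are toy values.
import networkx as nx
import numpy as np

H, W = 2, 3
unary = np.random.rand(H, W, 2)   # unary[y, x, l] = cost of assigning label l
potts = 0.5                       # penalty when neighbouring labels disagree

G = nx.DiGraph()
s, t = "source", "sink"
for y in range(H):
    for x in range(W):
        node = (y, x)
        # Edge (s, node) is cut iff node takes label 1, paying unary[..., 1];
        # edge (node, t) is cut iff node takes label 0, paying unary[..., 0].
        G.add_edge(s, node, capacity=float(unary[y, x, 1]))
        G.add_edge(node, t, capacity=float(unary[y, x, 0]))
        for dy, dx in [(0, 1), (1, 0)]:   # 4-connected grid neighbours
            ny, nx_ = y + dy, x + dx
            if ny < H and nx_ < W:
                G.add_edge(node, (ny, nx_), capacity=potts)
                G.add_edge((ny, nx_), node, capacity=potts)

cut_value, (source_side, _) = nx.minimum_cut(G, s, t)
labels = np.array([[0 if (y, x) in source_side else 1 for x in range(W)]
                   for y in range(H)])
print("minimum energy:", cut_value)
print(labels)
```

Multi-label problems, as in the joint segmentation and semantic labeling above, are typically handled by iterating such binary cuts (alpha-expansion); the binary case keeps the sketch short.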
3D ShapeNets: A Deep Representation for Volumetric Shapes
3D shape is a crucial but heavily underutilized cue in today's computer
vision systems, mostly due to the lack of a good generic shape representation.
With the recent availability of inexpensive 2.5D depth sensors (e.g. Microsoft
Kinect), it is becoming increasingly important to have a powerful 3D shape
representation in the loop. Apart from category recognition, recovering full 3D
shapes from view-based 2.5D depth maps is also a critical part of visual
understanding. To this end, we propose to represent a geometric 3D shape as a
probability distribution of binary variables on a 3D voxel grid, using a
Convolutional Deep Belief Network. Our model, 3D ShapeNets, learns the
distribution of complex 3D shapes across different object categories and
arbitrary poses from raw CAD data, and discovers hierarchical compositional
part representations automatically. It naturally supports joint object
recognition and shape completion from 2.5D depth maps, and it enables active
object recognition through view planning. To train our 3D deep learning model,
we construct ModelNet -- a large-scale 3D CAD model dataset. Extensive
experiments show that our 3D deep representation enables significant
performance improvement over the state of the art in a variety of tasks. Comment: to appear in CVPR 2015.
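The volumetric input such a model consumes can be sketched in a few lines: a point cloud (sampled from a CAD mesh or back-projected from a 2.5D depth map) is rasterised into a binary occupancy grid on a 3D voxel lattice. The 30^3 resolution, padding, and helper name are illustrative assumptions, not the paper's exact preprocessing.

```python
# Sketch: rasterise a point cloud into a binary voxel occupancy grid, the kind
# of volumetric representation a Convolutional Deep Belief Network operates on.
# Resolution and padding are illustrative assumptions.
import numpy as np

def voxelize(points: np.ndarray, resolution: int = 30, pad: int = 1) -> np.ndarray:
    """points: (N, 3) array of x, y, z coordinates -> (R, R, R) {0, 1} grid."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    scale = (resolution - 2 * pad - 1) / (hi - lo).max()   # preserve aspect ratio
    idx = np.floor((points - lo) * scale).astype(int) + pad
    grid = np.zeros((resolution,) * 3, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

# Example: a sphere sampled as a point cloud.
pts = np.random.randn(5000, 3)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
occ = voxelize(pts)
print(occ.shape, int(occ.sum()), "occupied voxels")
```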
Hierarchical Surface Prediction for 3D Object Reconstruction
Recently, Convolutional Neural Networks have shown promising results for 3D
geometry prediction. They can make predictions from very little input data such
as a single color image. A major limitation of such approaches is that they
only predict a coarse resolution voxel grid, which does not capture the surface
of the objects well. We propose a general framework, called hierarchical
surface prediction (HSP), which facilitates prediction of high resolution voxel
grids. The main insight is that it is sufficient to predict high resolution
voxels around the predicted surfaces. The exterior and interior of the objects
can be represented with coarse resolution voxels. Our approach is not dependent
on a specific input type. We show results for geometry prediction from color
images, depth images and shape completion from partial voxel grids. Our
analysis shows that our high resolution predictions are more accurate than low
resolution predictions. Comment: 3DV 2017.
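The key insight can be illustrated without a network: starting from a coarse cell, a region is subdivided only if it is a boundary cell (it may contain both occupied and free space), while purely interior or exterior cells stay coarse. In the sketch below an analytic sphere stands in for the learned per-level predictions, and all names are illustrative.

```python
# Sketch of hierarchical surface prediction's core idea: refine only cells near
# the surface. A sphere of radius R stands in for the CNN's predicted occupancy;
# in HSP the free / occupied / boundary labels come from learned decoders.
import numpy as np

R = 0.35  # toy "predicted" shape: a sphere of radius R centred at the origin

def classify(center: np.ndarray, size: float) -> str:
    """Label a cubic cell as fully 'inside', fully 'outside', or 'boundary'."""
    half_diag = size * np.sqrt(3) / 2
    d = np.linalg.norm(center)
    if d + half_diag < R:
        return "inside"
    if d - half_diag > R:
        return "outside"
    return "boundary"

def refine(center, size, depth, max_depth, out):
    """Subdivide only boundary cells; keep interior/exterior cells coarse."""
    label = classify(center, size)
    if depth == max_depth or label != "boundary":
        out.append((center, size, label))
        return
    for i in (-1, 1):
        for j in (-1, 1):
            for k in (-1, 1):
                child = center + 0.25 * size * np.array([i, j, k])
                refine(child, size / 2, depth + 1, max_depth, out)

cells = []
refine(np.zeros(3), 1.0, 0, max_depth=4, out=cells)
finest = [c for c in cells if np.isclose(c[1], 1.0 / 2 ** 4)]
print(len(cells), "cells kept in total,", len(finest), "at the finest level")
```

Only a thin shell of cells around the surface reaches the finest level, which is why the memory cost grows with the surface area rather than the volume.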
Learning Where to Look: Data-Driven Viewpoint Set Selection for 3D Scenes
The use of rendered images, whether from completely synthetic datasets or
from 3D reconstructions, is increasingly prevalent in vision tasks. However,
little attention has been given to how the selection of viewpoints affects the
performance of rendered training sets. In this paper, we propose a data-driven
approach to view set selection. Given a set of example images, we extract
statistics describing their contents and generate a set of views matching the
distribution of those statistics. Motivated by semantic segmentation tasks, we
model the spatial distribution of each semantic object category within an image
view volume. We provide a search algorithm that generates a sampling of likely
candidate views according to the example distribution, and a set selection
algorithm that chooses a subset of the candidates that jointly cover the
example distribution. Results of experiments with these algorithms on SUNCG
indicate that they are indeed able to produce view distributions similar to an
example set from NYUDv2 according to the earth mover's distance. Furthermore,
the selected views improve performance on semantic segmentation compared to
alternative view selection algorithms. Comment: ICCV submission; combined main paper and supplemental material.
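A simplified version of the set-selection step can be sketched as a greedy procedure: each candidate view is summarised by a histogram over semantic categories, and views are added one at a time so that the running average histogram of the selected set moves closest to the example distribution. The plain category histograms and L1 distance below are illustrative simplifications of the paper's spatial statistics and earth mover's distance.

```python
# Sketch: greedy view-set selection. Candidate views are summarised by
# normalised category histograms; views are added greedily so the selected
# set's mean histogram approaches the example (target) distribution.
import numpy as np

rng = np.random.default_rng(0)
num_categories, num_candidates, budget = 5, 200, 20

target = rng.dirichlet(np.ones(num_categories))                  # example-set statistics
candidates = rng.dirichlet(np.ones(num_categories), size=num_candidates)

selected = []
for _ in range(budget):
    best, best_dist = None, np.inf
    for i in range(num_candidates):
        if i in selected:
            continue
        mean_hist = candidates[selected + [i]].mean(axis=0)      # stats if view i is added
        dist = np.abs(mean_hist - target).sum()                  # L1 distance to the target
        if dist < best_dist:
            best, best_dist = i, dist
    selected.append(best)

print("selected views:", selected)
print("final L1 distance to target:", round(float(best_dist), 4))
```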
3D Pose Estimation for Fine-Grained Object Categories
Existing object pose estimation datasets cover generic object types, and there
is so far no dataset for fine-grained object categories. In this work, we
introduce a new large dataset to benchmark pose estimation for fine-grained
objects, made possible by the recent availability of both 2D and 3D
fine-grained data. Specifically, we augment two popular fine-grained recognition
datasets (StanfordCars and CompCars) by finding a fine-grained 3D CAD model for
each sub-category and manually annotating each object in images with 3D pose.
We show that, with enough training data, a full perspective model with
continuous parameters can be estimated using 2D appearance information alone.
We achieve this via a framework based on Faster/Mask R-CNN. This goes beyond
previous works on category-level pose estimation, which only estimate
discrete/continuous viewpoint angles or recover rotation matrices often with
the help of key points. Furthermore, with fine-grained 3D models available, we
incorporate a dense 3D representation called the location field into the
CNN-based pose estimation framework to further improve the performance. The new
dataset is available at www.umiacs.umd.edu/~wym/3dpose.html Comment: 4th International Workshop on Recovering 6D Object Pose (ECCVW 2018).
arXiv admin note: text overlap with arXiv:1810.0926
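The "full perspective model with continuous parameters" can be illustrated by projecting a CAD model's 3D points into the image under a continuous pose (azimuth, elevation, in-plane rotation, camera distance) and intrinsics (focal length, principal point). The rotation convention, parameter names, and values below are illustrative assumptions, not the dataset's annotation format.

```python
# Sketch: full perspective projection of CAD-model points under a continuous
# pose. Rotation convention and all parameter values are illustrative.
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def project(points, azimuth, elevation, theta, distance, f=800.0, cx=320.0, cy=240.0):
    """points: (N, 3) object-space points -> (N, 2) pixel coordinates."""
    R = rot_z(theta) @ rot_x(-elevation) @ rot_z(-azimuth)   # object -> camera rotation
    cam = points @ R.T + np.array([0.0, 0.0, distance])      # translate along the optical axis
    K = np.array([[f, 0, cx], [0, f, cy], [0, 0, 1]])        # intrinsics
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]                          # perspective divide

# Example: project a unit cube's corners under one pose.
cube = np.array([[x, y, z] for x in (-.5, .5) for y in (-.5, .5) for z in (-.5, .5)])
print(project(cube, azimuth=np.deg2rad(30), elevation=np.deg2rad(10),
              theta=np.deg2rad(5), distance=4.0).round(1))
```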
Humans and deep networks largely agree on which kinds of variation make object recognition harder
View-invariant object recognition is a challenging problem, which has
attracted much attention among the psychology, neuroscience, and computer
vision communities. Humans are notoriously good at it, even if some variations
are presumably more difficult to handle than others (e.g. 3D rotations). Humans
are thought to solve the problem through hierarchical processing along the
ventral stream, which progressively extracts more and more invariant visual
features. This feed-forward architecture has inspired a new generation of
bio-inspired computer vision systems called deep convolutional neural networks
(DCNN), which are currently the best algorithms for object recognition in
natural images. Here, for the first time, we systematically compared human
feed-forward vision and DCNNs at view-invariant object recognition using the
same images and controlling for both the kinds of transformation as well as
their magnitude. We used four object categories and images were rendered from
3D computer models. In total, 89 human subjects participated in 10 experiments
in which they had to discriminate between two or four categories after rapid
presentation with backward masking. We also tested two recent DCNNs on the same
tasks. We found that humans and DCNNs largely agreed on the relative
difficulties of each kind of variation: rotation in depth is by far the hardest
transformation to handle, followed by scale, then rotation in plane, and
finally position. This suggests that humans recognize objects mainly through 2D
template matching, rather than by constructing 3D object models, and that DCNNs
are not too unreasonable models of human feed-forward vision. Also, our results
show that the variation levels in rotation in depth and scale strongly modulate
both humans' and DCNNs' recognition performances. We thus argue that these
variations should be controlled in the image datasets used in vision research.
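The reported human/DCNN agreement on relative difficulty amounts to a rank correlation over per-condition accuracies, which takes only a few lines to compute. The accuracy numbers below are invented for illustration; the actual values are in the paper.

```python
# Sketch: quantifying agreement between human and DCNN difficulty rankings with
# a Spearman rank correlation over per-transformation accuracies.
# The accuracies below are hypothetical placeholders, not the paper's data.
from scipy.stats import spearmanr

conditions = ["depth rotation", "scale", "in-plane rotation", "position"]
human_acc = [0.62, 0.74, 0.81, 0.90]   # hypothetical per-condition accuracies
dcnn_acc = [0.58, 0.71, 0.84, 0.93]

rho, p = spearmanr(human_acc, dcnn_acc)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```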
Learning a Multi-View Stereo Machine
We present a learnt system for multi-view stereopsis. In contrast to recent
learning based methods for 3D reconstruction, we leverage the underlying 3D
geometry of the problem through feature projection and unprojection along
viewing rays. By formulating these operations in a differentiable manner, we
are able to learn the system end-to-end for the task of metric 3D
reconstruction. End-to-end learning allows us to jointly reason about shape
priors while conforming to geometric constraints, enabling reconstruction from
far fewer images (even a single image) than required by classical approaches
as well as completion of unseen surfaces. We thoroughly evaluate our approach
on the ShapeNet dataset and demonstrate the benefits over classical approaches
as well as recent learning-based methods.
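The feature unprojection along viewing rays can be sketched in plain numpy: each voxel centre is projected into every view with its camera matrix, the 2D feature map is sampled at that location, and the samples are pooled into a shared 3D feature grid. Nearest-neighbour sampling and average pooling are simplifications; the paper's operation is a differentiable (bilinear) version of the same idea, and the toy cameras below are assumptions.

```python
# Sketch: unprojecting per-view image features into a 3D feature grid by
# projecting voxel centres with each camera and sampling the feature maps.
import numpy as np

def unproject(feature_maps, proj_matrices, grid_pts):
    """feature_maps: (V, H, W, C); proj_matrices: (V, 3, 4); grid_pts: (G, 3)."""
    V, H, W, C = feature_maps.shape
    G = grid_pts.shape[0]
    homog = np.concatenate([grid_pts, np.ones((G, 1))], axis=1)   # homogeneous coords
    volume = np.zeros((G, C))
    for v in range(V):
        uvw = homog @ proj_matrices[v].T
        uv = uvw[:, :2] / uvw[:, 2:3]                             # perspective divide
        u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
        v_px = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
        volume += feature_maps[v, v_px, u]                        # sample and accumulate
    return volume / V                                             # average-pool over views

# Toy usage: two identical cameras looking down the z axis from 3 units away.
K = np.array([[60.0, 0, 32], [0, 60.0, 24], [0, 0, 1]])
Rt = np.hstack([np.eye(3), np.array([[0.0], [0.0], [3.0]])])
P = K @ Rt
feats = np.random.rand(2, 48, 64, 8)
grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 8)] * 3), -1).reshape(-1, 3)
print(unproject(feats, np.stack([P, P]), grid).shape)             # (512, 8)
```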
Lifting Object Detection Datasets into 3D
While data has certainly taken the center stage in computer vision in recent
years, it can still be difficult to obtain in certain scenarios. In particular,
acquiring ground truth 3D shapes of objects pictured in 2D images remains a
challenging feat and this has hampered progress in recognition-based object
reconstruction from a single image. Here we propose to bypass previous
solutions such as 3D scanning or manual design, that scale poorly, and instead
populate object category detection datasets semi-automatically with dense,
per-object 3D reconstructions, bootstrapped from: (i) class labels, (ii) ground
truth figure-ground segmentations and (iii) a small set of keypoint
annotations. Our proposed algorithm first estimates camera viewpoint using
rigid structure-from-motion and then reconstructs object shapes by optimizing
over visual hull proposals guided by loose within-class shape similarity
assumptions. The visual hull sampling process attempts to intersect an object's
projection cone with the cones of minimal subsets of other similar objects
among those pictured from certain vantage points. We show that our method is
able to produce convincing per-object 3D reconstructions and to accurately
estimate camera viewpoints on one of the most challenging existing
object-category detection datasets, PASCAL VOC. We hope that our results will
re-stimulate interest in joint object recognition and 3D reconstruction from a
single image.
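The visual-hull step rests on classic voxel carving: a voxel survives only if its projection lands inside the foreground silhouette in every contributing view. The orthographic toy "cameras" and circular silhouettes below keep the sketch self-contained; the paper instead intersects projection cones across minimal subsets of similar object instances seen from different vantage points.

```python
# Sketch: visual-hull carving. A voxel is kept only if it projects inside the
# silhouette in every view. Three orthographic views along the axes and
# circular silhouettes are illustrative stand-ins for real cameras and masks.
import numpy as np

res = 32
axis = np.linspace(-1, 1, res)
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")

def circular_silhouette(u, v, radius=0.6):
    """Toy silhouette: foreground wherever the projected point lies in a disc."""
    return u ** 2 + v ** 2 <= radius ** 2

views = [(Y, Z), (X, Z), (X, Y)]          # orthographic projections along x, y, z
hull = np.ones((res, res, res), dtype=bool)
for u, v in views:
    hull &= circular_silhouette(u, v)     # carve away voxels outside any silhouette

print("voxels kept:", int(hull.sum()), "of", res ** 3)
```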
3D Interpreter Networks for Viewer-Centered Wireframe Modeling
Understanding 3D object structure from a single image is an important but
challenging task in computer vision, mostly due to the lack of 3D object
annotations for real images. Previous research tackled this problem by either
searching for a 3D shape that best explains 2D annotations, or training purely
on synthetic data with ground truth 3D information. In this work, we propose 3D
INterpreter Networks (3D-INN), an end-to-end trainable framework that
sequentially estimates 2D keypoint heatmaps and 3D object skeletons and poses.
Our system learns from both 2D-annotated real images and synthetic 3D data.
This is made possible mainly by two technical innovations. First, heatmaps of
2D keypoints serve as an intermediate representation to connect real and
synthetic data. 3D-INN is trained on real images to estimate 2D keypoint
heatmaps from an input image; it then predicts 3D object structure from
heatmaps using knowledge learned from synthetic 3D shapes. By doing so, 3D-INN
benefits from the variation and abundance of synthetic 3D objects, without
suffering from the domain difference between real and synthesized images, often
due to imperfect rendering. Second, we propose a Projection Layer, mapping
estimated 3D structure back to 2D. During training, it ensures that 3D-INN
predicts 3D structures whose projections are consistent with the 2D annotations
of real images. Experiments show that the proposed system performs well on both 2D
keypoint estimation and 3D structure recovery. We also demonstrate that the
recovered 3D information has wide vision applications, such as image retrieval. Comment: Journal preprint of arXiv:1604.08685 (IJCV, 2018). The first two
authors contributed equally to this work. Project page:
http://3dinterpreter.csail.mit.ed
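The role of such a projection layer can be sketched as a weak-perspective projection of the predicted 3D keypoints followed by an L2 reprojection loss against the 2D annotations; minimising this loss is what lets 3D structure be trained from 2D labels alone. The weak-perspective camera model and all names below are assumptions for illustration, not the network's exact layer.

```python
# Sketch: a projection layer. Predicted 3D keypoints are projected to 2D with a
# weak-perspective camera and compared against 2D annotations.
# Camera model and variable names are illustrative assumptions.
import numpy as np

def rotation(az, el):
    ca, sa, ce, se = np.cos(az), np.sin(az), np.cos(el), np.sin(el)
    Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])
    Rx = np.array([[1, 0, 0], [0, ce, -se], [0, se, ce]])
    return Rx @ Rz

def project(points3d, az, el, scale, trans2d):
    """Weak perspective: rotate, drop depth, then scale and translate in-plane."""
    cam = points3d @ rotation(az, el).T
    return scale * cam[:, :2] + trans2d

def reprojection_loss(points3d, pose, keypoints2d):
    az, el, scale, tx, ty = pose
    pred2d = project(points3d, az, el, scale, np.array([tx, ty]))
    return np.mean(np.sum((pred2d - keypoints2d) ** 2, axis=1))

# Toy check: 2D keypoints generated by the same camera give (near-)zero loss.
skeleton = np.random.rand(10, 3) - 0.5
pose = (0.4, 0.2, 120.0, 160.0, 120.0)
kp2d = project(skeleton, 0.4, 0.2, 120.0, np.array([160.0, 120.0]))
print(reprojection_loss(skeleton, pose, kp2d))   # ~0.0
```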
Learning Local RGB-to-CAD Correspondences for Object Pose Estimation
We consider the problem of 3D object pose estimation. While much recent work
has focused on the RGB domain, the reliance on accurately annotated images
limits their generalizability and scalability. On the other hand, the easily
available CAD models of objects are rich sources of data, providing a large
number of synthetically rendered images. In this paper, we address the key
limitation of existing methods, namely their reliance on expensive 3D pose
annotations, by proposing a new method that matches RGB images to CAD models for
object pose estimation. Our key innovations compared to existing work include removing the
need for either real-world textures for CAD models or explicit 3D pose
annotations for RGB images. We achieve this through a series of objectives that
learn how to select keypoints and enforce viewpoint and modality invariance
across RGB images and CAD model renderings. We conduct extensive experiments to
demonstrate that the proposed method can reliably estimate object pose in RGB
images, as well as generalize to object instances not seen during training. Comment: 10 pages, 6 figures, 4 tables, ICCV 2019.
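One way to picture the viewpoint- and modality-invariance objectives is a standard triplet loss: a descriptor extracted around an RGB keypoint is pulled toward the descriptor of the matching CAD-rendering keypoint and pushed away from a non-matching one. The margin, feature dimensionality, and data below are illustrative assumptions, not the paper's exact objectives.

```python
# Sketch: a triplet objective for cross-modality (RGB vs. CAD rendering) and
# cross-viewpoint descriptor invariance. Margin and dimensions are toy values.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between positive and negative squared distances."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.mean(np.maximum(d_pos - d_neg + margin, 0.0))

rng = np.random.default_rng(1)
rgb_desc = rng.normal(size=(64, 128))                      # anchors from RGB crops
render_match = rgb_desc + 0.05 * rng.normal(size=(64, 128))  # matching renderings
render_other = rng.normal(size=(64, 128))                  # non-matching renderings

print("loss:", round(float(triplet_loss(rgb_desc, render_match, render_other)), 3))
```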