30,996 research outputs found

    Fusion Based Holistic Road Scene Understanding

    Full text link
    This paper addresses the problem of holistic road scene understanding based on the integration of visual and range data. To achieve the grand goal, we propose an approach that jointly tackles object-level image segmentation and semantic region labeling within a conditional random field (CRF) framework. Specifically, we first generate semantic object hypotheses by clustering 3D points, learning their prior appearance models, and using a deep learning method for reasoning their semantic categories. The learned priors, together with spatial and geometric contexts, are incorporated in CRF. With this formulation, visual and range data are fused thoroughly, and moreover, the coupled segmentation and semantic labeling problem can be inferred via Graph Cuts. Our approach is validated on the challenging KITTI dataset that contains diverse complicated road scenarios. Both quantitative and qualitative evaluations demonstrate its effectiveness.Comment: 14 pages,11 figure

    3D ShapeNets: A Deep Representation for Volumetric Shapes

    Full text link
    3D shape is a crucial but heavily underutilized cue in today's computer vision systems, mostly due to the lack of a good generic shape representation. With the recent availability of inexpensive 2.5D depth sensors (e.g. Microsoft Kinect), it is becoming increasingly important to have a powerful 3D shape representation in the loop. Apart from category recognition, recovering full 3D shapes from view-based 2.5D depth maps is also a critical part of visual understanding. To this end, we propose to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid, using a Convolutional Deep Belief Network. Our model, 3D ShapeNets, learns the distribution of complex 3D shapes across different object categories and arbitrary poses from raw CAD data, and discovers hierarchical compositional part representations automatically. It naturally supports joint object recognition and shape completion from 2.5D depth maps, and it enables active object recognition through view planning. To train our 3D deep learning model, we construct ModelNet -- a large-scale 3D CAD model dataset. Extensive experiments show that our 3D deep representation enables significant performance improvement over the-state-of-the-arts in a variety of tasks.Comment: to be appeared in CVPR 201

    Hierarchical Surface Prediction for 3D Object Reconstruction

    Full text link
    Recently, Convolutional Neural Networks have shown promising results for 3D geometry prediction. They can make predictions from very little input data such as a single color image. A major limitation of such approaches is that they only predict a coarse resolution voxel grid, which does not capture the surface of the objects well. We propose a general framework, called hierarchical surface prediction (HSP), which facilitates prediction of high resolution voxel grids. The main insight is that it is sufficient to predict high resolution voxels around the predicted surfaces. The exterior and interior of the objects can be represented with coarse resolution voxels. Our approach is not dependent on a specific input type. We show results for geometry prediction from color images, depth images and shape completion from partial voxel grids. Our analysis shows that our high resolution predictions are more accurate than low resolution predictions.Comment: 3DV 201

    Learning Where to Look: Data-Driven Viewpoint Set Selection for 3D Scenes

    Full text link
    The use of rendered images, whether from completely synthetic datasets or from 3D reconstructions, is increasingly prevalent in vision tasks. However, little attention has been given to how the selection of viewpoints affects the performance of rendered training sets. In this paper, we propose a data-driven approach to view set selection. Given a set of example images, we extract statistics describing their contents and generate a set of views matching the distribution of those statistics. Motivated by semantic segmentation tasks, we model the spatial distribution of each semantic object category within an image view volume. We provide a search algorithm that generates a sampling of likely candidate views according to the example distribution, and a set selection algorithm that chooses a subset of the candidates that jointly cover the example distribution. Results of experiments with these algorithms on SUNCG indicate that they are indeed able to produce view distributions similar to an example set from NYUDv2 according to the earth mover's distance. Furthermore, the selected views improve performance on semantic segmentation compared to alternative view selection algorithms.Comment: ICCV submission, combined main paper and supplemental materia

    3D Pose Estimation for Fine-Grained Object Categories

    Full text link
    Existing object pose estimation datasets are related to generic object types and there is so far no dataset for fine-grained object categories. In this work, we introduce a new large dataset to benchmark pose estimation for fine-grained objects, thanks to the availability of both 2D and 3D fine-grained data recently. Specifically, we augment two popular fine-grained recognition datasets (StanfordCars and CompCars) by finding a fine-grained 3D CAD model for each sub-category and manually annotating each object in images with 3D pose. We show that, with enough training data, a full perspective model with continuous parameters can be estimated using 2D appearance information alone. We achieve this via a framework based on Faster/Mask R-CNN. This goes beyond previous works on category-level pose estimation, which only estimate discrete/continuous viewpoint angles or recover rotation matrices often with the help of key points. Furthermore, with fine-grained 3D models available, we incorporate a dense 3D representation named as location field into the CNN-based pose estimation framework to further improve the performance. The new dataset is available at www.umiacs.umd.edu/~wym/3dpose.htmlComment: 4th International Workshop on Recovering 6D Object Pose (ECCVW 2018). arXiv admin note: text overlap with arXiv:1810.0926

    Humans and deep networks largely agree on which kinds of variation make object recognition harder

    Get PDF
    View-invariant object recognition is a challenging problem, which has attracted much attention among the psychology, neuroscience, and computer vision communities. Humans are notoriously good at it, even if some variations are presumably more difficult to handle than others (e.g. 3D rotations). Humans are thought to solve the problem through hierarchical processing along the ventral stream, which progressively extracts more and more invariant visual features. This feed-forward architecture has inspired a new generation of bio-inspired computer vision systems called deep convolutional neural networks (DCNN), which are currently the best algorithms for object recognition in natural images. Here, for the first time, we systematically compared human feed-forward vision and DCNNs at view-invariant object recognition using the same images and controlling for both the kinds of transformation as well as their magnitude. We used four object categories and images were rendered from 3D computer models. In total, 89 human subjects participated in 10 experiments in which they had to discriminate between two or four categories after rapid presentation with backward masking. We also tested two recent DCNNs on the same tasks. We found that humans and DCNNs largely agreed on the relative difficulties of each kind of variation: rotation in depth is by far the hardest transformation to handle, followed by scale, then rotation in plane, and finally position. This suggests that humans recognize objects mainly through 2D template matching, rather than by constructing 3D object models, and that DCNNs are not too unreasonable models of human feed-forward vision. Also, our results show that the variation levels in rotation in depth and scale strongly modulate both humans' and DCNNs' recognition performances. We thus argue that these variations should be controlled in the image datasets used in vision research

    Learning a Multi-View Stereo Machine

    Full text link
    We present a learnt system for multi-view stereopsis. In contrast to recent learning based methods for 3D reconstruction, we leverage the underlying 3D geometry of the problem through feature projection and unprojection along viewing rays. By formulating these operations in a differentiable manner, we are able to learn the system end-to-end for the task of metric 3D reconstruction. End-to-end learning allows us to jointly reason about shape priors while conforming geometric constraints, enabling reconstruction from much fewer images (even a single image) than required by classical approaches as well as completion of unseen surfaces. We thoroughly evaluate our approach on the ShapeNet dataset and demonstrate the benefits over classical approaches as well as recent learning based methods

    Lifting Object Detection Datasets into 3D

    Full text link
    While data has certainly taken the center stage in computer vision in recent years, it can still be difficult to obtain in certain scenarios. In particular, acquiring ground truth 3D shapes of objects pictured in 2D images remains a challenging feat and this has hampered progress in recognition-based object reconstruction from a single image. Here we propose to bypass previous solutions such as 3D scanning or manual design, that scale poorly, and instead populate object category detection datasets semi-automatically with dense, per-object 3D reconstructions, bootstrapped from:(i) class labels, (ii) ground truth figure-ground segmentations and (iii) a small set of keypoint annotations. Our proposed algorithm first estimates camera viewpoint using rigid structure-from-motion and then reconstructs object shapes by optimizing over visual hull proposals guided by loose within-class shape similarity assumptions. The visual hull sampling process attempts to intersect an object's projection cone with the cones of minimal subsets of other similar objects among those pictured from certain vantage points. We show that our method is able to produce convincing per-object 3D reconstructions and to accurately estimate cameras viewpoints on one of the most challenging existing object-category detection datasets, PASCAL VOC. We hope that our results will re-stimulate interest on joint object recognition and 3D reconstruction from a single image

    3D Interpreter Networks for Viewer-Centered Wireframe Modeling

    Full text link
    Understanding 3D object structure from a single image is an important but challenging task in computer vision, mostly due to the lack of 3D object annotations to real images. Previous research tackled this problem by either searching for a 3D shape that best explains 2D annotations, or training purely on synthetic data with ground truth 3D information. In this work, we propose 3D INterpreter Networks (3D-INN), an end-to-end trainable framework that sequentially estimates 2D keypoint heatmaps and 3D object skeletons and poses. Our system learns from both 2D-annotated real images and synthetic 3D data. This is made possible mainly by two technical innovations. First, heatmaps of 2D keypoints serve as an intermediate representation to connect real and synthetic data. 3D-INN is trained on real images to estimate 2D keypoint heatmaps from an input image; it then predicts 3D object structure from heatmaps using knowledge learned from synthetic 3D shapes. By doing so, 3D-INN benefits from the variation and abundance of synthetic 3D objects, without suffering from the domain difference between real and synthesized images, often due to imperfect rendering. Second, we propose a Projection Layer, mapping estimated 3D structure back to 2D. During training, it ensures 3D-INN to predict 3D structure whose projection is consistent with the 2D annotations to real images. Experiments show that the proposed system performs well on both 2D keypoint estimation and 3D structure recovery. We also demonstrate that the recovered 3D information has wide vision applications, such as image retrieval.Comment: Journal preprint of arXiv:1604.08685 (IJCV, 2018). The first two authors contributed equally to this work. Project page: http://3dinterpreter.csail.mit.ed

    Learning Local RGB-to-CAD Correspondences for Object Pose Estimation

    Full text link
    We consider the problem of 3D object pose estimation. While much recent work has focused on the RGB domain, the reliance on accurately annotated images limits their generalizability and scalability. On the other hand, the easily available CAD models of objects are rich sources of data, providing a large number of synthetically rendered images. In this paper, we solve this key problem of existing methods requiring expensive 3D pose annotations by proposing a new method that matches RGB images to CAD models for object pose estimation. Our key innovations compared to existing work include removing the need for either real-world textures for CAD models or explicit 3D pose annotations for RGB images. We achieve this through a series of objectives that learn how to select keypoints and enforce viewpoint and modality invariance across RGB images and CAD model renderings. We conduct extensive experiments to demonstrate that the proposed method can reliably estimate object pose in RGB images, as well as generalize to object instances not seen during training.Comment: 10 pages, 6 figures, 4 tables, ICCV 201
    • …
    corecore