
    NOVEL DENSE STEREO ALGORITHMS FOR HIGH-QUALITY DEPTH ESTIMATION FROM IMAGES

    This dissertation addresses the problem of inferring scene depth from a collection of calibrated images taken from different viewpoints via stereo matching. Although it has been investigated for decades, depth from stereo remains a long-standing challenge and a popular research topic for several reasons. First, to be of practical use in real-time applications such as autonomous driving, accurate depth estimation in real time is essential and is one of the core challenges in stereo. Second, for applications such as 3D reconstruction and view synthesis, high-quality depth estimation is crucial to achieving photorealistic results; due to matching ambiguities, however, accurate dense depth estimates are difficult to obtain. Last but not least, most stereo algorithms rely on identifying corresponding points among images and only work effectively when scenes are Lambertian; for non-Lambertian surfaces, the brightness-constancy assumption no longer holds.

    This dissertation contributes three novel stereo algorithms motivated by the specific requirements and limitations imposed by different applications. In addressing high-speed depth estimation from images, we present a stereo algorithm that achieves high-quality results while maintaining real-time performance. We introduce an adaptive aggregation step in a dynamic-programming framework: matching costs are aggregated in the vertical direction using a computationally expensive weighting scheme based on color and distance proximity, and we exploit the vector-processing capability and parallelism of commodity graphics hardware to speed this process up by over two orders of magnitude. In addressing high-accuracy depth estimation, we present a stereo model that makes use of constraints from points with known depths, the Ground Control Points (GCPs) of the stereo literature. Our formulation explicitly models the influence of GCPs in a Markov Random Field, and a novel regularization prior is integrated into a global inference framework in a principled way using Bayes' rule. This probabilistic framework allows GCPs to be obtained from various modalities and provides a natural way to fuse information from multiple sensors. In addressing non-Lambertian reflectance, we introduce a new invariant for stereo correspondence that allows completely arbitrary scene reflectance (bidirectional reflectance distribution functions, BRDFs). This invariant yields a rank constraint on stereo matching when the scene is observed under several lighting configurations in which only the lighting intensity varies.
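    As a concrete illustration of the adaptive aggregation idea, the sketch below aggregates matching costs along the vertical direction with weights driven by color similarity and spatial proximity. It is a minimal CPU rendering of the general scheme, not the dissertation's GPU implementation; the function name, window size, and sigmas are illustrative assumptions.

    ```python
    import numpy as np

    def aggregate_costs_vertical(cost, image, win=7, sigma_c=10.0, sigma_d=5.0):
        """Minimal sketch of adaptive vertical cost aggregation (assumed form).

        cost  : (H, W, D) matching-cost volume over D disparity hypotheses
        image : (H, W, 3) reference color image
        Weights combine color similarity and spatial proximity, in the spirit
        of the adaptive-weight scheme described above.
        """
        H, W, D = cost.shape
        out = np.zeros_like(cost)
        r = win // 2
        for y in range(H):
            num = np.zeros((W, D))
            den = np.zeros((W, 1))
            for dy in range(-r, r + 1):
                yy = min(max(y + dy, 0), H - 1)
                # Color proximity between row yy and the center row y.
                color_dist = np.linalg.norm(image[yy] - image[y],
                                            axis=1, keepdims=True)
                w = np.exp(-color_dist / sigma_c) * np.exp(-abs(dy) / sigma_d)
                num += w * cost[yy]
                den += w
            out[y] = num / den
        return out
    ```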

    Locally Adaptive Stereo Vision Based 3D Visual Reconstruction

    Using stereo vision for 3D reconstruction and depth estimation has become a popular and promising research area, as it requires only a simple setup with passive cameras and a relatively efficient processing procedure. The work in this dissertation focuses on locally adaptive stereo vision methods and their application to different imaging setups and image scenes.

    Solder ball height and substrate coplanarity inspection is essential to detecting potential connectivity issues in semiconductor units. Current ball height and substrate coplanarity inspection tools are expensive and slow, which makes them difficult to use in a real-time manufacturing setting. This dissertation presents an automatic, stereo vision based, in-line ball height and coplanarity inspection method comprising an imaging setup and a computer vision algorithm for reliable, in-line ball height measurement. The imaging setup and calibration, ball height estimation, and substrate coplanarity calculation are presented with novel stereo vision methods. The results are evaluated in a measurement capability analysis (MCA) procedure and compared with ground truth obtained by an existing laser scanning tool and an existing confocal inspection tool; the proposed system outperforms both in terms of accuracy and stability.

    In a rectified stereo vision system, stereo matching methods can be categorized into global and local methods; local methods are more suitable for real-time processing while offering competitive accuracy. This work proposes a stereo matching method based on sparse locally adaptive cost aggregation. To reduce outlier disparity values that correspond to mismatches, a novel sparse disparity subset selection method is proposed that assigns a significance status to candidate disparity values and selects the significant values adaptively. An adaptive guided filtering method using the disparity subset for refined cost aggregation and disparity calculation is demonstrated. The proposed stereo matching algorithm is tested on the Middlebury and KITTI stereo evaluation benchmark images, and a performance analysis in terms of the ℓ0 norm of the disparity subset demonstrates the achieved efficiency and accuracy.
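    The following sketch shows one plausible form of the sparse disparity-subset idea: a candidate disparity is marked significant when its cost is close to the per-pixel minimum, and winner-take-all is restricted to that subset. The significance rule and margin are assumptions for illustration, not the thesis's actual criterion.

    ```python
    import numpy as np

    def select_significant_disparities(cost, margin=0.1):
        """Hedged sketch of sparse disparity-subset selection.

        cost : (H, W, D) matching-cost volume; returns a boolean (H, W, D)
        mask of 'significant' candidates. mask.sum(axis=2) is the per-pixel
        l0 measure analyzed in the abstract.
        """
        cmin = cost.min(axis=2, keepdims=True)
        crange = cost.max(axis=2, keepdims=True) - cmin + 1e-9
        return cost <= cmin + margin * crange

    def sparse_wta(cost, mask):
        """Winner-take-all disparity restricted to the significant subset."""
        masked = np.where(mask, cost, np.inf)
        return masked.argmin(axis=2)
    ```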

    A real-time low-cost vision sensor for robotic bin picking

    This thesis presents an integrated vision sensor for bin picking. The vision system consists of three major components. The first is a bifocal range sensor that estimates depth by measuring the relative blurring between two images captured with different focal settings. A key element in the success of this approach is that it overcomes some of the limitations of related implementations, and the experimental results indicate that the precision offered by the sensor is sufficient for a large variety of industrial applications. The second component is an edge-based segmentation technique applied to detect the boundaries of the objects that define the scene; an important issue here is minimising errors in the edge-detected output, which is carried out by analysing the information associated with singular edge points. The last component addresses object recognition and pose estimation using the output of the segmentation algorithm: the recognition stage matches primitives derived from the scene regions, while pose estimation uses an appearance-based approach augmented with range data analysis. The developed system is suitable for real-time operation, and its validity has been demonstrated under varying real-world scenes.
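    To make the bifocal idea concrete, here is a hedged sketch of estimating relative blur by comparing local high-frequency energy between two registered images taken with different focal settings. The thesis's estimator is more elaborate; the energy measure and window size below are illustrative assumptions.

    ```python
    import numpy as np
    from scipy.ndimage import laplace, uniform_filter

    def relative_blur_map(img_near, img_far, win=15):
        """Illustrative depth-from-defocus cue via relative blurring.

        img_near, img_far : registered grayscale images, focused near/far.
        Regions that are locally sharper in img_near (ratio > 1) lie closer
        to the near focal plane.
        """
        e_near = uniform_filter(laplace(img_near.astype(float)) ** 2, win)
        e_far = uniform_filter(laplace(img_far.astype(float)) ** 2, win)
        return e_near / (e_far + 1e-9)
    ```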

    Learning Deep Object Detectors from 3D Models

    Crowdsourced 3D CAD models are becoming easily accessible online and can potentially generate an infinite number of training images for almost any object category. We show that augmenting the training data of contemporary Deep Convolutional Neural Net (DCNN) models with such synthetic data can be effective, especially when real training data is limited or not well matched to the target domain. Most freely available CAD models capture 3D shape but often lack other low-level cues, such as realistic object texture, pose, or background. In a detailed analysis, we use synthetic CAD-rendered images to probe the ability of a DCNN to learn without these cues, with surprising findings. In particular, we show that when the DCNN is fine-tuned on the target detection task, it exhibits a large degree of invariance to missing low-level cues, but when pretrained on generic ImageNet classification, it learns better when the low-level cues are simulated. Our synthetic DCNN training approach significantly outperforms previous methods on the PASCAL VOC2007 dataset in the few-shot scenario and improves performance in a domain-shift scenario on the Office benchmark.
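    A minimal sketch of the augmentation recipe, assuming two pre-rendered pools of labeled images: each batch mixes synthetic CAD renderings with the limited real data at a tunable ratio. The sampling scheme and ratio are illustrative, not the paper's protocol.

    ```python
    import random

    def mixed_batches(real_items, synth_items, batch_size=32, synth_frac=0.5):
        """Yield batches mixing synthetic and real training samples.

        Assumes both pools hold at least one batch's worth of items; the
        50/50 default is a placeholder, not a value from the paper.
        """
        n_synth = int(batch_size * synth_frac)
        while True:
            batch = random.sample(synth_items, n_synth) \
                  + random.sample(real_items, batch_size - n_synth)
            random.shuffle(batch)  # avoid ordering bias within the batch
            yield batch
    ```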

    Depth Enhancement and Surface Reconstruction with RGB/D Sequence

    Surface reconstruction and 3D modeling is a challenging task that has been explored for decades by the computer vision, computer graphics, and machine learning communities. It is fundamental to many applications, such as robot navigation, animation and scene understanding, industrial control, and medical diagnosis. In this dissertation, I take advantage of consumer depth sensors for surface reconstruction. Given their limited ability to capture detailed surface geometry, I first propose a depth enhancement approach that recovers fine geometric details from captured depth and color sequences. In addition to enhancing spatial resolution, I present a hybrid camera that improves the temporal resolution of a consumer depth sensor, together with an optimization framework to capture high-speed motion and generate high-speed depth streams. Given partial scans from the depth sensor, I also develop a novel fusion approach that builds complete, watertight human models with a template-guided registration method. Finally, the problem of surface reconstruction for non-Lambertian objects, on which current depth sensors fail, is addressed by exploiting multi-view images captured with a hand-held color camera, and I propose a visual-hull-based approach to recover the 3D model.
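    As a rough stand-in for the depth enhancement step, the sketch below re-estimates each depth value with a joint bilateral filter steered by the aligned color image, so color edges guide where depth detail is preserved. The dissertation's optimization is richer; names and parameters here are assumptions.

    ```python
    import numpy as np

    def guided_depth_refine(depth, color, win=5, sigma_c=12.0, sigma_s=2.0):
        """Joint-bilateral refinement of depth guided by an aligned color image.

        depth : (H, W) depth map; color : (H, W, 3) aligned color image.
        """
        H, W = depth.shape
        r = win // 2
        out = np.zeros_like(depth, dtype=float)
        for y in range(H):
            for x in range(W):
                y0, y1 = max(y - r, 0), min(y + r + 1, H)
                x0, x1 = max(x - r, 0), min(x + r + 1, W)
                patch_d = depth[y0:y1, x0:x1]
                patch_c = color[y0:y1, x0:x1].astype(float)
                gy, gx = np.mgrid[y0:y1, x0:x1]
                # Spatial weight (distance to center pixel).
                w_s = np.exp(-((gy - y) ** 2 + (gx - x) ** 2) / (2 * sigma_s ** 2))
                # Range weight (color similarity to center pixel).
                w_c = np.exp(-np.sum((patch_c - color[y, x]) ** 2, axis=2)
                             / (2 * sigma_c ** 2))
                w = w_s * w_c
                out[y, x] = (w * patch_d).sum() / w.sum()
        return out
    ```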

    Multi-view feature engineering and learning

    We frame the problem of local representation of imaging data as the computation of minimal sufficient statistics that are invariant to nuisance variability induced by viewpoint and illumination. We show that, under very stringent conditions, these are related to the “feature descriptors” commonly used in Computer Vision. Such conditions can be relaxed if multiple views of the same scene are available. We propose a sampling-based and a point-estimate based approximation of such a representation, compared empirically on image-to-(multiple)image matching, for which we introduce a multi-view wide-baseline matching benchmark consisting of a mixture of real and synthetic objects with ground-truth camera motion and dense three-dimensional geometry.
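    A toy version of the point-estimate approximation might simply average descriptors of the same scene point across views and re-normalize, as sketched below; this conveys only the idea of marginalizing viewpoint and illumination nuisances, not the paper's actual construction.

    ```python
    import numpy as np

    def pooled_descriptor(per_view_descs):
        """Average per-view descriptors of one scene point, then re-normalize.

        per_view_descs : (V, K) array, one K-dim descriptor per view.
        """
        d = np.asarray(per_view_descs, dtype=float).mean(axis=0)
        return d / (np.linalg.norm(d) + 1e-12)
    ```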

    Robust Learning Architectures for Perceiving Object Semantics and Geometry

    Parsing object semantics and geometry in a scene is a core task in visual understanding. This includes classifying object identity and category, localizing and segmenting an object from cluttered background, estimating object orientation, and parsing 3D shape structure. With the emergence of deep convolutional architectures in recent years, substantial progress has been made towards learning scalable image representations for large-scale vision problems such as image classification. However, fundamental challenges remain in learning robust object representations. First, creating object representations that are robust to changes in viewpoint while capturing local visual details continues to be a problem: recent convolutional architectures employ spatial pooling to achieve scale and shift invariance, but they remain sensitive to out-of-plane rotations. Second, deep Convolutional Neural Networks (CNNs) are purely data-driven and predominantly pose scene interpretation as an end-to-end black-box mapping, whereas decades of work on perceptual organization in both human and machine vision suggests that there are often intermediate representations intrinsic to an inference task, which provide essential structure that improves generalization.

    In this dissertation, we present two methodologies to surmount these issues. We first introduce a multi-domain pooling framework which groups local visual signals within generic feature spaces that are invariant to 3D object transformation, thereby reducing the sensitivity of output features to spatial deformations. We formulate a probabilistic analysis of pooling which further supports the multi-domain pooling principle; this principle also guides the design of convolutional architectures that achieve state-of-the-art performance on instance classification and semantic segmentation. We also present a multi-view fusion algorithm which efficiently computes multi-domain pooling features on incrementally reconstructed scenes and aggregates semantic confidence to boost long-term performance for semantic segmentation.

    Next, we explore an approach for injecting prior domain structure into neural network training, which leads a CNN to recover a sequence of intermediate milestones towards the final goal. Our approach supervises hidden layers of a CNN with intermediate concepts that are normally not observed in practice. We formulate a probabilistic framework which formalizes these notions and predicts improved generalization via this deep supervision method. One advantage of this approach is that a model trained on synthetic CAD renderings of cluttered scenes, where concept values can be extracted, generalizes to the real image domain. We implement this deep supervision framework with a novel CNN architecture which is trained on synthetic images only and achieves state-of-the-art 2D/3D keypoint localization on real image benchmarks. Finally, the proposed deep supervision scheme also motivates an approach for accurately inferring six Degree-of-Freedom (6-DoF) pose for a large number of object classes from single or multiple views.
    To learn discriminative pose features, we integrate three new capabilities into a deep CNN: an inference scheme that combines classification and pose regression based on a uniform tessellation of SE(3); fusion of a class prior into the training process via a tiled class map; and an additional regularization using deep supervision with an object mask. Further, an efficient multi-view framework is formulated to address single-view ambiguity. We show that the proposed multi-view scheme consistently improves the performance of the single-view network, and our approach achieves competitive or superior performance relative to the current state of the art on three large-scale benchmarks.
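    The classification-plus-regression readout can be sketched as follows: pick the highest-scoring cell of the rotation tessellation, then refine with that cell's regressed offset. For simplicity the sketch uses Euler angles and additive refinement, which is an assumption; an actual SE(3) version would compose group elements rather than add angles.

    ```python
    import numpy as np

    def decode_pose(bin_logits, bin_offsets, bin_centers):
        """Decode pose from per-cell scores and per-cell regressed offsets.

        bin_logits  : (B,)   classification scores over B tessellation cells
        bin_offsets : (B, 3) per-cell regressed Euler-angle offsets (radians)
        bin_centers : (B, 3) Euler-angle centers of the cells
        """
        b = int(np.argmax(bin_logits))       # coarse cell via classification
        return bin_centers[b] + bin_offsets[b]  # fine refinement via regression
    ```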

    Computational Modeling of Human Dorsal Pathway for Motion Processing

    Reliable motion estimation in videos is of crucial importance for background identification, object tracking, action recognition, event analysis, self-navigation, and more. Reconstructing the motion field in the 2D image plane is very challenging due to variations in image quality, scene geometry, lighting conditions, and, most importantly, camera jittering. Traditional optical flow models assume consistent image brightness and a smooth motion field, assumptions violated by the unstable illumination and motion discontinuities common in real-world videos.

    To recognize observer (or camera) motion robustly in complex, realistic scenarios, we propose a biologically inspired motion estimation system. The bottom-up model draws on the infrastructure and functionality of the human dorsal pathway, and its hierarchical processing stream can be divided into three stages: 1) spatio-temporal processing for local motion, 2) recognition of global motion patterns (camera motion), and 3) preemptive estimation of object motion. To extract effective and meaningful motion features, we apply a series of steerable, spatio-temporal filters to detect local motion at different speeds and directions, in a way that is selective for motion velocity. The intermediate response maps are calibrated and combined to estimate dense motion fields in local regions, and local motions along two orthogonal axes are then aggregated to recognize planar, radial, and circular patterns of global motion. We evaluate the model on an extensive, realistic video database collected by hand with a mobile device (an iPad), with content varying in scene geometry, lighting conditions, view perspective, and depth. We achieve high-quality results and demonstrate that this bottom-up model is capable of extracting high-level semantic knowledge regarding self-motion in realistic scenes.

    Once the global motion is known, we segment objects from moving backgrounds by compensating for camera motion. For videos captured with non-stationary cameras, we consider global motion as a combination of camera motion (background) and object motion (foreground). To estimate foreground motion, we exploit the corollary discharge mechanism of biological systems and estimate motion preemptively. Since the background motion at each pixel is collectively induced by camera movement, we apply spatio-temporal averaging to estimate background motion at the pixel level, and an initial estimate of foreground motion is derived by comparing global and background motion at multiple spatial levels. The real frame signals are then compared with those derived by forward prediction, refining the estimates of object motion. This motion detection system is applied to detect objects against cluttered, moving backgrounds and proves effective at locating independently moving, non-rigid regions.

    The core contribution of this thesis is a robust motion estimation system for complicated real-world videos, with challenges posed by real sensor noise, complex natural scenes, variations in illumination and depth, and motion discontinuities. The overall system demonstrates biological plausibility and holds great potential for other applications such as camera motion removal, heading estimation, obstacle avoidance, route planning, and vision-based navigational assistance.
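    To illustrate the global-motion aggregation stage, the toy sketch below classifies a dense flow field as planar, radial, or circular using the mean translation, divergence, and curl as proxies. This is a simple stand-in for the model's biologically inspired pattern recognition, and the score comparison is an assumption for illustration only.

    ```python
    import numpy as np

    def global_motion_pattern(u, v):
        """Classify the dominant global pattern of a dense flow field.

        u, v : (H, W) horizontal and vertical flow components.
        """
        du_dx = np.gradient(u, axis=1)
        du_dy = np.gradient(u, axis=0)
        dv_dx = np.gradient(v, axis=1)
        dv_dy = np.gradient(v, axis=0)
        planar = np.hypot(u.mean(), v.mean())    # uniform translation strength
        radial = (du_dx + dv_dy).mean()          # divergence: expansion/contraction
        circular = (dv_dx - du_dy).mean()        # curl: rotation about the view axis
        scores = {"planar": planar, "radial": abs(radial), "circular": abs(circular)}
        return max(scores, key=scores.get)
    ```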