
    Recovering dense 3D motion and shape information from RGB-D data

    University of Technology Sydney. Faculty of Engineering and Information Technology. 3D motion and 3D shape information are essential to many research fields, such as computer vision, computer graphics, and augmented reality. Thus, 3D motion estimation and 3D shape recovery are two important topics in these research communities. RGB-D cameras have become more accessible in recent years. They are popular for their good mobility, low cost, and high frame rate. However, these RGB-D cameras generate low-resolution and low-accuracy depth images due to chip size limitations and ambient illumination perturbation. Obtaining high-resolution and high-accuracy 3D information from RGB-D data is therefore an important task. This research investigates 3D motion estimation and 3D shape recovery solutions for RGB-D cameras. Within this thesis, various methods are developed and presented to address the following research challenges: fusing passive stereo vision and active depth acquisition; 3D motion estimation based on RGB-D data; and depth super-resolution based on RGB-D video with large displacement 3D motion. In Chapter 3, a framework is presented to acquire depth images by fusing active depth acquisition and passive stereo vision. Active depth acquisition and passive stereo vision each have limitations, but their range-sensing characteristics are complementary. Thus, combining both approaches can produce more accurate results than using either one alone. Unlike previous fusion methods, the noisy depth observation from active depth acquisition is initially taken as prior knowledge of the scene structure, which improves the accuracy of the fused depth images. Chapter 4 details a method for 3D scene flow estimation based on RGB-D data. The accuracy of scene flow estimation is limited by two issues: occlusions and large displacement motions. To handle occlusions, the occlusion status is modelled, and the scene flow and occluded regions are jointly estimated.
To deal with large displacement motions, an over-parameterised scene flow representation is employed to model both the rotation and translation components of the scene flow. In Chapter 5, a depth super-resolution framework is presented for RGB-D video sequences with large 3D motion. To handle large 3D motion, our framework has two stages: motion compensation and fusion. A superpixel-based motion estimation approach is proposed for efficient motion compensation. The fusion task is modelled as a regression problem, and a specific deep convolutional neural network (CNN) is designed that can learn the mapping function between depth image observations and the fused depth image given a large amount of training data.
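
As a rough illustration of what a rotation-plus-translation scene flow parameterisation means, the sketch below computes the 3D displacement that a rigid motion induces at each point. It is not taken from the thesis: the axis-angle parameterisation and the function names `rotation_from_axis_angle` and `scene_flow_from_rigid_motion` are assumptions chosen for this example.

```python
import numpy as np

def rotation_from_axis_angle(omega):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    omega = np.asarray(omega, dtype=float)
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.eye(3)  # no rotation
    k = omega / theta
    # Skew-symmetric cross-product matrix of the unit axis k.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def scene_flow_from_rigid_motion(points, omega, t):
    """Displacement of each 3D point under rotation omega and translation t.

    points: array of shape (N, 3); returns an (N, 3) array of flow vectors.
    """
    points = np.asarray(points, dtype=float)
    R = rotation_from_axis_angle(omega)
    moved = points @ R.T + np.asarray(t, dtype=float)
    return moved - points

# A point on the x-axis, rotated 90 degrees about z and lifted by 0.5 along z.
flow = scene_flow_from_rigid_motion([[1.0, 0.0, 0.0]], [0.0, 0.0, np.pi / 2], [0.0, 0.0, 0.5])
```

Describing motion with six parameters rather than a free 3D vector lets a single set of parameters explain large displacements of all points that move together rigidly.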

    Super-resolution of 3-dimensional scenes

    Super-resolution is an image enhancement method that increases the resolution of images and video. Previously this technique could only be applied to 2D scenes. The super-resolution algorithm developed in this thesis creates high-resolution views of 3-dimensional scenes, using low-resolution images captured from varying, unknown positions

    Irish Machine Vision and Image Processing Conference Proceedings 2017


    Non-Standard Imaging Techniques

    The first objective of the thesis is to investigate the problem of reconstructing a small-scale object (a few millimeters or smaller) in 3D. In Chapter 3, we show how this problem can be solved effectively by a new multifocus multiview 3D reconstruction procedure, which includes a new fixed-lens multifocus image capture and a calibrated image registration technique using an analytic homography transformation. Experimental results on real and synthetic images demonstrate the effectiveness of the proposed solutions: both the fixed-lens image capture and multifocus stacking with calibrated image alignment significantly reduce the errors in the camera poses and produce more complete 3D reconstructed models than the conventional moving-lens image capture and multifocus stacking. The second objective of the thesis is modelling the dual-pixel (DP) camera. In Chapter 4, to understand the potential of the DP sensor for computer vision applications, we study the formation of the DP pair, which links the blur and the depth information. A mathematical DP model is proposed that allows depth to be estimated from blur. These explorations motivate us to propose an end-to-end DDDNet (DP-based Depth and Deblur Network) to jointly estimate the depth and restore the image. Moreover, we define a reblur loss, which reflects the relationship of the DP image formation process with depth information, to regularize our depth estimate in training. To meet the requirement of a large amount of data for learning, we propose the first DP image simulator, which allows us to create datasets with DP pairs from any existing RGB-D dataset. As a side contribution, we collect a real dataset for further research. Extensive experimental evaluation on both synthetic and real datasets shows that our approach achieves competitive performance compared to state-of-the-art approaches.
The third objective of this thesis is to tackle the multifocus image fusion problem, particularly for long multifocus image sequences. Multifocus image stacking/fusion produces an in-focus image of a scene from a number of partially focused images of that scene in order to extend the depth of field. One limitation of current state-of-the-art multifocus fusion methods is that they do not consider image registration/alignment before fusion. Consequently, fusing unregistered multifocus images produces an in-focus image containing misalignment artefacts. In Chapter 5, we propose image registration by projective transformation before fusion to remove the misalignment artefacts. We also propose a method based on 3D deconvolution to retrieve the in-focus image by formulating the multifocus image fusion problem as a 3D deconvolution problem. The proposed method achieves superior performance compared to the state-of-the-art methods. It is also shown that the proposed projective transformation for image registration can improve the quality of the fused images. Moreover, we implement a multifocus simulator to generate synthetic multifocus data from any RGB-D dataset. The fourth objective of this thesis is to explore new ways to detect the polarization state of light. To achieve this objective, in Chapter 6, we investigate a new optical filter, the optical rotation filter, for detecting the polarization state with fewer images. The proposed method can estimate the polarization state using two images, one with the filter and one without. The accuracy of estimating the polarization parameters using the proposed method is close to that of the existing state-of-the-art method. In addition, the feasibility of detecting the polarization state using only one RGB image captured with the optical rotation filter is demonstrated by estimating the image without the filter from the image with the filter using a generative adversarial network.

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines typically depend on images captured in real-world environments. This means that images might be affected by different sources of perturbation (e.g. sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has, hence, attracted wide interest within the computer vision communities. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
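
The evaluation protocol described above, perturbing test images with Gaussian and uniform noise at several levels, could be set up along the following lines. This is only a minimal sketch: it assumes images are float arrays scaled to [0, 1], and the function names and the noise levels are hypothetical, not the values used in the paper.

```python
import numpy as np

def add_gaussian_noise(images, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma, then clip to [0, 1]."""
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_uniform_noise(images, amplitude, rng):
    """Add noise drawn uniformly from [-amplitude, amplitude), then clip to [0, 1]."""
    noisy = images + rng.uniform(-amplitude, amplitude, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
test_images = np.full((4, 28, 28), 0.5)  # stand-in for 28x28 grayscale test images

# Hypothetical noise levels; the paper's actual levels are not given in the abstract.
perturbed_sets = {
    ("gaussian", s): add_gaussian_noise(test_images, s, rng) for s in (0.05, 0.1, 0.2)
}
perturbed_sets.update(
    {("uniform", a): add_uniform_noise(test_images, a, rng) for a in (0.05, 0.1, 0.2)}
)
```

Each perturbed set would then be passed through the same trained classifier, so that accuracy can be compared across noise types and levels without retraining.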

    Real-Time Structure and Object Aware Semantic SLAM

    Simultaneous Localization And Mapping (SLAM) is one of the fundamental problems in mobile robotics and addresses the reconstruction of a previously unseen environment while simultaneously localising a mobile robot with respect to it. For visual SLAM, the simplest representation of the map is a collection of 3D points that is sparse and efficient to compute and update, particularly for large-scale environments; however, it lacks semantic information and is not useful for high-level tasks such as robotic grasping and manipulation. Although methods to compute denser representations have been proposed, these reconstructions remain equivalent to a collection of points and therefore carry no additional semantic information or relationships. Man-made environments contain many structures and objects that carry high-level semantics and can potentially act as landmarks of a SLAM map, while encapsulating semantic information as opposed to a set of points. For instance, planes are good representations for feature-deprived regions, where they provide information complementary to points and can also model dominant planar layouts of the environment with very few parameters. Furthermore, a generic representation for previously unseen objects can be used as a general landmark that carries semantics in the reconstructed map. Integrating visual semantic understanding and geometric reconstruction has been studied before; however, for various reasons, including high-level geometric entities in the SLAM framework has been restricted to a slow, offline structure-from-motion context, or high-level entities merely act as regulators for points in the map instead of independent landmarks.
One of the critical reasons is the lack of a proper mathematical representation for high-level landmarks; the other main reasons are the challenge of detecting and tracking these landmarks and of formulating an observation model, that is, a mapping between corresponding observable image quantities and the estimated parameters of the representations. In this work, we address these challenges to achieve an online real-time SLAM framework with scalable maps consisting of both sparse points and high-level structural and semantic landmarks such as planes and objects. We explicitly target real-time performance and keep that as a beacon which critically influences the choice of representation and all the modules of our SLAM system. In the context of factor graphs, we propose novel representations for structural entities as planes and for general, unseen and not-predefined objects as bounded dual quadrics that decompose to permit a clean, fast and effective real-time implementation that is amenable to the nonlinear least-squares formulation and respects the sparsity pattern of the SLAM problem. In this representation we are not concerned with high-fidelity reconstruction of individual objects, but rather with representing the general layout and orientation of objects in the environment. The minimal representation of planes is also explored, leading to a representation that can be constructed and updated online in a least-squares framework. Another challenge that we address in this work is to marry high-level landmark detections based on deep-learned frameworks with geometric SLAM systems. Due to the recent success of CNN-based object detection and of depth and surface normal estimation from a single image, it is now feasible to detect and estimate these semantic landmarks from single RGB images, leading us seamlessly from an RGB-D SLAM system to pure monocular SLAM thanks to the real-time predictions of the trained CNN and appropriate representations.
Furthermore, to benefit from deep-learned priors, we incorporate high-fidelity single-image reconstructions and hallucinations of objects on top of the coarse quadrics to enrich the sparse map semantically, while constraining the shape of the coarse quadrics even more. Pertinent to our beacon, the proposed landmark representations in the map also provide the potential for imposing additional constraints and priors that carry crucial semantic information about the scene, without incurring great extra computational cost. In this work, we have explored and proposed constraints such as priors on the extent and shape of the objects, a point-plane regularizer, and plane-plane (Manhattan assumption) and plane-object (supporting affordance) constraints. We evaluate our proposed SLAM system extensively using different input sensor modalities, from RGB-D to monocular, on almost all publicly available benchmarks, both indoors and outdoors, to show its applicability as a general-purpose SLAM solution. The extensive experiments show the efficacy of our SLAM system through different comparisons and ablation studies, including high-level structures and objects with imposed constraints among them in various scenarios. In particular, the estimated camera trajectories are improved significantly in varied sequences of visual SLAM datasets and also in our own sequences captured with a UR5 robotic arm equipped with a depth camera. In addition to more accurate camera trajectories, our system yields enriched sparse maps with semantically meaningful planar structures and generic objects in the scene along with their mutual relationships. Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201
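
To give a sense of why dual quadrics are convenient object landmarks, the sketch below uses the standard projective-geometry relation that a dual quadric Q* projects under a 3x4 camera matrix P to the dual conic C* = P Q* Pᵀ, which is one way an observation model between an ellipsoid and its image outline can be written. It is not the thesis's implementation: the function names, the identity camera, and the example ellipsoid are assumptions made for illustration.

```python
import numpy as np

def ellipsoid_dual_quadric(center, semi_axes, R=None):
    """4x4 dual quadric of an ellipsoid with the given center, semi-axes and rotation R."""
    R = np.eye(3) if R is None else np.asarray(R, dtype=float)
    Z = np.eye(4)
    Z[:3, :3] = R
    Z[:3, 3] = np.asarray(center, dtype=float)
    a, b, c = semi_axes
    # Canonical (origin-centered, axis-aligned) dual quadric, moved by the pose Z.
    return Z @ np.diag([a * a, b * b, c * c, -1.0]) @ Z.T

def project_dual_quadric(P, Q_dual):
    """Dual conic (3x3) of the ellipsoid's image outline under camera matrix P (3x4)."""
    return P @ Q_dual @ P.T

def conic_center(C_dual):
    """Center of the projected ellipse, read off the dual conic."""
    return C_dual[:2, 2] / C_dual[2, 2]

# Unit sphere at (1, 0, 5) seen by a camera with identity intrinsics at the origin.
P = np.hstack([np.eye(3), np.zeros((3, 1))])
Q = ellipsoid_dual_quadric([1.0, 0.0, 5.0], [1.0, 1.0, 1.0])
C = project_dual_quadric(P, Q)
center = conic_center(C)
```

Because the projection is a single matrix product, the predicted outline can be compared with a detected bounding box inside a least-squares residual at low cost.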

    A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision

    Higher dimensional data such as video and 3D are the leading edge of multimedia retrieval and computer vision research. In this survey, we give a comprehensive overview of, and key insights into, the state of the art of higher dimensional features from both deep learning and traditional approaches. Current approaches frequently use 3D information from the sensor or use 3D in modeling and understanding the 3D world. With the growth of prevalent application areas such as 3D games, self-driving automobiles, health monitoring and sports activity training, a wide variety of new sensors have allowed researchers to develop feature description models beyond 2D. Although higher dimensional data enhance the performance of methods on numerous tasks, they can also introduce new challenges and problems. The higher dimensionality of the data often leads to more complicated structures, which present additional problems both in extracting meaningful content and in adapting it for current machine learning algorithms. Due to the major importance of the evaluation process, we also present an overview of the current datasets and benchmarks. Moreover, based on more than 330 papers from this study, we present the major challenges and future directions. Computer Systems, Imagery and Medi