
    How Geometry Meets Learning in Pose Estimation

    This thesis focuses on one of the fundamental problems in computer vision, six-degree-of-freedom (6DoF) pose estimation, whose task is to predict the geometric transformation from the camera to a target of interest using only RGB inputs. Solutions to this problem have been proposed using image retrieval or sparse 2D-3D correspondence matching with geometric verification. Thanks to the development of deep learning, direct regression-based approaches (computing the pose directly by image-to-pose regression) and indirect reconstruction-based approaches (solving for the pose via dense matching between the image and a 3D reconstruction) using neural networks have recently drawn growing attention in the community. Although deep models have been proposed for both camera relocalisation and object pose estimation, open questions remain. In this thesis, we investigate several problems in pose estimation: end-to-end object pose inference, uncertainty estimation in regression-based methods, and self-supervision for reconstruction-based learning, both for scenes and for objects.

    The first part of this thesis focuses on end-to-end 6DoF pose regression for objects. Traditional methods that predict the 6DoF pose of an object usually rely on a 3D CAD model and require a multi-step scheme to compute the pose. We instead regress the pose directly, building on Mask R-CNN, a region-proposal-based network well known for object detection and instance segmentation. Our newly proposed network head regresses a 4D vector from the RoI feature map of each object: a 3D vector from the Lie algebra represents the rotation, and a scalar for the z-axis of the translation is predicted to recover the full 3D translation together with the position of the bounding box. This simplification avoids the spatial ambiguity in the 2D image caused by RoIPooling. Our method is accurate at inference time and faster than methods that require 3D models and refinement in their pipeline.

    In the second part, we estimate the uncertainty of the pose regressed by a deep model. A CNN is combined with Gaussian Process Regression (GPR) to build a framework that directly produces a predictive distribution over camera pose. The combination exploits the CNN to extract discriminative features and the GPR to perform probabilistic inference. To prevent the complexity of uncertainty estimation from growing with the number of training images in the dataset, we use pseudo inducing CNN feature points to represent the whole dataset and learn their representations with Stochastic Variational Inference (SVI). This makes the GPR a parametric model that can be learnt jointly with the CNN backbone. We test the proposed hybrid framework on the problem of camera relocalisation.

    The third and fourth parts of the thesis share a common objective: seeking self-supervision for learning dense reconstructions for pose estimation from images, without using ground-truth 3D models of scenes (part 3) or objects (part 4). We explore an alternative supervisory signal from multi-view geometry: photometric and/or featuremetric consistency between image pairs from different viewpoints is used to constrain the learning of world-centric coordinates (part 3) and object-centric coordinates (part 4). The dense reconstruction model is subsequently used to establish 2D-3D correspondences at inference time, and the 6DoF pose is computed with PnP plus RANSAC. Our 3D-model-free methods for pose estimation eliminate the dependency on 3D models used in state-of-the-art approaches.

    Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
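    As a rough illustration of the first part above, the sketch below shows how a regressed 4D output (a Lie-algebra/axis-angle rotation plus a depth scalar), together with a detected bounding box, can be turned back into a full 6DoF pose. The intrinsics and variable names are placeholders for illustration, not the thesis implementation.

```python
# Hypothetical sketch: recover a full 6DoF pose from a 4D regression output
# (3 axis-angle components + 1 depth scalar) and a detected bounding box.
# The intrinsics K and all values below are illustrative placeholders.
import numpy as np
import cv2

def recover_pose(axis_angle, z, bbox, K):
    """axis_angle: (3,) Lie-algebra rotation, z: depth scalar,
    bbox: (x1, y1, x2, y2) in pixels, K: 3x3 camera intrinsics."""
    rvec = np.asarray(axis_angle, dtype=np.float64).reshape(3, 1)
    R, _ = cv2.Rodrigues(rvec)              # axis-angle -> rotation matrix

    # Treat the bounding-box centre as the projection of the object centre,
    # then back-project it with the predicted depth z.
    cx = 0.5 * (bbox[0] + bbox[2])
    cy = 0.5 * (bbox[1] + bbox[3])
    x = (cx - K[0, 2]) * z / K[0, 0]
    y = (cy - K[1, 2]) * z / K[1, 1]
    return R, np.array([x, y, z])

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = recover_pose([0.1, -0.2, 0.05], 1.3, (100, 120, 220, 260), K)
```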
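    For the reconstruction-based parts (3 and 4), the inference step described above can be sketched with standard OpenCV calls: dense 2D-3D correspondences predicted by the network are fed to PnP with RANSAC. The correspondences below are random placeholders just to show the call, so the recovered pose is meaningless.

```python
# Minimal sketch of pose recovery from dense 2D-3D correspondences via
# PnP + RANSAC (cv2.solvePnPRansac). Inputs are random placeholders.
import numpy as np
import cv2

pts_3d = np.random.rand(500, 3).astype(np.float32)                  # predicted scene/object coordinates
pts_2d = (np.random.rand(500, 2) * [640, 480]).astype(np.float32)   # matching pixel locations
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d, pts_2d, K, None,
    iterationsCount=100, reprojectionError=3.0)
if ok:
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix of the estimated camera pose
```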

    Label Efficient 3D Scene Understanding

    3D scene understanding models are becoming increasingly integrated into modern society. With applications ranging from autonomous driving, Augmented Reality, Virtual Reality, robotics and mapping, the demand for well-behaved models is rapidly increasing. A key requirement for training modern 3D models is high-quality manually labelled training data. Collecting training data is often the time and monetary bottleneck, limiting the size of datasets. As modern data-driven neural networks require very large datasets to achieve good generalisation, alternative strategies to manual labelling are sought after by many industries. In this thesis, we present a comprehensive study on achieving 3D scene understanding with fewer labels. Specifically, we evaluate four approaches: existing data, synthetic data, weak supervision and self-supervision. Existing data looks at the potential of using readily available national mapping data as coarse labels for training a building segmentation model. We further introduce an energy-based active contour snake algorithm to improve label quality by utilising co-registered LiDAR data. This is attractive because, whilst the models may still require manual labels, these labels already exist. Synthetic data also exploits existing data that was not originally designed for training neural networks. We demonstrate a pipeline for generating a synthetic Mobile Laser Scanner dataset and experimentally evaluate whether such a synthetic dataset can be used for pre-training on smaller real-world datasets, increasing generalisation with less data. A weakly-supervised approach is presented which achieves competitive performance on challenging real-world benchmark 3D scene understanding datasets with up to 95% less data. We propose a novel learning approach where the loss function itself is learnt. Our key insight is that the loss function is a local function and can therefore be trained with less data on a simpler task. Once trained, our loss function can be used to train a 3D object detector using only unlabelled scenes. Our method is both flexible and very scalable, even performing well across datasets. Finally, we propose a method which requires only a single geometric representation of each object class as supervision for 3D monocular object detection. We discuss why typical L2-like losses do not work for 3D object detection when using differentiable-renderer-based optimisation, and show that the undesirable local minima that L2-like losses fall into can be avoided by including a Generative Adversarial Network-like loss. We achieve state-of-the-art performance on the challenging 6DoF LineMOD dataset, without any scene-level labels.
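    The final contribution above combines a render-and-compare objective with an adversarial term. The following is a minimal, hypothetical PyTorch sketch of that combination under assumed mask shapes and a toy discriminator; it is not the thesis architecture, and the loss weighting is illustrative only.

```python
# Hypothetical sketch: augmenting a render-and-compare (L2-like) loss with a
# GAN-like term so pose optimisation can escape poor local minima.
# The discriminator and the random "rendered silhouette" are tiny stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiscriminator(nn.Module):
    """Judges whether a 64x64 silhouette looks like a plausible object mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(8, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Flatten(), nn.Linear(16 * 16 * 16, 1))
    def forward(self, x):
        return self.net(x)

def combined_loss(pred_mask, target_mask, D, w_adv=0.1):
    l2 = F.mse_loss(pred_mask, target_mask)               # render-and-compare term
    adv = F.binary_cross_entropy_with_logits(             # GAN-like term
        D(pred_mask), torch.ones(pred_mask.size(0), 1))
    return l2 + w_adv * adv

D = TinyDiscriminator()
pred = torch.rand(1, 1, 64, 64, requires_grad=True)       # stands in for a rendered silhouette
target = torch.rand(1, 1, 64, 64)
loss = combined_loss(pred, target, D)
loss.backward()
```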

    Towards Object-Centric Scene Understanding

    Visual perception for autonomous agents continues to attract community attention due to disruptive technologies and the wide applicability of such solutions. Autonomous Driving (AD), a major application in this domain, promises to revolutionize our approach to mobility while bringing critical advantages in limiting accident fatalities. Fueled by recent advances in Deep Learning (DL), more computer vision tasks are being addressed with a learning paradigm. Deep Neural Networks (DNNs) have consistently succeeded in pushing performance to unprecedented levels and in demonstrating the ability of such approaches to generalize to an increasing number of difficult problems, such as 3D vision tasks. In this thesis, we address two main challenges arising from current approaches: the computational complexity of multi-task pipelines, and the increasing need for manual annotations. On the one hand, AD systems need to perceive the surrounding environment at different levels of detail and subsequently take timely actions; this multitasking further limits the time available for each perception task. On the other hand, the need for such systems to generalize to massively diverse situations requires large-scale datasets covering long-tailed cases. This requirement renders traditional supervised approaches, despite the data readily available in the AD domain, unsustainable in terms of annotation costs, especially for 3D tasks. Driven by the nature of the AD environment, whose complexity (unlike indoor scenes) is dominated by the presence of other scene elements (mainly cars and pedestrians), we focus on the above-mentioned challenges in object-centric tasks. We then situate our contributions appropriately in a fast-paced literature, supporting our claims with extensive experimental analysis that leverages up-to-date state-of-the-art results and community-adopted benchmarks.

    Visual Perception For Robotic Spatial Understanding

    Humans understand the world through vision without much effort. We perceive the structure, objects, and people in the environment and pay little direct attention to most of it, until it becomes useful. Intelligent systems, especially mobile robots, have no such biologically engineered vision mechanism to take for granted. In contrast, we must devise algorithmic methods of taking raw sensor data and converting it to something useful very quickly. Vision is such a necessary part of building a robot or any intelligent system that is meant to interact with the world that it is somewhat surprising we don't have off-the-shelf libraries for this capability. Why is this? The simple answer is that the problem is extremely difficult. There has been progress, but the current state of the art is impressive and depressing at the same time. We now have neural networks that can recognize many objects in 2D images, in some cases performing better than a human. Some algorithms can also provide bounding boxes or pixel-level masks to localize the object. We have visual odometry and mapping algorithms that can build reasonably detailed maps over long distances with the right hardware and conditions. On the other hand, we have robots with many sensors and no efficient way to compute their relative extrinsic poses for integrating the data in a single frame. The same networks that produce good object segmentations and labels in a controlled benchmark still miss obvious objects in the real world and have no mechanism for learning on the fly while the robot is exploring. Finally, while we can detect pose for very specific objects, we don't yet have a mechanism that detects pose that generalizes well over categories or that can describe new objects efficiently. We contribute algorithms in four of the areas mentioned above. First, we describe a practical and effective system for calibrating many sensors on a robot with up to 3 different modalities. Second, we present our approach to visual odometry and mapping that exploits the unique capabilities of RGB-D sensors to efficiently build detailed representations of an environment. Third, we describe a 3-D over-segmentation technique that utilizes the models and ego-motion output of the previous step to generate temporally consistent segmentations with camera motion. Finally, we develop a synthesized dataset of chair objects with part labels and investigate the influence of parts on RGB-D based object pose recognition using a novel network architecture we call PartNet.
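    A small, self-contained sketch of the basic RGB-D operation underlying odometry and mapping pipelines like the one described above: back-projecting a depth image into a camera-frame point cloud. The pinhole intrinsics and the random depth frame are assumptions for illustration, not the thesis setup.

```python
# Back-project a depth image into a 3D point cloud in the camera frame,
# using assumed pinhole intrinsics. Depth values of zero are treated as invalid.
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """depth: (H, W) array in metres; returns (H*W, 3) points, NaN where depth == 0."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth.astype(np.float64)
    z[z == 0] = np.nan                               # mark invalid depth readings
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.random.uniform(0.5, 4.0, size=(480, 640))   # placeholder depth frame
points = depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```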

    Implicit Object Pose Estimation on RGB Images Using Deep Learning Methods

    With the rise of robotic and camera systems and the success of deep learning in computer vision, there is growing interest in precisely determining object positions and orientations. This is crucial for tasks like automated bin picking, where a camera sensor analyzes images or point clouds to guide a robotic arm in grasping objects. Pose recognition has broader applications, such as predicting a car's trajectory in autonomous driving or adapting objects in virtual reality based on the viewer's perspective. This dissertation focuses on RGB-based pose estimation methods that use depth information only for refinement, which is a challenging problem. Recent advances in deep learning have made it possible to predict object poses in RGB images despite challenges like object overlap, object symmetries, and more. We introduce two implicit deep learning-based pose estimation methods for RGB images, covering the entire process from data generation to pose selection. Furthermore, theoretical findings on Fourier embeddings are shown to improve the performance of so-called implicit neural representations, which are then successfully utilized for the task of implicit pose estimation.
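    The Fourier embeddings mentioned above are, in their common form, sinusoidal encodings of input coordinates that help coordinate-based networks represent high-frequency detail. Below is a generic sketch of such an encoding; the frequency schedule and dimensions are illustrative assumptions, not the dissertation's settings.

```python
# Generic Fourier-feature positional encoding for coordinate inputs, as commonly
# used with implicit neural representations. Frequencies here are illustrative.
import numpy as np

def fourier_embed(x, num_freqs=8):
    """x: (N, D) coordinates; returns (N, 2 * num_freqs * D) embedding."""
    freqs = 2.0 ** np.arange(num_freqs) * np.pi        # octave-spaced frequencies
    angles = x[:, None, :] * freqs[None, :, None]      # (N, num_freqs, D)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).reshape(len(x), -1)

coords = np.random.uniform(-1, 1, size=(4, 3))          # e.g. normalised 3D points
emb = fourier_embed(coords)                             # shape (4, 48)
```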