Self-supervised learning for point cloud data: A survey
3D point clouds are a crucial type of data collected by LiDAR sensors and are widely used in transportation applications due to their concise descriptions and accurate localization. Deep neural networks (DNNs) have achieved remarkable success in processing large amounts of disordered and sparse 3D point clouds, especially in computer vision tasks such as pedestrian detection and vehicle recognition. Among the learning paradigms, Self-Supervised Learning (SSL), an unsupervised training paradigm that mines effective information from the data itself, is considered an essential solution to the time-consuming and labor-intensive data labeling problem through well-designed pre-training tasks. This paper provides a comprehensive survey of recent advances in SSL for point clouds. We first present an innovative taxonomy that groups the existing SSL methods into four broad categories based on the characteristics of their pretext tasks. Under each category, we further divide the methods into more fine-grained groups and summarize the strengths and limitations of the representative methods. We also compare the performance of notable SSL methods in the literature on multiple downstream tasks on benchmark datasets, both quantitatively and qualitatively. Finally, we propose a number of future research directions based on the identified limitations of existing SSL research on point clouds.
Multiple depth maps integration for 3D reconstruction using geodesic graph cuts
Depth images, in particular depth maps estimated from stereo vision, may contain a substantial number of outliers and result in inaccurate 3D modelling and reconstruction. To address this challenging issue, this paper proposes a graph-cut based approach to integrating multiple depth maps that yields smooth and watertight surfaces. First, confidence maps for the depth images are estimated to suppress noise, and from them reliable patches covering the object surface are determined. These patches are then exploited to estimate the path weight for 3D geodesic distance computation, where an adaptive regional term is introduced to deal with the “shorter-cuts” problem caused by the minimal surface bias. Finally, the adaptive regional term and the boundary term constructed using patches are combined in the graph-cut framework for more accurate and smoother 3D modelling. We demonstrate the superior performance of our algorithm on the well-known Middlebury multi-view database and additionally on real-world multiple depth images captured by Kinect. The experimental results show that our method preserves object protrusions and details while maintaining surface smoothness.
Novel robust computer vision algorithms for micro autonomous systems
People detection and tracking is an essential component of many autonomous platforms, interactive systems and intelligent vehicles used in search and rescue operations and similar humanitarian applications. Currently, researchers are focusing on vision sensors such as cameras because of their advantages over other sensor types: cameras are information rich, relatively inexpensive and easily available. Additionally, 3D information can be obtained from stereo vision, or by triangulating over several frames in monocular configurations. Another way to obtain 3D data is to use RGB-D sensors (e.g. Kinect), which provide both image and depth data. This method has become more attractive over the past few years due to its affordable price and availability to researchers. The aim of this research was to find robust multi-target detection and tracking algorithms for Micro Autonomous Systems (MAS) that incorporate the RGB-D sensor. Contributions include novel robust computer vision algorithms. A new framework for human body detection from video files, adapted from the Viola and Jones framework, was proposed to detect a single person. The 2D Multi Targets Detection and Tracking (MTDT) algorithm applied the Gaussian Mixture Model (GMM) to reduce noise in the pre-processing stage. Blob analysis was used to detect targets, and a Kalman filter was used to track them. The 3D MTDT extends the 2D algorithm by using depth data from the RGB-D sensor in the pre-processing stage. A Bayesian model was employed to provide multiple cues, including detection of the upper body, face, skin colour, motion and shape. The Kalman filter proved effective for the speed and robustness of track management. Simultaneous Localisation and Mapping (SLAM) fused with 3D information was also investigated. The new framework introduced front-end and back-end processing. The front end consists of localisation steps, post refinement and a loop closing system. The back end focuses on post-graph optimisation to eliminate errors. The proposed computer vision algorithms showed better speed and robustness, and the frameworks produced impressive results. The new algorithms can be used to improve performance in real-time applications on MAS, including surveillance, vision navigation, environmental perception and vision-based control systems.
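The tracking step mentioned in this abstract (a Kalman filter following detected blobs) can be illustrated with a minimal sketch. It assumes a constant-velocity motion model on a single centroid coordinate, with diagonal process noise; the class name, parameter names and default values are hypothetical and are not taken from the thesis.

```python
class ConstantVelocityKalman1D:
    """Track one coordinate of a blob centroid with state [position, velocity]."""

    def __init__(self, dt=1.0, process_var=1e-3, meas_var=1.0, init_var=100.0):
        self.dt = dt
        self.q = process_var  # assumed diagonal process noise (a simplification)
        self.r = meas_var     # assumed measurement noise variance
        self.p = 0.0          # estimated position
        self.v = 0.0          # estimated velocity
        # Covariance P = [[p00, p01], [p10, p11]] stored as four scalars.
        self.p00, self.p01, self.p10, self.p11 = init_var, 0.0, 0.0, init_var

    def predict(self):
        dt = self.dt
        self.p += dt * self.v
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I.
        p00 = self.p00 + dt * (self.p10 + self.p01) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def update(self, z):
        # Only the position is measured, so H = [1, 0].
        y = z - self.p                 # innovation
        s = self.p00 + self.r          # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s  # Kalman gain
        self.p += k0 * y
        self.v += k1 * y
        # P <- (I - K H) P
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


def track(measurements, dt=1.0):
    """Run predict/update over a sequence of centroid measurements."""
    kf = ConstantVelocityKalman1D(dt=dt)
    for z in measurements:
        kf.predict()
        kf.update(z)
    return kf.p, kf.v
```

For a blob moving at a steady two pixels per frame, `track([2.0 * k for k in range(1, 31)])` returns a velocity estimate close to 2. A full multi-target tracker would additionally need data association and track management, which this sketch leaves out.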
Room layout estimation on mobile devices
Room layout generation is the problem of generating a drawing or a digital model of an existing room from a set of measurements such as laser data or images. The generation of floor plans can find application in the building industry to assess the quality and correctness of an ongoing construction with respect to the initial model, or to quickly sketch the renovation of an apartment. The real estate industry can rely on automatic generation of floor plans to ease the process of checking the livable surface and to offer virtual visits to prospective customers. As for the general public, the room layout can be integrated into mixed reality games to provide a more immersive experience, or used in related augmented reality applications such as room redecoration. The goal of this industrial thesis (CIFRE) is to investigate and take advantage of state-of-the-art mobile devices in order to automate the process of generating room layouts. Modern mobile devices usually come with a wide range of sensors, such as an inertial measurement unit (IMU), RGB cameras and, more recently, depth cameras. Moreover, touchscreens offer a natural and simple way to interact with the user, thus favoring the development of interactive applications in which the user can be part of the processing loop. This work aims at exploiting the richness of such devices to address the room layout generation problem. The thesis has three major contributions. We first show how the classic problem of detecting vanishing points in an image can benefit from a prior given by the IMU sensor. We propose a simple and effective algorithm for detecting vanishing points that relies on the gravity vector estimated by the IMU. A new public dataset containing images and the relevant IMU data is introduced to help assess vanishing point algorithms and foster further studies in the field.
As a second contribution, we explore the state of the art of real-time localization and map optimization algorithms for RGB-D sensors. Real-time localization is a fundamental task for enabling augmented reality applications, and thus a critical component when designing interactive applications. We evaluate existing algorithms designed for the common desktop set-up with a view to employing them on a mobile device. For each method considered, we assess the accuracy of the localization as well as the computational performance when ported to a mobile device. Finally, we present a proof-of-concept application able to generate the room layout using a Project Tango tablet equipped with an RGB-D sensor. In particular, we propose an algorithm that incrementally processes and fuses the 3D data provided by the sensor in order to obtain the layout of the room. We show how our algorithm can rely on user interactions to correct the generated 3D model during the acquisition process.
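The first contribution of this thesis, using the IMU gravity vector as a prior for vanishing point detection, rests on a standard projective relation: the vanishing point of a 3D direction d is the image point K d, where K is the camera intrinsic matrix. The sketch below illustrates that relation only and is not the thesis algorithm. It assumes the gravity vector has already been expressed in the camera frame, and the function names and the angular tolerance are hypothetical.

```python
import math


def vertical_vanishing_point(K, gravity_cam):
    """Project the gravity direction (camera frame) to its vanishing point.

    K is a 3x3 intrinsic matrix as nested lists; gravity_cam is a 3-vector.
    Returns pixel coordinates (u, v), or None when gravity is parallel to the
    image plane and the vanishing point lies at infinity.
    """
    # Homogeneous image point of a direction d is K @ d.
    h = [sum(K[r][c] * gravity_cam[c] for c in range(3)) for r in range(3)]
    if abs(h[2]) < 1e-12:
        return None
    return (h[0] / h[2], h[1] / h[2])


def segment_points_to(p, q, vp, max_angle_deg=2.0):
    """Check whether image segment p-q is aligned with the vanishing point vp."""
    mid = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
    seg = (q[0] - p[0], q[1] - p[1])
    to_vp = (vp[0] - mid[0], vp[1] - mid[1])
    norm = math.hypot(*seg) * math.hypot(*to_vp)
    if norm == 0.0:
        return False
    # Absolute cosine, because the segment has no preferred orientation.
    cos_angle = abs(seg[0] * to_vp[0] + seg[1] * to_vp[1]) / norm
    return cos_angle >= math.cos(math.radians(max_angle_deg))
```

With the vertical vanishing point fixed by the IMU, line segments can be labelled as vertical or not before any search for the remaining vanishing points, which is one plausible way such a prior reduces the search space.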
Reconstruction and recognition of confusable models using three-dimensional perception
Perception is one of the key topics in robotics research. It concerns the processing
of external sensor data and its interpretation. The need for fully autonomous
robots makes it crucial to help them perform tasks more reliably, flexibly, and
efficiently. As these platforms obtain more refined manipulation capabilities, they
also require expressive and comprehensive environment models: for manipulation
and affordance purposes, their models have to include every object
present in the world, together with its location, pose, shape and other aspects.
The aim of this dissertation is to provide a solution to several of the challenges
that arise when tackling the object grasping problem, with the goal of improving
the autonomy of the mobile manipulator robot MANFRED-2. Through the analysis
and interpretation of 3D perception, this thesis first addresses the
localization of supporting planes in the scene. As the environment will contain
many other things apart from the planar surface, the problem in cluttered
scenes has been solved by means of Differential Evolution, a particle-based
evolutionary algorithm that evolves over time toward the solution that yields the
lowest value of the cost function.
Since the final purpose of this thesis is to provide valuable information for
grasping applications, a complete model reconstructor has been developed. The
proposed method has many features, such as robustness against abrupt rotations,
multi-dimensional optimization, feature extensibility, compatibility with other scan
matching techniques, management of uncertain information and an initialization
process to reduce convergence time. It has been designed using an evolutionary-based
scan matching optimizer that takes into account surface features of the object,
global form and also texture and color information.
The last challenge tackled is the recognition problem. In order to provide
the robot with useful information about the environment, a meta classifier that efficiently discerns the observed objects has been implemented. It is capable
of distinguishing between confusable objects, such as mugs or dishes with similar
shapes but different size or color.
The contributions presented in this thesis have been fully implemented and
empirically evaluated on the platform. A continuous grasping pipeline covering
everything from perception to grasp planning, including visual object recognition for confusable
objects, has been developed. For that purpose, an indoor environment with
several objects on a table is presented near the robot. Items are recognized
from a database and, if one is chosen, the robot calculates how to grasp
it, taking into account the kinematic restrictions associated with the anthropomorphic
hand and the 3D model for this particular object.
Perception and Navigation in Autonomous Systems in the Era of Learning: A Survey
Autonomous systems possess the features of inferring their own state,
understanding their surroundings, and performing autonomous navigation. With
the applications of learning systems, like deep learning and reinforcement
learning, the visual-based self-state estimation, environment perception and
navigation capabilities of autonomous systems have been efficiently addressed,
and many new learning-based algorithms have surfaced with respect to autonomous
visual perception and navigation. In this review, we focus on the applications
of learning-based monocular approaches in ego-motion perception, environment
perception and navigation in autonomous systems, which is different from
previous reviews that discussed traditional methods. First, we delineate the
shortcomings of existing classical visual simultaneous localization and mapping
(vSLAM) solutions, which demonstrate the necessity to integrate deep learning
techniques. Second, we review the visual-based environmental perception and
understanding methods based on deep learning, including deep learning-based
monocular depth estimation, monocular ego-motion prediction, image enhancement,
object detection, semantic segmentation, and their combinations with
traditional vSLAM frameworks. Then, we focus on the visual navigation based on
learning systems, mainly including reinforcement learning and deep
reinforcement learning. Finally, we examine several challenges and promising
directions discussed and concluded in related research of learning systems in
the era of computer science and robotics. Comment: This paper has been accepted by IEEE TNNL
Deeply Learned Priors for Geometric Reconstruction
This thesis comprises a body of work that investigates the use of deeply learned priors for dense geometric reconstruction of scenes. A typical image captured by a 2D camera sensor is a lossy two-dimensional (2D) projection of our three-dimensional (3D) world. Geometric reconstruction approaches usually recreate the lost structural information by taking in multiple images observing a scene from different views and solving a problem known as Structure from Motion (SfM) or Simultaneous Localization and Mapping (SLAM). Remarkably, by establishing correspondences across images and using geometric models, these methods (under reasonable conditions) can reconstruct a scene's 3D structure as well as precisely localise the observed views relative to the scene. The success of dense every-pixel multi-view reconstruction is, however, limited by matching ambiguities that commonly arise due to uniform texture, occlusion, and appearance distortion, among several other factors. The standard approach to dealing with matching ambiguities is to handcraft priors based on assumptions like piecewise smoothness or planarity in the 3D map, in order to "fill in" map regions supported by little or ambiguous matching evidence. In this thesis we propose learned priors that, in comparison, more closely model the true structure of the scene and are based on geometric information predicted from the images. The motivation stems from recent advancements in deep learning algorithms and the availability of massive datasets, which have allowed Convolutional Neural Networks (CNNs) to predict geometric properties of a scene, such as point-wise surface normals and depths, from just a single image, more reliably than was possible using previous machine learning-based or hand-crafted methods.
In particular, we first explore how single image-based surface normals from a CNN trained on massive amounts of indoor data can benefit the accuracy of dense reconstruction given input images from a moving monocular camera. Here we propose a novel surface normal based inverse depth regularizer and compare its performance against the inverse depth smoothness prior that is typically used to regularize textureless regions in the reconstruction. We also propose the first real-time CNN-based framework for live dense monocular reconstruction using our learned normal prior. Next, we look at how we can use deep learning to learn features in order to improve the pixel matching process itself, which is at the heart of multi-view geometric reconstruction. We propose a self-supervised feature learning scheme using RGB-D data from a 3D sensor (which does not require any manual labelling) and a multi-scale CNN architecture for feature extraction that is fast and efficient to run inside our proposed real-time monocular reconstruction framework. We extensively analyze the combined benefits of using learned normals and deep features that are good-for-matching in the context of dense reconstruction, both quantitatively and qualitatively on large real world datasets. Lastly, we explore how learned depths, also predicted on a per-pixel basis from a single image using a CNN, can be used to inpaint sparse 3D maps obtained from monocular SLAM or a 3D sensor. We propose a novel model that uses predicted depths and confidences from CNNs as priors to inpaint maps with arbitrary scale and sparsity. We obtain more reliable reconstructions than those of traditional depth inpainting methods such as the cross-bilateral filter, which in comparison offer few learnable parameters. Here we advocate the idea of "just-in-time reconstruction", where a higher level of scene understanding reliably inpaints the corresponding portion of a sparse map on-demand and in real-time.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201
Real-time spatial modeling to detect and track resources on construction sites
For more than 10 years the U.S. construction industry has experienced over 1,000
fatalities annually. Many fatalities may have been prevented had the individuals and
equipment involved been more aware of and alert to the physical state of the environment
around them. Awareness may be improved by automatic 3D (three-dimensional) sensing
and modeling of the job site environment in real-time. Existing 3D modeling approaches
based on range scanning techniques are capable of modeling static objects only, and thus
cannot model in real-time dynamic objects in an environment comprised of moving
humans, equipment, and materials. Emerging prototype 3D video range cameras offer
another alternative by facilitating affordable, wide field of view, automated static and
dynamic object detection and tracking at frame rates better than 1 Hz (real-time).
This dissertation presents empirical work and a methodology to rapidly create a
spatial model of construction sites and in particular to detect, model, and track the position, dimension, direction, and velocity of static and moving project resources in real-time, based on range data obtained from a three-dimensional video range camera in a
static or moving position. Existing construction site 3D modeling approaches based on
optical range sensing technologies (laser scanners, rangefinders, etc.) and 3D modeling
approaches (dense, sparse, etc.) that offered potential solutions for this research are
reviewed. The choice of an emerging sensing tool and preliminary experiments with this
prototype sensing technology are discussed. These findings led to the development of a
range data processing algorithm based on three-dimensional occupancy grids which is
demonstrated in detail. Testing and validation of the proposed algorithms have been
conducted to quantify the performance of sensor and algorithm through extensive
experimentation involving static and moving objects. Experiments in indoor laboratory
and outdoor construction environments have been conducted with construction resources
such as humans, equipment, materials, or structures to verify the accuracy of the
occupancy grid modeling approach. Results show that modeling objects and measuring
their position, dimension, direction, and speed had an accuracy level compatible with the
requirements of active safety features for construction. Results demonstrate that video
rate 3D data acquisition and analysis of construction environments can support effective
detection, tracking, and convex hull modeling of objects. Exploiting rapidly generated
three-dimensional models for improved visualization, communications, and process
control has inherent value, broad application, and potential impact, e.g. as-built vs. as-planned comparison, condition assessment, maintenance, operations, and construction
activities control. In combination with effective management practices, this sensing
approach has the potential to assist equipment operators in avoiding incidents that result in
human injury, death, or collateral damage on construction sites.
Civil, Architectural, and Environmental Engineerin
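The occupancy grid idea described in this abstract can be sketched in a few lines: range points are binned into voxels, sparsely hit voxels are discarded as noise, and the occupied set yields an object's dimensions and, across frames, its velocity. This is an illustrative sketch under stated assumptions and not the dissertation's algorithm. The function names, the `min_hits` threshold and the use of an axis-aligned bounding box (in place of the convex hull modeling the abstract mentions) are all hypothetical simplifications.

```python
import math
from collections import Counter


def build_occupancy_grid(points, voxel_size, min_hits=2):
    """Return the set of voxel indices hit by at least min_hits range points."""
    hits = Counter(
        tuple(int(math.floor(c / voxel_size)) for c in point) for point in points
    )
    return {voxel for voxel, count in hits.items() if count >= min_hits}


def bounding_box(voxels, voxel_size):
    """Axis-aligned (min corner, max corner) of the occupied voxels, in metres."""
    mins = tuple(min(v[i] for v in voxels) * voxel_size for i in range(3))
    maxs = tuple((max(v[i] for v in voxels) + 1) * voxel_size for i in range(3))
    return mins, maxs


def centroid(voxels, voxel_size):
    """Mean of the occupied voxel centres, in metres."""
    n = len(voxels)
    return tuple(
        sum((v[i] + 0.5) * voxel_size for v in voxels) / n for i in range(3)
    )


def velocity(voxels_prev, voxels_next, voxel_size, dt):
    """Finite-difference velocity of the occupied region between two frames."""
    c0 = centroid(voxels_prev, voxel_size)
    c1 = centroid(voxels_next, voxel_size)
    return tuple((b - a) / dt for a, b in zip(c0, c1))
```

Treating the whole occupied set as one object is the main shortcut here; separating several workers or machines would require clustering neighbouring voxels before measuring each cluster.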