10,798 research outputs found
Robust Intrinsic and Extrinsic Calibration of RGB-D Cameras
Color-depth cameras (RGB-D cameras) have become the primary sensors in most
robotics systems, from service robotics to industrial robotics applications.
Typical consumer-grade RGB-D cameras are provided with a coarse intrinsic and
extrinsic calibration that generally does not meet the accuracy requirements
needed by many robotics applications (e.g., highly accurate 3D environment
reconstruction and mapping, high precision object recognition and localization,
...). In this paper, we propose a human-friendly, reliable and accurate
calibration framework that makes it easy to estimate both the intrinsic and
extrinsic parameters of a general color-depth sensor couple. Our approach is
based on a novel two-component error model. This model unifies the error
sources of RGB-D pairs based on different technologies, such as
structured-light 3D cameras and time-of-flight cameras. Our method provides
some important advantages compared to other state-of-the-art systems: it is
general (i.e., well suited for different types of sensors), based on an easy
and stable calibration protocol, provides a greater calibration accuracy, and
has been implemented within the ROS robotics framework. We report detailed
experimental validations and performance comparisons to support our statements.
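To make the roles of the intrinsic and extrinsic parameters concrete, the following is a minimal sketch of how a depth pixel can be mapped into the color image under a plain pinhole model. It is not the calibration method of the paper and does not model its error terms; the function names and all numeric values are hypothetical, chosen only for illustration.

```python
# Sketch only: pinhole model without lens distortion or depth error correction.
# Intrinsics are given as (fx, fy, cx, cy); extrinsics as a 3x3 rotation
# (nested lists) and a translation that map depth-camera coordinates to
# color-camera coordinates. All names here are illustrative assumptions.

def backproject(u, v, z, fx, fy, cx, cy):
    """Turn depth pixel (u, v) with depth z into a 3D point in the depth frame."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

def transform(point, rotation, translation):
    """Apply the rigid transform p' = R p + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def project(point, fx, fy, cx, cy):
    """Project a 3D point in a camera frame onto that camera's image plane."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def depth_pixel_to_color_pixel(u, v, z, depth_K, color_K, rotation, translation):
    """Chain back-projection, rigid transform and projection."""
    p_depth = backproject(u, v, z, *depth_K)
    p_color = transform(p_depth, rotation, translation)
    return project(p_color, *color_K)
```

With identical intrinsics and identity extrinsics a pixel maps onto itself; any error in either parameter set shifts the reprojected pixel, which is why a coarse factory calibration can be insufficient for accurate reconstruction.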
Creating and Recognizing 3D Objects
This thesis investigates 3D Computer Vision, a research topic that is attracting ever-increasing attention thanks to the increasingly widespread availability of affordable 3D visual sensors, in particular consumer-grade RGB-D cameras.
The contribution of this dissertation is twofold. First, the work addresses how to compactly represent the content of images acquired with RGB-D cameras. Second, the thesis focuses on 3D Reconstruction, a key issue for efficiently populating the databases of 3D models deployed in object/category recognition scenarios.
As 3D Registration plays a fundamental role in 3D Reconstruction, the former part of the thesis proposes a pipeline for coarse registration of point clouds that is entirely based on the computation of 3D Local Reference Frames (LRF). Unlike related work in the literature, we also propose a comprehensive experimental evaluation based on diverse kinds of data (such as those acquired by laser scanners, RGB-D and stereo cameras) as well as on a quantitative comparison with three other methods.
Driven by the ever-lower costs and growing distribution of 3D sensing devices, we expect broad-scale integration of depth sensing into mobile devices to be forthcoming.
Accordingly, the thesis investigates merging appearance and shape information for Mobile Visual Search and focuses on encoding RGB-D images in compact binary codes.
An extensive experimental analysis on three state-of-the-art datasets, addressing both category and instance recognition scenarios, has led to the development of an RGB-D search engine architecture that can attain high recognition rates with remarkably moderate bandwidth requirements.
Our experiments also include a comparison with the CDVS (Compact Descriptors for Visual Search) pipeline, a candidate to become part of the MPEG-7 standard.
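As a rough illustration of why compact binary codes keep bandwidth low, the sketch below ranks database entries by Hamming distance to a query code. It is not the encoding or search architecture developed in the thesis; the 8-bit codes, the object names and the function names are hypothetical.

```python
# Sketch only: each image is assumed to be already encoded as an integer
# whose bits form its binary code. How the code is computed from RGB-D
# data is outside the scope of this example.

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def search(query, database, k=1):
    """Return the names of the k database entries closest to the query."""
    ranked = sorted(database.items(), key=lambda item: hamming(query, item[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical 8-bit codes; real descriptors would be much longer.
database = {"mug": 0b10110010, "chair": 0b01001101, "bowl": 0b10110110}
```

Only the query code has to be transmitted to the server, so the payload is a handful of bytes rather than a full color and depth image.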
mmBody Benchmark: 3D Body Reconstruction Dataset and Analysis for Millimeter Wave Radar
Millimeter Wave (mmWave) Radar is gaining popularity as it can work in
adverse environments like smoke, rain, snow, poor lighting, etc. Prior work has
explored the possibility of reconstructing 3D skeletons or meshes from the
noisy and sparse mmWave Radar signals. However, it is unclear how accurately we
can reconstruct the 3D body from the mmWave signals across scenes and how the
radar performs compared with cameras, which are important aspects to
consider when either using mmWave radars alone or combining them with
cameras. To answer these questions, an automatic 3D body annotation system is
first designed and built up with multiple sensors to collect a large-scale
dataset. The dataset consists of synchronized and calibrated mmWave radar point
clouds and RGB(D) images in different scenes and skeleton/mesh annotations for
humans in the scenes. With this dataset, we train state-of-the-art methods with
inputs from different sensors and test them in various scenarios. The results
demonstrate that 1) despite the noise and sparsity of the generated point
clouds, the mmWave radar can achieve better reconstruction accuracy than the
RGB camera but worse than the depth camera; 2) the reconstruction from the
mmWave radar is affected by adverse weather conditions moderately while the
RGB(D) camera is severely affected. Further, analysis of the dataset and the
results offers insights into improving the reconstruction from the mmWave radar
and the combination of signals from different sensors. Comment: ACM Multimedia 2022, Project Page:
https://chen3110.github.io/mmbody/index.htm
RGB-D And Thermal Sensor Fusion: A Systematic Literature Review
In the last decade, the computer vision field has seen significant progress
in multimodal data fusion and learning, where multiple sensors, including
depth, infrared, and visual, are used to capture the environment across diverse
spectral ranges. Despite these advancements, there has been no systematic and
comprehensive evaluation of fusing RGB-D and thermal modalities to date. While
autonomous driving using LiDAR, radar, RGB, and other sensors has garnered
substantial research interest, along with the fusion of RGB and depth
modalities, the integration of thermal cameras and, specifically, the fusion of
RGB-D and thermal data, has received comparatively less attention. This might
be partly due to the limited number of publicly available datasets for such
applications. This paper provides a comprehensive review of both
state-of-the-art and traditional methods used in fusing RGB-D and thermal
camera data for various applications, such as site inspection, human tracking,
fault detection, and others. The reviewed literature has been categorised into
technical areas, such as 3D reconstruction, segmentation, object detection,
available datasets, and other related topics. Following a brief introduction
and an overview of the methodology, the study delves into calibration and
registration techniques, then examines thermal visualisation and 3D
reconstruction, before discussing the application of classic feature-based
techniques as well as modern deep learning approaches. The paper concludes with
a discourse on current limitations and potential future research directions. It
is hoped that this survey will serve as a valuable reference for researchers
looking to familiarise themselves with the latest advancements and contribute
to the RGB-DT research field. Comment: 33 pages, 20 figures
TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement
We present TOCH, a method for refining incorrect 3D hand-object interaction sequences using a data prior. Existing hand trackers, especially those that rely on very few cameras, often produce visually unrealistic results with hand-object intersection or missing contacts. Although correcting such errors requires reasoning about temporal aspects of interaction, most previous work focuses on static grasps and contacts. At the core of our method are TOCH fields, a novel spatio-temporal representation for modeling correspondences between hands and objects during interaction. The key component is a point-wise object-centric representation which encodes the hand position relative to the object. Leveraging this novel representation, we learn a latent manifold of plausible TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate that TOCH outperforms state-of-the-art (SOTA) 3D hand-object interaction models, which are limited to static grasps and contacts. More importantly, our method produces smooth interactions even before and after contact. Using a single trained TOCH model, we quantitatively and qualitatively demonstrate its usefulness for 1) correcting erroneous reconstruction results from off-the-shelf RGB/RGB-D hand-object reconstruction methods, 2) de-noising, and 3) grasp transfer across objects. We will release our code and trained model on our project page at http://virtualhumans.mpi-inf.mpg.de/toch
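The idea of a point-wise, object-centric encoding of the hand can be illustrated with a deliberately simplified sketch: for every object point, store whether a hand point lies nearby, how far away it is, and the offset to it. This nearest-neighbour variant is an assumption made for illustration and is not the TOCH field definition from the paper; the function name and the `max_dist` threshold are hypothetical.

```python
import math

def object_centric_field(object_points, hand_points, max_dist=0.1):
    """For each object point, describe the closest hand point, if any is near.

    Returns one tuple per object point: (has_correspondence, distance, offset),
    where offset is the vector from the object point to the hand point.
    Points are (x, y, z) tuples; max_dist is an illustrative cutoff.
    """
    field = []
    for p in object_points:
        nearest = min(hand_points, key=lambda h: math.dist(p, h))
        d = math.dist(p, nearest)
        if d <= max_dist:
            offset = tuple(h - o for h, o in zip(nearest, p))
            field.append((True, d, offset))
        else:
            field.append((False, None, None))
    return field
```

Computing such a field for every frame of a sequence yields a spatio-temporal array attached to the object, which is the kind of input a temporal denoising auto-encoder could operate on.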
Scene Reconstruction And Understanding By RGB-D Sensors
This thesis investigates interactive scene reconstruction and understanding using RGB-D data only. Indeed, we believe that depth cameras will remain, in the near future, a cheap and low-power 3D sensing alternative that is also suitable for mobile devices. Therefore, our contributions build on top of state-of-the-art approaches to achieve advances in three main challenging scenarios, namely mobile mapping, large-scale surface reconstruction and semantic modeling. First, we will describe an effective approach dealing with Simultaneous Localization And Mapping (SLAM) on platforms with limited resources, such as a tablet device. Unlike previous methods, dense reconstruction is achieved by reprojection of RGB-D frames, while local consistency is maintained by deploying relative bundle adjustment principles. We will show quantitative results comparing our technique to the state-of-the-art as well as detailed reconstructions of various environments ranging from rooms to small apartments. Then, we will address large-scale surface modeling from depth maps exploiting parallel GPU computing. We will develop a real-time camera tracking method based on the popular KinectFusion system and an online surface alignment technique capable of counteracting drift errors and closing small loops. We will show very high quality meshes outperforming existing methods on publicly available datasets as well as on data recorded with our RGB-D camera even in complete darkness. Finally, we will move to our Semantic Bundle Adjustment framework to effectively combine object detection and SLAM in a unified system. Though the mathematical framework we will describe is not restricted to a particular sensing technology, in the experimental section we will refer, again, only to RGB-D sensing. We will discuss successful implementations of our algorithm showing the benefit of joint object detection, camera tracking and environment mapping.
TOCH: Spatio-Temporal Object-to-Hand Correspondence for Motion Refinement
We present TOCH, a method for refining incorrect 3D hand-object interaction
sequences using a data prior. Existing hand trackers, especially those that
rely on very few cameras, often produce visually unrealistic results with
hand-object intersection or missing contacts. Although correcting such errors
requires reasoning about temporal aspects of interaction, most previous works
focus on static grasps and contacts. At the core of our method are TOCH fields, a
novel spatio-temporal representation for modeling correspondences between hands
and objects during interaction. TOCH fields are a point-wise, object-centric
representation that encodes the hand position relative to the object.
Leveraging this novel representation, we learn a latent manifold of plausible
TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate
that TOCH outperforms state-of-the-art 3D hand-object interaction models, which
are limited to static grasps and contacts. More importantly, our method
produces smooth interactions even before and after contact. Using a single
trained TOCH model, we quantitatively and qualitatively demonstrate its
usefulness for correcting erroneous sequences from off-the-shelf RGB/RGB-D
hand-object reconstruction methods and transferring grasps across objects.
FULL 3D RECONSTRUCTION OF DYNAMIC NON-RIGID SCENES: ACQUISITION AND ENHANCEMENT
Recent advances in commodity depth or 3D sensing technologies have enabled us to move
closer to the goal of accurately sensing and modeling the 3D representations of complex
dynamic scenes. Indeed, in domains such as virtual reality, security, surveillance and
e-health, there is now a greater demand for affordable and flexible vision systems which
are capable of acquiring high quality 3D reconstructions. Available commodity RGB-D
cameras, though easily accessible, have a limited field of view and acquire noisy, low-resolution measurements, which restricts their direct use in building such vision systems.
This thesis targets these limitations and builds approaches around commodity 3D
sensing technologies to acquire noise-free and feature preserving full 3D reconstructions
of dynamic scenes containing static or moving, rigid or non-rigid objects. A mono-view
system based on a single RGB-D camera is incapable of acquiring a full 360-degree 3D reconstruction of a dynamic scene instantaneously. For this purpose, a multi-view system
composed of several RGB-D cameras covering the whole scene is used. In the first part of
this thesis, the domain of correctly aligning the information acquired from RGB-D cameras
in a multi-view system to provide full and textured 3D reconstructions of dynamic
scenes, instantaneously, is explored. This is achieved by solving the extrinsic calibration
problem. This thesis proposes an extrinsic calibration framework which uses the 2D
photometric and 3D geometric information, acquired with RGB-D cameras, according
to their relative (in)accuracies, affected by the presence of noise, in a single weighted
bi-objective optimization. An iterative scheme is also proposed, which estimates the parameters
of the noise model affecting both 2D and 3D measurements, and solves the extrinsic
calibration problem simultaneously. Results show improvement in calibration accuracy
as compared to state-of-the-art methods. In the second part of this thesis, the domain
of enhancement of noisy and low-resolution 3D data acquired with commodity RGB-D
cameras in both mono-view and multi-view systems is explored. This thesis extends
the state-of-the-art in mono-view template-free recursive 3D data enhancement which targets
dynamic scenes containing rigid objects, and thus requires tracking only the global
motions of those objects for view-dependent surface representation and filtering. This
thesis proposes to target dynamic scenes containing non-rigid objects which introduces
the complex requirements of tracking relatively large local motions and maintaining data
organization for view-dependent surface representation. The proposed method is shown
to be effective in handling non-rigid objects of changing topologies. Building upon the
previous work, this thesis overcomes the requirement of data organization by proposing
an approach based on view-independent surface representation. View-independence
decreases the complexity of the proposed algorithm and allows it the flexibility to process
and enhance noisy data, acquired with multiple cameras in a multi-view system,
simultaneously. Moreover, qualitative and quantitative experimental analysis shows this
method to be more accurate in removing noise to produce enhanced 3D reconstructions
of non-rigid objects. Although extending this method to a multi-view system would
allow for obtaining instantaneous enhanced full 360-degree 3D reconstructions of non-rigid
objects, it still lacks the ability to explicitly handle low-resolution data. Therefore, this
thesis proposes a novel recursive dynamic multi-frame 3D super-resolution algorithm
together with a novel 3D bilateral total variation regularization to filter out the noise,
recover details and enhance the resolution of data acquired from commodity cameras in
a multi-view system. Results show that this method is able to build accurate, smooth
and feature-preserving full 360-degree 3D reconstructions of dynamic scenes containing
non-rigid objects.
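The weighted bi-objective extrinsic calibration mentioned in this abstract can be pictured with a generic cost of the following form. This is a sketch under stated assumptions, not the formulation of the thesis: the inverse-variance weights, the symbols $\sigma_{2D}$ and $\sigma_{3D}$, the projection function $\pi$, and the correspondences $(\mathbf{X}_i, \mathbf{u}_i, \mathbf{Y}_i)$ are all introduced here for illustration.

```latex
% Assumed notation (not from the source):
%   X_i : 3D point measured by camera A
%   u_i : matching 2D pixel observed in camera B's color image
%   Y_i : matching 3D point measured by camera B
%   \pi : pinhole projection into camera B's image
%   \sigma_{2D}, \sigma_{3D} : noise levels of the 2D and 3D measurements
\[
(\hat{R}, \hat{\mathbf{t}}) =
\arg\min_{R,\,\mathbf{t}}\;
\frac{1}{\sigma_{2D}^{2}} \sum_{i} \left\| \mathbf{u}_i - \pi\!\left(R\,\mathbf{X}_i + \mathbf{t}\right) \right\|^{2}
+
\frac{1}{\sigma_{3D}^{2}} \sum_{i} \left\| \mathbf{Y}_i - \left(R\,\mathbf{X}_i + \mathbf{t}\right) \right\|^{2}
\]
```

Under this reading, the noisier measurement type receives the smaller weight, and an iterative scheme would alternate between re-estimating the noise levels from the current residuals and re-solving for $R$ and $\mathbf{t}$.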