On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks
Learning-based methods for dense 3D vision problems typically train on
3D sensor data. Each principle of measuring distance has its own advantages
and drawbacks, which are rarely compared or discussed in the literature due to
a lack of multi-modal datasets. Texture-less regions are problematic for
structure from motion and stereo, reflective materials pose issues for active
sensing, and distances to translucent objects are difficult to measure with
existing hardware. Training on inaccurate or corrupt data induces model bias
and hampers generalisation, and these effects remain unnoticed if the sensor
measurement is treated as ground truth during evaluation. This paper
investigates the effect of sensor errors on the dense 3D vision tasks of depth
estimation and reconstruction. We rigorously show the significant impact of
sensor characteristics on the learned predictions and observe generalisation
issues arising from various sensing technologies in everyday household
environments. For evaluation, we introduce a carefully
designed dataset (available at
https://github.com/Junggy/HAMMER-dataset) comprising measurements from
commodity sensors, namely D-ToF, I-ToF, passive/active stereo, and monocular
RGB+P. Our study quantifies the considerable sensor noise impact and paves the
way to improved dense vision estimates and targeted data fusion.
Comment: Accepted at CVPR 2023, Main Paper + Supp. Mat. arXiv admin note: substantial text overlap with arXiv:2205.0456
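The evaluation pitfall noted above, a model scored against the sensor's own noisy measurement rather than the true geometry, can be sketched numerically. All depths and noise levels below are invented for illustration:

```python
import numpy as np

# Synthetic illustration: a prediction that is closer to the true geometry
# than the sensor is can still look worse when scored against the sensor.
rng = np.random.default_rng(0)
true_depth = rng.uniform(0.5, 5.0, size=10_000)                        # metres
pred = true_depth + rng.normal(0.0, 0.02, size=true_depth.shape)       # accurate model
sensor_gt = true_depth + rng.normal(0.0, 0.05, size=true_depth.shape)  # noisier sensor

mae_vs_true = np.abs(pred - true_depth).mean()
mae_vs_sensor = np.abs(pred - sensor_gt).mean()
print(f"MAE vs. true geometry:         {mae_vs_true:.4f}")
print(f"MAE vs. sensor 'ground truth': {mae_vs_sensor:.4f}")
```

Here the prediction is genuinely closer to the true geometry than the sensor measurement is, yet its apparent error is substantially inflated when the noisy measurement serves as ground truth.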
Jigsaw: Learning to Assemble Multiple Fractured Objects
Automated assembly of 3D fractures is essential in orthopedics, archaeology,
and our daily life. This paper presents Jigsaw, a novel framework for
assembling physically broken 3D objects from multiple pieces. Our approach
leverages hierarchical features of global and local geometry to match and align
the fracture surfaces. Our framework consists of three components: (1) surface
segmentation to separate fracture from original parts, (2) multi-part matching
to find correspondences among fracture-surface points, and (3) robust global
alignment to recover the global poses of the pieces. We show how to jointly
learn segmentation and matching and seamlessly integrate feature matching and
rigidity constraints. We evaluate Jigsaw on the Breaking Bad dataset and
achieve superior performance compared to state-of-the-art methods. Our method
also generalizes well to diverse fracture modes, objects, and unseen instances.
To the best of our knowledge, this is the first learning-based method designed
specifically for 3D fracture assembly over multiple pieces.
Comment: 17 pages, 9 figures
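The multi-part matching stage can be illustrated with a common baseline: mutual nearest neighbours in a learned per-point feature space. The feature dimensionality and the mutual-NN rule here are illustrative assumptions, not the paper's exact matcher:

```python
import numpy as np

def mutual_nearest_matches(feat_a, feat_b):
    """Mutual nearest-neighbour correspondences between two point sets,
    given one feature descriptor per point (rows = points)."""
    sim = feat_a @ feat_b.T          # similarity matrix (dot product)
    best_b = sim.argmax(1)           # best match in b for each point of a
    best_a = sim.argmax(0)           # best match in a for each point of b
    # keep only pairs that pick each other
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]
```

The mutual check is a cheap way to suppress one-sided, ambiguous matches before feeding correspondences to a global alignment step.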
ObjectMatch: Robust Registration using Canonical Object Correspondences
We present ObjectMatch, a semantic and object-centric camera pose estimator
for RGB-D SLAM pipelines. Modern camera pose estimators rely on direct
correspondences of overlapping regions between frames; however, they cannot
align camera frames with little or no overlap. In this work, we propose to
leverage indirect correspondences obtained via semantic object identification.
For instance, when an object is seen from the front in one frame and from the
back in another frame, we can provide additional pose constraints through
canonical object correspondences. We first propose a neural network to predict
such correspondences on a per-pixel level, which we then combine in our energy
formulation with state-of-the-art keypoint matching solved with a joint
Gauss-Newton optimization. In a pairwise setting, our method improves the
registration recall of state-of-the-art feature matching, e.g. from 24% to
45% on pairs with 10% or less inter-frame overlap. In registering RGB-D
sequences, our method outperforms cutting-edge SLAM baselines in challenging,
low-frame-rate scenarios, achieving more than 35% reduction in trajectory error
in multiple scenes.
Comment: Project Page: http://cangumeli.github.io/ObjectMatch Video: https://www.youtube.com/watch?v=kuXoKVrzUR
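The joint Gauss-Newton solve mentioned above operates on a richer energy than plain point alignment, but its core iteration can be sketched for the simplest case: 3D point correspondences with a small-angle so(3) update. This is a generic sketch, not ObjectMatch's full formulation with canonical-object terms:

```python
import numpy as np

def skew(v):
    """3x3 cross-product matrix of a 3-vector."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_so3(w):
    """Rodrigues' formula: axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    K = skew(w)
    if theta < 1e-12:
        return np.eye(3) + K
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * K @ K)

def gauss_newton_pose(src, dst, iters=15):
    """Gauss-Newton for (R, t) minimizing sum_i ||R s_i + t - d_i||^2."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        q = src @ R.T                       # rotated source points
        r = (q + t - dst).ravel()           # stacked residuals, shape (3N,)
        # linearization: d r_i / d omega = -skew(q_i),  d r_i / d t = I
        J = np.concatenate([np.hstack([-skew(qi), np.eye(3)]) for qi in q])
        dx, *_ = np.linalg.lstsq(J, -r, rcond=None)
        R = exp_so3(dx[:3]) @ R             # multiplicative rotation update
        t = t + dx[3:]
    return R, t
```

In a real energy formulation, the same linearization is simply stacked with additional residual blocks (e.g. keypoint and canonical-correspondence terms) before each solve.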
Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge
Robot warehouse automation has attracted significant interest in recent years, perhaps most visibly in the Amazon Picking Challenge (APC) [1]. A fully autonomous warehouse pick-and-place system requires robust vision that reliably recognizes and locates objects amid cluttered environments, self-occlusions, sensor noise, and a large variety of objects. In this paper we present an approach that leverages multi-view RGB-D data and self-supervised, data-driven learning to overcome these difficulties. The approach was part of the MIT-Princeton Team system that took 3rd and 4th place in the stowing and picking tasks, respectively, at APC 2016. In the proposed approach, we segment and label multiple views of a scene with a fully convolutional neural network, and then fit pre-scanned 3D object models to the resulting segmentation to obtain the 6D object pose. Training a deep neural network for segmentation typically requires a large amount of training data, so we propose a self-supervised method to generate a large labeled dataset without tedious manual segmentation. We demonstrate that our system can reliably estimate the 6D pose of objects under a variety of scenarios. All code, data, and benchmarks are available at http://apc.cs.princeton.edu
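Fitting a pre-scanned model to the segmented points, as in the pipeline above, is conventionally done with a variant of ICP. Below is a minimal point-to-point sketch with brute-force nearest neighbours and a closed-form Kabsch update; the team's actual fitting procedure may differ:

```python
import numpy as np

def kabsch(P, Q):
    """Closed-form least-squares rigid transform (R, t) mapping P onto Q."""
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflections
    R = Vt.T @ S @ U.T
    return R, cq - R @ cp

def icp(model, scene, iters=20):
    """Minimal point-to-point ICP aligning `model` to `scene`
    (brute-force nearest neighbours; fine for small clouds)."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        moved = model @ R.T + t
        d2 = ((moved[:, None, :] - scene[None, :, :]) ** 2).sum(-1)
        nn = scene[d2.argmin(1)]          # nearest scene point per model point
        dR, dt = kabsch(moved, nn)        # incremental rigid update
        R, t = dR @ R, dR @ t + dt
    return R, t
```

Like any local method, this sketch needs a reasonable initial pose; production systems typically seed it from the segmentation or a global alignment stage and use a spatial index instead of the brute-force distance matrix.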
Q-REG: End-to-End Trainable Point Cloud Registration with Surface Curvature
Point cloud registration has seen recent success with several learning-based
methods that focus on correspondence matching and, as such, optimize only for
this objective. Following the learning step of correspondence matching, they
evaluate the estimated rigid transformation with a RANSAC-like framework. While
this is an indispensable component of these methods, it prevents fully
end-to-end training, leaving the objective of minimizing the pose error
unaddressed. We present a novel solution, Q-REG, which utilizes rich geometric
information to estimate the rigid pose from a single correspondence. Q-REG
allows us to formalize robust estimation as an exhaustive search, hence
enabling end-to-end training that optimizes over both objectives of
correspondence matching and rigid pose estimation. We demonstrate in the
experiments that Q-REG is agnostic to the correspondence matching method and
provides consistent improvement both when used only in inference and in
end-to-end training. It sets a new state of the art on the 3DMatch, KITTI, and
ModelNet benchmarks.
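Estimating a rigid pose from a single correspondence becomes concrete once each matched point carries an orthonormal local reference frame (e.g. assembled from the surface normal and principal curvature directions). A minimal sketch under that assumption, not Q-REG's exact construction:

```python
import numpy as np

def pose_from_one_correspondence(p_src, F_src, p_dst, F_dst):
    """Rigid pose from a single match whose points carry orthonormal
    local reference frames (columns = frame axes, e.g. normal and
    principal curvature directions)."""
    R = F_dst @ F_src.T      # rotation carrying the source frame onto the target frame
    t = p_dst - R @ p_src    # translation fixing the matched point
    return R, t
```

Because one hypothesis per correspondence fully determines the pose, robust estimation reduces to exhaustively scoring each candidate, which is what makes the search differentiable end to end.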
High-fidelity Human Body Modelling from User-generated Data
PhD thesis
Building high-fidelity human body models of real people benefits a variety of applications in fashion, health, entertainment, education and ergonomics. The goal of this thesis is to build visually plausible human body models from two kinds of user-generated data: low-quality point clouds and low-resolution 2D images. Due to advances in 3D scanning technology and the growing availability of cost-effective 3D scanners to general users, a full human body scan can be acquired within two minutes. However, due to the imperfections of scanning devices, occlusion, self-occlusion and untrained scanning operation, the acquired scans tend to be full of noise, holes (missing data), outliers and distorted parts. This thesis first investigates the establishment of shape correspondences for human body meshes. A robust and shape-aware approach is proposed to detect accurate shape correspondences for closed human body meshes. By investigating the vertex movements of 200 human body meshes, a robust non-rigid mesh registration method is proposed which combines a human body shape model with traditional non-rigid ICP. To facilitate the development and benchmarking of registration methods on Kinect Fusion data, a dataset of user-generated scans, named the Kinect-based 3D Human Body (K3D-hub) Dataset, is built with one Microsoft Kinect for XBOX 360. Besides building 3D human body models from point clouds, the thesis also tackles the problem of estimating accurate 3D human body models from single 2D images. A state-of-the-art parametric 3D human body model, SMPL, is fitted to 2D joints as well as the boundary of the human body. Fast Region-based CNN and deep CNN based methods are adopted to automatically detect the 2D joints and boundary for each human body image.
Considering the commonly encountered scenario where people are in stable poses most of the time, a stable pose prior is introduced from the CMU motion capture (mocap) dataset to further improve the accuracy of pose estimation.
Registration of non-rigidly deforming objects
This thesis investigates the current state-of-the-art in registration of non-rigidly deforming shapes. In particular, the problem of non-isometry is considered. First, a method to address locally anisotropic deformation is proposed. The subsequent evaluation of this method highlights a lack of resources for evaluating such methods.
Three novel registration/shape correspondence benchmark datasets are developed for assessing different aspects of non-rigid deformation. Deficiencies in current evaluative measures are identified, leading to the development of a new performance measure that effectively communicates the density and distribution of correspondences.
Finally, the problem of transferring skull orbit labels between scans is examined on a database of unlabelled skulls. A novel pipeline that mitigates errors caused by coarse representations is proposed.
Towards Quantitative Endoscopy with Vision Intelligence
In this thesis, we work on topics related to quantitative endoscopy with vision-based intelligence. Specifically, our works revolve around video reconstruction in endoscopy, where challenges such as texture scarcity, illumination variation and multimodality prevent prior works from operating effectively and robustly. To this end, we propose to combine the expressivity of deep learning approaches with the rigour and accuracy of non-linear optimization algorithms, developing a series of methods to confront these challenges towards quantitative endoscopy. We first propose a retrospective sparse reconstruction method that estimates a high-accuracy, high-density point cloud and a highly complete camera trajectory from a monocular endoscopic video with state-of-the-art performance. To enable this, a deep image feature descriptor is developed to replace the hand-crafted local descriptor and boost feature matching performance in a typical sparse reconstruction algorithm. A retrospective surface reconstruction pipeline is then proposed to estimate a textured surface model from a monocular endoscopic video, involving self-supervised depth and descriptor learning and a surface fusion technique. We show that the proposed method outperforms a popular dense reconstruction method and that the estimated reconstructions are in good agreement with surface models obtained from CT scans. To align video-reconstructed surface models with pre-operative imaging such as CT, we introduce a global point cloud registration algorithm that is robust to the resolution mismatch that often occurs in such multi-modal scenarios. Specifically, a geometric feature descriptor is developed in which a novel network normalization technique helps a 3D network produce more consistent and distinctive geometric features for samples with different resolutions.
The proposed geometric descriptor achieves state-of-the-art performance in our evaluation. Last but not least, a real-time SLAM system that estimates surface geometry and camera trajectory from a monocular endoscopic video is developed, using deep representations for geometry and appearance together with non-linear factor graph optimization. We show that the proposed SLAM system performs favorably compared with a state-of-the-art feature-based SLAM system.
Understanding and Facilitating Human-AI Teaming for Real-World Computer Vision Tasks
Recent machine learning research has demonstrated that many task-specific AI models now reach or surpass human performance on static benchmarks. However, in real-world applications where human users collaborate with, or rely on, AIs, key questions remain: Do these advancements in AI models inherently improve the user experience or augment users' capabilities? When and how should we partner users with AI to form effective human-AI teams? This dissertation explores new forms of human-AI collaboration in the context of real-world computer vision tasks. We demonstrate different user roles in diverse AI-assisted workflows -- from passive recipients of AI model outputs to active participants who steer the shaping of the model. 1) We developed intuitive user interfaces to make deep learning accessible to end users, in this case astrophysicists, without requiring machine learning expertise. The end-to-end model enhances the accuracy of automated processing of daily space observations from 20+ telescopes globally, and the streamlined interface injects confidence into researchers' AI-supported analysis of scientific imagery. 2) We proposed the concept of "restrained and zealous AIs" to harness the complementary strengths in human-AI teams. Insights from a month-long user study involving 78 professional data annotators suggest that recommendations from ill-suited AI counterparts may detrimentally affect users' skills. 3) Finally, we introduced the novel concept of "in-situ learning" in augmented reality, where the user interacts with physical objects to train spatially-aware AI models that remember the personalized environment and objects for various tasks. Each project brings the end user into a more active and engaged role in the inference, training, and evaluation processes of human-in-the-loop machine learning.
In summary, this dissertation provides insights into good practices for teaming humans with AI in real-world collaboration, informing the design of future AI-assisted systems.