Segmentation and Recovery of Superquadric Models using Convolutional Neural Networks
In this paper we address the problem of representing 3D visual data with
parameterized volumetric shape primitives. Specifically, we present a
(two-stage) approach built around convolutional neural networks (CNNs) capable
of segmenting complex depth scenes into the simpler geometric structures that
can be represented with superquadric models. In the first stage, our approach
uses a Mask R-CNN model to identify superquadric-like structures in depth scenes
and then fits superquadric models to the segmented structures using a specially
designed CNN regressor. Using our approach we are able to describe complex
structures with a small number of interpretable parameters. We evaluate the
proposed approach on synthetic as well as real-world depth data and show that
our solution not only achieves competitive performance in comparison to
the state of the art, but also decomposes scenes into a number of
superquadric models at a fraction of the time required by competing approaches.
We make all data and models used in the paper available from
https://lmi.fe.uni-lj.si/en/research/resources/sq-seg.
Comment: 8 pages, in Computer Vision Winter Workshop, 202
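For readers unfamiliar with the superquadric representation the paper fits, a minimal sketch of the standard inside-outside function is shown below; the function name and parameter layout are our own, not taken from the paper's code.

```python
import numpy as np

def superquadric_f(points, a, eps):
    """Inside-outside function of a superquadric centred at the origin.

    points: (N, 3) array; a: (a1, a2, a3) size parameters; eps: (eps1, eps2)
    shape parameters. Returns F, where F < 1 inside the shape, F == 1 on the
    surface, and F > 1 outside. A CNN regressor fitting superquadrics would
    predict (a, eps) plus a pose; this sketch only evaluates the surface.
    """
    x, y, z = np.abs(points).T          # superquadrics are symmetric in each axis
    a1, a2, a3 = a
    e1, e2 = eps
    xy = (x / a1) ** (2.0 / e2) + (y / a2) ** (2.0 / e2)
    return xy ** (e2 / e1) + (z / a3) ** (2.0 / e1)
```

With `a = (1, 1, 1)` and `eps = (1, 1)` the shape reduces to a unit sphere, so a point at distance 1 from the origin lies exactly on the surface.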
Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans
We propose an unsupervised method for parsing large 3D scans of real-world
scenes into interpretable parts. Our goal is to provide a practical tool for
analyzing 3D scenes with unique characteristics in the context of aerial
surveying and mapping, without relying on application-specific user
annotations. Our approach is based on a probabilistic reconstruction model that
decomposes an input 3D point cloud into a small set of learned prototypical
shapes. Our model provides an interpretable reconstruction of complex scenes
and leads to relevant instance and semantic segmentations. To demonstrate the
usefulness of our results, we introduce a novel dataset of seven diverse aerial
LiDAR scans. We show that our method outperforms state-of-the-art unsupervised
methods in terms of decomposition accuracy while remaining visually
interpretable. Our method offers a significant advantage over existing
approaches, as it does not require any manual annotations, making it a
practical and efficient tool for 3D scene analysis. Our code and dataset are
available at https://imagine.enpc.fr/~loiseaur/learnable-earth-parse
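As a toy illustration of decomposing a point cloud into a small set of prototypical shapes, the sketch below hard-assigns each scene point to its nearest prototype; this is a crude, non-probabilistic stand-in for the paper's learned reconstruction model, and all names are ours.

```python
import numpy as np

def assign_to_prototypes(points, prototypes):
    """Hard assignment of scene points to prototype shapes (toy sketch).

    points: (N, 3) scene point cloud; prototypes: list of (M_k, 3) prototype
    point sets. Each scene point is labelled with the index of the prototype
    whose nearest point is closest, yielding a simple instance segmentation.
    """
    dists = np.stack([
        # distance of every scene point to its nearest point in prototype p
        np.min(np.linalg.norm(points[:, None, :] - p[None, :, :], axis=-1), axis=1)
        for p in prototypes
    ], axis=1)                      # (N, K) point-to-prototype distances
    return np.argmin(dists, axis=1) # (N,) prototype label per point
```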
Part-Object Relational Visual Saliency
Recent years have witnessed a big leap in automatic visual saliency detection attributed to advances in deep learning, especially Convolutional Neural Networks (CNNs). However, inferring the saliency of each image part separately, as adopted by most CNN-based methods, inevitably leads to an incomplete segmentation of the salient object. In this paper, we describe how to use the part-object relations endowed by the Capsule Network (CapsNet) to solve problems that fundamentally hinge on relational inference for visual saliency detection. Concretely, we put in place a two-stream strategy, termed the Two-Stream Part-Object RelaTional Network (TSPORTNet), to implement CapsNet, aiming to reduce both the network complexity and the possible redundancy during capsule routing. Additionally, taking into account the correlations of capsule types across the preceding training images, a correlation-aware capsule routing algorithm is developed for more accurate capsule assignments at the training stage, which also speeds up training dramatically. By exploring part-object relationships, TSPORTNet produces a capsule wholeness map, which in turn aids multi-level features in generating the final saliency map. Experimental results on five widely used benchmarks show that our framework consistently achieves state-of-the-art performance. The code is available at https://github.com/liuyi1989/TSPORTNet
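For context, the plain dynamic routing-by-agreement that CapsNet builds on (Sabour et al.) can be sketched as follows; the paper's correlation-aware routing modifies this baseline by biasing the routing logits with capsule-type correlations, and the code here is our illustrative sketch, not TSPORTNet's.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Non-linearity that preserves vector orientation and maps norms to [0, 1)."""
    n2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def route(u_hat, iters=3):
    """Dynamic routing-by-agreement between part and object capsules.

    u_hat: (I, J, D) predictions made by I part capsules for J object
    capsules of dimension D. Returns (J, D) object capsule outputs.
    """
    b = np.zeros((u_hat.shape[0], u_hat.shape[1]))            # routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        v = squash((c[..., None] * u_hat).sum(axis=0))        # (J, D) outputs
        b = b + np.sum(u_hat * v[None], axis=-1)              # agreement update
    return v
```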
Massively Parallel Approach to Modeling 3D Objects in Machine Vision
Geometry-Aware Network for Non-Rigid Shape Prediction from a Single View
We propose a method for predicting the 3D shape of a deformable surface from
a single view. By contrast with previous approaches, we do not need a
pre-registered template of the surface, and our method is robust to the lack of
texture and partial occlusions. At the core of our approach is a
geometry-aware deep architecture that tackles the problem as usually done in
analytic solutions: first perform 2D detection of the mesh and then estimate a
3D shape that is geometrically consistent with the image. We train this
architecture in an end-to-end manner using a large dataset of synthetic
renderings of shapes under different levels of deformation, material
properties, textures and lighting conditions. We evaluate our approach on a
test split of this dataset and on available real benchmarks, consistently
improving on state-of-the-art solutions at a significantly lower computational
time.
Comment: Accepted at CVPR 201
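The second stage's notion of geometric consistency, estimating 3D vertices whose projection reproduces the 2D detection, can be illustrated with simple pinhole back-projection; the intrinsics and function names below are assumed for illustration only.

```python
import numpy as np

def lift_to_3d(uv, depth, K):
    """Back-project detected 2D mesh vertices to camera-space 3D points.

    uv: (N, 2) pixel coordinates; depth: (N,) per-vertex depths; K: (3, 3)
    camera intrinsics. The returned (N, 3) points project back exactly onto
    uv, i.e. they are geometrically consistent with the image by construction.
    """
    ones = np.ones((uv.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([uv, ones]).T  # (3, N) viewing rays
    return (rays * depth).T                            # scale rays by depth
```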
Investigating Scene Understanding for Robotic Grasping: From Pose Estimation to Explainable AI
In the rapidly evolving field of robotics, the ability to accurately grasp and manipulate objects—known as robotic grasping—is a cornerstone of autonomous operation. This capability is pivotal across a multitude of applications, from industrial manufacturing automation to supply chain management, and is a key determinant of a robot's ability to interact effectively with its environment. Central to this capability is the concept of scene understanding, a complex task that involves interpreting the robot's environment to facilitate decision-making and action planning. This thesis presents a comprehensive exploration of scene understanding for robotic grasping, with a particular emphasis on pose estimation, a critical aspect of scene understanding.
Pose estimation, the process of determining the position and orientation of objects within the robot's environment, is a crucial component of robotic grasping. It provides the robot with the necessary spatial information about the objects in the scene, enabling it to plan and execute grasping actions effectively. However, many current pose estimation methods express pose relative to a 3D model, a representation that is not descriptive on its own without reference to that model. This thesis explores the use of keypoints and superquadrics as more general and descriptive representations of an object's pose. These approaches address the limitations of traditional methods and significantly enhance the generalizability and descriptiveness of pose estimation, thereby improving the overall effectiveness of robotic grasping.
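As a concrete example of keypoint-based pose recovery of the kind discussed here, the classical Kabsch algorithm aligns a set of model keypoints to their observed counterparts; this is a generic sketch of the idea, not the thesis's own method.

```python
import numpy as np

def kabsch(P, Q):
    """Rigid pose (R, t) aligning model keypoints P to observed keypoints Q.

    P, Q: (N, 3) corresponding keypoint sets. Returns the rotation R and
    translation t minimising ||R @ p + t - q|| over all correspondences in
    the least-squares sense (Kabsch algorithm).
    """
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)               # cross-covariance of centred sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp
```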
In addition to pose estimation, this thesis briefly touches upon the importance of uncertainty estimation and explainable AI in the context of robotic grasping. It introduces the concept of multimodal consistency for uncertainty estimation, providing a reliable measure of uncertainty that can enhance decision-making in human-in-the-loop situations. Furthermore, it explores the realm of explainable AI, presenting a method for gaining deeper insights into deep learning models, thereby enhancing their transparency and interpretability.
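One toy reading of consistency-based uncertainty estimation is to score uncertainty by how much predictions from different modalities disagree; the sketch below is our simplified illustration, not the thesis's exact measure.

```python
import numpy as np

def consistency_uncertainty(poses):
    """Uncertainty score from the spread of per-modality pose estimates.

    poses: (M, D) pose vectors predicted independently from M modalities
    (e.g. RGB and depth). The mean distance of each estimate to the
    consensus (mean) estimate is low when the modalities agree and high
    when they conflict, which can flag cases for human review.
    """
    consensus = poses.mean(axis=0)
    return float(np.mean(np.linalg.norm(poses - consensus, axis=1)))
```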
In summary, this thesis presents a comprehensive approach to scene understanding for robotic grasping, with a particular emphasis on pose estimation. It addresses key challenges and advances the state of the art in this critical area of robotics research. The research is structured around five published papers, each contributing to a unique aspect of the overall study.