136 research outputs found
Multi-View Fusion for Multi-Level Robotic Scene Understanding
We present a system for multi-level scene awareness for robotic manipulation.
Given a sequence of camera-in-hand RGB images, the system calculates three
types of information: 1) a point cloud representation of all the surfaces in
the scene, for the purpose of obstacle avoidance; 2) the rough pose of unknown
objects from categories corresponding to primitive shapes (e.g., cuboids and
cylinders); and 3) full 6-DoF pose of known objects. By developing and fusing
recent techniques in these domains, we provide a rich scene representation for
robot awareness. We demonstrate the importance of each of these modules, their
complementary nature, and the potential benefits of the system in the context
of robotic manipulation.
Comment: Presented at IROS 2021. Video is at https://youtu.be/FuqMxuODGl
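Below is a minimal sketch (Python/NumPy, not the authors' code) of the multi-view idea behind the first module: back-projecting each camera-in-hand depth image and accumulating the points in a common world frame. The intrinsics `K`, the depth images, and the camera-to-world poses are assumed inputs from calibration and robot kinematics.

```python
# Minimal multi-view point-cloud fusion sketch (assumed inputs, not the paper's code).
import numpy as np

def backproject(depth, K):
    """Convert a depth image (H, W) in metres to camera-frame points (N, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)[valid]

def fuse_views(depths, poses, K):
    """Accumulate all views into one world-frame point cloud.

    poses: list of 4x4 camera-to-world transforms (e.g. from robot kinematics).
    """
    clouds = []
    for depth, T in zip(depths, poses):
        pts = backproject(depth, K)
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # homogeneous coordinates
        clouds.append((T @ pts_h.T).T[:, :3])              # transform to world frame
    return np.vstack(clouds)
```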
Recovering 6D Object Pose: A Review and Multi-modal Analysis
A large number of studies analyse object detection and pose estimation at the
visual level in 2D, discussing the effects of challenges such as occlusion,
clutter, and texture on the performance of methods that work in the RGB
modality. Incorporating depth data, this paper presents a thorough multi-modal
analysis. It discusses the above-mentioned challenges for full 6D object pose
estimation in RGB-D images, comparing the performance of several 6D detectors
in order to answer the following
questions: What is the current position of the computer vision community for
maintaining "automation" in robotic manipulation? What next steps should the
community take for improving "autonomy" in robotics while handling objects? Our
findings include: (i) reasonably accurate results are obtained on textured
objects at varying viewpoints with cluttered backgrounds; (ii) heavy occlusion
and clutter severely affect the detectors, and similar-looking distractors are
the biggest challenge in recovering instances' 6D poses; (iii) template-based
methods and random forest-based learning algorithms underlie object detection
and 6D pose estimation, while the recent paradigm is to learn deep
discriminative feature representations and to adopt CNNs taking RGB images as
input; and (iv) depending on the availability of large-scale 6D annotated depth
datasets, feature representations can be learnt on these datasets and then
customized for the 6D problem.
A Dataset for Developing and Benchmarking Active Vision
We present a new public dataset with a focus on simulating robotic vision
tasks in everyday indoor environments using real imagery. The dataset includes
20,000+ RGB-D images and 50,000+ 2D bounding boxes of object instances densely
captured in 9 unique scenes. We train a fast object category detector for
instance detection on our data. Using the dataset we show that, although
increasingly accurate and fast, the state of the art for object detection is
still severely impacted by object scale, occlusion, and viewing direction, all
of which matter for robotics applications. We next validate the dataset for
simulating active vision, and use the dataset to develop and evaluate a
deep-network-based system for next best move prediction for object
classification using reinforcement learning. Our dataset is available for
download at cs.unc.edu/~ammirato/active_vision_dataset_website/.
Comment: To appear at ICRA 201
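As a hedged illustration of the next-best-move component described above, the sketch below shows an epsilon-greedy choice over a hypothetical discrete move set; the per-move values would come from the trained deep network, which is not reproduced here.

```python
# Greedy next-best-move selection sketch; MOVES and move_values are illustrative assumptions.
import numpy as np

MOVES = ["forward", "backward", "left", "right", "rotate_cw", "rotate_ccw"]

def next_best_move(move_values, epsilon=0.1, rng=np.random.default_rng()):
    """Pick the camera move with the highest predicted value, with
    epsilon-greedy exploration as commonly used during reinforcement learning."""
    if rng.random() < epsilon:
        return MOVES[rng.integers(len(MOVES))]
    return MOVES[int(np.argmax(move_values))]
```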
Tight Robot Packing in the Real World: A Complete Manipulation Pipeline with Robust Primitives
Many order fulfillment applications in logistics, such as packing, involve
picking objects from unstructured piles before tightly arranging them in bins
or shipping containers. Desirable robotic solutions in this space need to be
low-cost, robust, easily deployable and simple to control. The current work
proposes a complete pipeline for solving packing tasks for cuboid objects,
given access only to RGB-D data and a single robot arm with a vacuum-based
end-effector, which is also used as a pushing or dragging finger. The pipeline
integrates perception for detecting the objects with planning for properly
picking and placing them. The key challenges correspond to sensing noise and
failures in execution, which appear at multiple steps of the process. To
achieve robustness, three uncertainty-reducing manipulation primitives are
proposed, which take advantage of the end-effector's and the workspace's
compliance, to successfully and tightly pack multiple cuboid objects. The
overall solution is demonstrated to be robust to execution and perception
errors. The impact of each manipulation primitive is evaluated in extensive
real-world experiments by considering different versions of the pipeline.
Furthermore, an open-source simulation framework is provided for modeling such
packing operations. Ablation studies are performed within this simulation
environment to evaluate features of the proposed primitives.
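The following structural sketch (an outline under stated assumptions, not the authors' implementation) illustrates how such a pipeline can interleave perception, placement planning, and uncertainty-reducing primitives; the calls `perceive`, `plan_placement`, and the `robot` methods are hypothetical stubs standing in for the corresponding modules.

```python
# Structural pick-and-pack loop sketch; all module calls are hypothetical stubs.
def pack_bin(perceive, plan_placement, robot):
    """Repeatedly detect a cuboid in the pile, pick it with the vacuum
    end-effector, and place it tightly in the bin."""
    while True:
        detections = perceive()              # cuboid poses from RGB-D perception
        if not detections:
            break
        target = detections[0]
        placement = plan_placement(target)   # tight placement pose in the bin
        if not robot.vacuum_pick(target):    # failed pick: reduce uncertainty first
            robot.push_to_singulate(target)  # compliance-based primitive
            continue
        robot.place_and_drag(placement)      # drag against neighbours to pack tightly
```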
Active recognition and pose estimation of rigid and deformable objects in 3D space
Object recognition and pose estimation is a fundamental problem in computer vision and of utmost importance in robotic applications. Object recognition refers to the problem of recognizing certain object instances, or categorizing objects into specific classes. Pose estimation deals with estimating the exact position and orientation of the object in 3D space, with the orientation usually expressed in Euler angles. There are generally two types of objects that require special care when designing solutions to the aforementioned problems: rigid and deformable. Dealing with deformable objects has been a much harder problem, and solutions that apply to rigid objects usually fail when used for deformable objects due to the inherent assumptions made during their design.
In this thesis we deal with object categorization, instance recognition and pose estimation of both rigid and deformable objects. In particular, we are interested in a special type of deformable objects: clothes. We tackle the problem of autonomously recognizing and unfolding articles of clothing using a dual-arm manipulator. This problem consists of grasping an article from a random point, recognizing it, and then bringing it into an unfolded state with a dual-arm robot. We propose a data-driven method for clothes recognition from depth images using Random Decision Forests. We also propose a method for unfolding an article of clothing after estimating and grasping two key-points, using Hough Forests. Both methods are implemented within a POMDP framework, allowing the robot to interact optimally with the garments while taking into account uncertainty in the recognition and point estimation process. This active recognition and unfolding makes our system very robust to noisy observations. Our methods were tested on regular-sized clothes using a dual-arm manipulator and perform better in both accuracy and speed than state-of-the-art approaches.
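For illustration only, a garment classifier in the spirit of the Random Decision Forests mentioned above can be sketched with scikit-learn; the flattened depth-image features and the synthetic data are simplifying assumptions, and the thesis' actual split functions and POMDP integration are not reproduced.

```python
# Illustrative random-forest classifier over (dummy) depth-image features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical training data: 200 depth images (32x32) of 4 garment classes.
X_train = rng.random((200, 32 * 32))
y_train = rng.integers(0, 4, size=200)

clf = RandomForestClassifier(n_estimators=100, max_depth=12, random_state=0)
clf.fit(X_train, y_train)

# Class posterior for a new depth observation; in a POMDP these posteriors
# would update the belief over garment categories between robot actions.
probs = clf.predict_proba(rng.random((1, 32 * 32)))
print(probs)
```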
In order to take advantage of the robotic manipulator and increase the accuracy of our system, we developed a novel approach to address generic active vision problems, called Active Random Forests. While the state of the art focuses on selecting the best viewing parameters based on single-view classifiers, we propose a multi-view classifier where the decision mechanism for optimally changing viewing parameters is inherent to the classification process. This has many advantages: a) the classifier exploits the entire set of captured images and does not simply aggregate per-view hypotheses probabilistically; b) actions are based on disambiguating features learnt from all views and are optimally selected using the powerful voting scheme of Random Forests; and c) the classifier can take into account the costs of actions. The proposed framework was applied to the same task of autonomously unfolding clothes by a robot, addressing the problems of best viewpoint selection for classification, grasp point estimation and pose estimation of garments. We show a significant performance improvement compared to state-of-the-art methods and our previous POMDP formulation.
Moving from deformable to rigid objects while keeping our interest in domestic robotic applications, we focus on object instance recognition and 3D pose estimation of household objects. We are particularly interested in realistic scenes that are very crowded and where objects can be perceived under severe occlusions. Single-shot 6D pose estimators with manually designed features are still unable to tackle such difficult scenarios for a variety of objects, motivating research towards unsupervised feature learning and next-best-view estimation. We present a complete framework for both single-shot 6D object pose estimation and next-best-view prediction based on Hough Forests, a state-of-the-art object pose estimator that performs classification and regression jointly. Rather than using manually designed features, we propose unsupervised features learnt from depth-invariant patches using a Sparse Autoencoder. Furthermore, taking advantage of the clustering performed in the leaf nodes of Hough Forests, we learn to estimate the reduction of uncertainty in other views, formulating the problem of selecting the next-best-view. To further improve 6D object pose estimation, we propose an improved joint registration and hypothesis verification module as a final refinement step to reject false detections. We provide two additional challenging datasets inspired by realistic scenarios to extensively evaluate the state of the art and our framework. One is related to domestic environments and the other depicts a bin-picking scenario mostly found in industrial settings. We show that our framework significantly outperforms the state of the art on both public datasets and our own.
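A minimal PyTorch sketch of the unsupervised feature-learning step, i.e. a sparse autoencoder trained on depth patches, is given below; the patch size, layer widths and L1 sparsity weight are illustrative assumptions rather than the thesis' settings.

```python
# Sparse-autoencoder sketch on (dummy) depth-invariant patches; settings are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, patch_dim=16 * 16, code_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(patch_dim, code_dim), nn.Sigmoid())
        self.decoder = nn.Linear(code_dim, patch_dim)

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = SparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
patches = torch.rand(256, 16 * 16)          # placeholder depth patches

for _ in range(10):                          # a few illustrative epochs
    recon, code = model(patches)
    # Reconstruction loss plus an L1 penalty on activations to encourage sparsity.
    loss = nn.functional.mse_loss(recon, patches) + 1e-3 * code.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```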
Unsupervised feature learning, although efficient, might produce sub-optimal features for our particular task. Therefore, in our last work, we leverage the power of Convolutional Neural Networks to tackle the problem of estimating the pose of rigid objects with an end-to-end deep regression network. To improve the moderate performance of the standard regression objective function, we introduce the Siamese Regression Network. For a given image pair, we enforce a similarity measure between the representations of the sample images in the feature and pose spaces respectively, which is shown to boost regression performance. Furthermore, we argue that our pose-guided feature learning using the Siamese Regression Network generates more discriminative features that outperform the state of the art. Lastly, our feature learning formulation provides the ability to learn features that can perform under severe occlusions, demonstrating high performance on our novel hand-object dataset.
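The pairwise objective behind such a Siamese regression network can be sketched as follows (PyTorch); the backbone, pose parameterization and loss weight are assumptions made for illustration, not the thesis' exact configuration.

```python
# Pose-guided pairwise loss sketch: feature distances are encouraged to follow pose distances.
import torch
import torch.nn as nn

class SiameseRegressor(nn.Module):
    def __init__(self, in_dim=512, feat_dim=64, pose_dim=4):  # e.g. quaternion output
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, pose_dim)

    def forward(self, x):
        f = self.backbone(x)
        return f, self.head(f)

def siamese_regression_loss(model, x1, x2, pose1, pose2, alpha=0.1):
    f1, p1 = model(x1)
    f2, p2 = model(x2)
    # Direct pose regression for both images of the pair.
    reg = nn.functional.mse_loss(p1, pose1) + nn.functional.mse_loss(p2, pose2)
    # Pose-guided term: distance in feature space should match distance in pose space.
    feat_dist = (f1 - f2).norm(dim=1)
    pose_dist = (pose1 - pose2).norm(dim=1)
    return reg + alpha * nn.functional.mse_loss(feat_dist, pose_dist)
```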
In conclusion, this work is research in the area of object detection and pose estimation in 3D space for a variety of object types. Furthermore, we investigate how accuracy can be further improved by applying active vision techniques that optimally move the camera to minimize the detection error.
Learning Kinematic Descriptions using SPARE: Simulated and Physical ARticulated Extendable dataset
Next generation robots will need to understand intricate and articulated
objects as they cooperate in human environments. To do so, these robots will
need to move beyond their current abilities---working with relatively simple
objects in a task-indifferent manner---toward more sophisticated abilities
that dynamically estimate the properties of complex, articulated objects. To
that end, we make two compelling contributions toward general articulated
(physical) object understanding in this paper. First, we introduce a new
dataset, SPARE: Simulated and Physical ARticulated Extendable dataset. SPARE is
an extendable open-source dataset providing equivalent simulated and physical
instances of articulated objects (kinematic chains), providing the greater
research community with a training and evaluation tool for methods generating
kinematic descriptions of articulated objects. To the best of our knowledge,
this is the first joint visual and physical (3D-printable) dataset for the
Vision community. Second, we present a deep neural network that can predict
the number of links and the length of the links of an articulated object. These
new ideas outperform classical approaches to understanding kinematic chains,
such as tracking-based methods, which fail in the case of occlusion and do not
leverage multiple views when available.
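A hedged sketch of such a network is shown below: a small convolutional backbone with one head classifying the number of links and another regressing per-link lengths. The input resolution, architecture, and maximum link count are illustrative assumptions, not the paper's model.

```python
# Two-head link predictor sketch: link-count classification plus length regression.
import torch
import torch.nn as nn

class LinkPredictor(nn.Module):
    def __init__(self, max_links=5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.count_head = nn.Linear(32, max_links)     # logits over number of links
        self.length_head = nn.Linear(32, max_links)    # per-link length estimates

    def forward(self, image):
        feat = self.backbone(image)
        return self.count_head(feat), self.length_head(feat)

net = LinkPredictor()
logits, lengths = net(torch.rand(1, 3, 128, 128))      # dummy RGB input
```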
Data-Efficient Learning for Sim-to-Real Robotic Grasping using Deep Point Cloud Prediction Networks
Training a deep network policy for robot manipulation is notoriously costly
and time consuming as it depends on collecting a significant amount of real
world data. To work well in the real world, the policy needs to see many
instances of the task, including various object arrangements in the scene as
well as variations in object geometry, texture, material, and environmental
illumination.
In this paper, we propose a method that learns to perform table-top instance
grasping of a wide variety of objects while using no real world grasping data,
outperforming the baseline using 2.5D shape by 10%. Our method learns a 3D
point cloud representation of the object and uses it to train a
domain-invariant grasping policy. We
formulate the learning process as a two-step procedure: 1) Learning a
domain-invariant 3D shape representation of objects from about 76K episodes in
simulation and about 530 episodes in the real world, where each episode lasts
less than a minute, and 2) Learning a critic grasping policy in simulation only,
based on the 3D shape representation from step 1. Our real world data
collection in step 1 is both cheaper and faster compared to existing approaches
as it only requires taking multiple snapshots of the scene using an RGB-D camera.
Finally, the learned 3D representation is not specific to grasping, and can
potentially be used in other interaction tasks.
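To illustrate the second step, the sketch below scores grasp candidates against a (predicted) object point cloud with a simple permutation-invariant critic; the pooling backbone and grasp parameterization are simplifying assumptions, not the paper's model.

```python
# Critic sketch: score grasp candidates given a predicted object point cloud.
import torch
import torch.nn as nn

class GraspCritic(nn.Module):
    def __init__(self, grasp_dim=7):          # e.g. 3D position + quaternion (assumed)
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU())
        self.score = nn.Sequential(nn.Linear(64 + grasp_dim, 64), nn.ReLU(),
                                   nn.Linear(64, 1))

    def forward(self, cloud, grasp):
        # cloud: (B, N, 3) predicted 3D points; grasp: (B, grasp_dim) candidate.
        feat = self.point_mlp(cloud).max(dim=1).values   # permutation-invariant pooling
        return self.score(torch.cat([feat, grasp], dim=1))

critic = GraspCritic()
scores = critic(torch.rand(4, 256, 3), torch.rand(4, 7))  # pick the argmax-scoring grasp
```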
Active 6D Multi-Object Pose Estimation in Cluttered Scenarios with Deep Reinforcement Learning
In this work, we explore how a strategic selection of camera movements can
facilitate the task of 6D multi-object pose estimation in cluttered scenarios
while respecting real-world constraints important in robotics and augmented
reality applications, such as time and distance traveled. In the proposed
framework, a set of object hypotheses, inferred by an object pose estimator and
subsequently selected spatio-temporally by a fusion function that uses a
verification score to circumvent the need for ground-truth annotations, is
given to an agent. The agent reasons about these
hypotheses, directing its attention to the object it is most uncertain about
and moving the camera towards that object. Unlike previous works that
propose short-sighted policies, our agent is trained in simulated scenarios
using reinforcement learning, attempting to learn the camera moves that produce
the most accurate object pose hypotheses for a given temporal and spatial
budget, without the need for viewpoint rendering during inference. Our
experiments show that the proposed approach successfully estimates the 6D poses
of a stack of objects in both challenging cluttered synthetic and real
scenarios, showing superior performance compared to strong baselines.
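The attention rule described above can be illustrated with the following hedged sketch: among the current hypotheses, the agent targets the object with the lowest verification score (highest uncertainty) and takes a bounded camera step towards it. The hypothesis structure and step size are placeholder assumptions.

```python
# Uncertainty-driven target selection and camera motion sketch (placeholder data structures).
import numpy as np

def most_uncertain_object(hypotheses):
    """hypotheses: list of dicts with a 'position' (3,) and a 'verification'
    score in [0, 1]; lower verification = higher uncertainty."""
    idx = int(np.argmin([h["verification"] for h in hypotheses]))
    return idx, np.asarray(hypotheses[idx]["position"])

def camera_step_towards(cam_pos, target_pos, step=0.05):
    """Move the camera a bounded step toward the selected object."""
    direction = target_pos - cam_pos
    norm = np.linalg.norm(direction)
    return cam_pos if norm < 1e-6 else cam_pos + step * direction / norm
```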
Robust Scene Estimation for Goal-directed Robotic Manipulation in Unstructured Environments
To make autonomous robots "taskable" so that they function properly and interact fluently with human partners, they must be able to perceive and understand the semantic aspects of their environments. More specifically, they must know what objects exist and where they are in the unstructured human world. Progress in robot perception, especially in deep learning, has greatly improved the detection and localization of objects. However, it remains a challenge for robots to perform highly reliable scene estimation in unstructured environments, which is determined by robustness, adaptability and scale. In this dissertation, we address the scene estimation problem under uncertainty, especially in unstructured environments. We enable robots to build a reliable object-oriented representation that describes the objects present in the environment, as well as inter-object spatial relations. Specifically, we focus on addressing the following challenges for reliable scene estimation: 1) robust perception under uncertainty resulting from noisy sensors, objects in clutter and perceptual aliasing, 2) adaptable perception in adverse conditions through combined deep learning and probabilistic generative methods, and 3) scalable perception as the number of objects grows and the structure of objects becomes more complex (e.g. objects in dense clutter).
Towards realizing robust perception, our objective is to ground raw sensor observations into scene states while dealing with uncertainty from sensor measurements and actuator control. Scene states are represented as scene graphs, which denote parameterized axiomatic statements that assert relationships between objects and their poses. To deal with this uncertainty, we present a purely generative approach, Axiomatic Scene Estimation (AxScEs). AxScEs estimates a probabilistic distribution across plausible scene graph hypotheses describing the configuration of objects. By maintaining a diverse set of possible states, the proposed approach demonstrates robustness to local minima in the scene graph state space and effectiveness for manipulation-quality perception, evaluated via edit distance on scene graphs.
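An illustrative sketch of maintaining a weighted set of scene-graph hypotheses in this spirit is given below; the observation-likelihood function stands in for the generative measurement model and is purely hypothetical.

```python
# Weighted scene-graph hypothesis set: reweight by an (assumed) observation likelihood,
# then resample to keep a diverse set of plausible scene graphs.
import numpy as np

def reweight(hypotheses, weights, likelihood, observation):
    """hypotheses: list of scene graphs; weights: (K,) prior weights."""
    w = np.array([weights[k] * likelihood(h, observation)
                  for k, h in enumerate(hypotheses)])
    return w / w.sum()

def resample(hypotheses, weights, rng=np.random.default_rng()):
    """Importance resampling over the hypothesis set."""
    idx = rng.choice(len(hypotheses), size=len(hypotheses), p=weights)
    return [hypotheses[i] for i in idx]
```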
To scale up to more unstructured scenarios and be adaptable to adversarial scenarios, we present Sequential Scene Understanding and Manipulation (SUM), which estimates the scene as a collection of objects in cluttered environments. SUM is a two-stage method that combines the accuracy and efficiency of convolutional neural networks (CNNs) with probabilistic inference methods. Despite their strengths, CNNs are opaque about how decisions are made and fragile when generalizing beyond their training samples in adverse conditions (e.g., changes in illumination). The probabilistic generative method complements these weaknesses and provides an avenue for adaptable perception.
To scale up to densely cluttered environments where objects are physically touching and severely occluded, we present GeoFusion, which fuses noisy observations from multiple frames by exploiting geometric consistency at the object level. Geometric consistency characterizes geometric compatibility between objects and geometric similarity between observations and objects. Reasoning about geometry at the object level offers a fast and reliable way to be robust to semantic perceptual aliasing. The proposed approach demonstrates greater robustness and accuracy than state-of-the-art pose estimation approaches.
Estimating Metric Poses of Dynamic Objects Using Monocular Visual-Inertial Fusion
A monocular 3D object tracking system generally has only up-to-scale pose
estimation results without any prior knowledge of the tracked object. In this
paper, we propose a novel idea to recover the metric scale of an arbitrary
dynamic object by optimizing the trajectory of the object in the world frame,
without motion assumptions. By introducing an additional constraint in the time
domain, our monocular visual-inertial tracking system can obtain continuous six
degree of freedom (6-DoF) pose estimation without scale ambiguity. Our method
requires neither fixed multi-camera nor depth sensor settings for scale
observability; instead, the IMU inside the monocular sensing suite provides
scale information for both the camera itself and the tracked object. We build the
proposed system on top of our monocular visual-inertial system (VINS) to obtain
accurate state estimation of the monocular camera in the world frame. The whole
system consists of a 2D object tracker, an object region-based visual bundle
adjustment (BA), VINS and a correlation analysis-based metric scale estimator.
Experimental comparisons with ground truth demonstrate the accuracy of our 3D
tracking, while a mobile augmented reality (AR) demo shows the feasibility of
potential applications.
Comment: IROS 201
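For intuition, the sketch below shows one simplified way a single metric scale factor can be recovered: a least-squares alignment of an up-to-scale trajectory with a metric one. This is a simplification of the correlation-analysis estimator described in the abstract, with synthetic data standing in for real trajectories.

```python
# Least-squares metric-scale fit between an up-to-scale and a metric trajectory.
import numpy as np

def fit_scale(visual_disp, metric_disp):
    """visual_disp, metric_disp: (N, 3) displacement vectors over time.
    Returns the s minimising || s * visual_disp - metric_disp ||^2."""
    num = np.sum(visual_disp * metric_disp)
    den = np.sum(visual_disp * visual_disp)
    return num / den

# Example: an up-to-scale trajectory that is really 2.0x smaller than metric.
true = np.cumsum(np.random.default_rng(1).normal(size=(50, 3)), axis=0)
print(fit_scale(true / 2.0, true))   # ~2.0
```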