Robust Scene Estimation for Goal-directed Robotic Manipulation in Unstructured Environments

Abstract

To make autonomous robots "taskable" so that they function properly and interact fluently with human partners, they must be able to perceive and understand the semantic aspects of their environments. More specifically, they must know what objects exist and where they are in the unstructured human world. Progress in robot perception, especially in deep learning, has greatly improved the detection and localization of objects. However, it remains a challenge for robots to perform highly reliable scene estimation in unstructured environments, where reliability is determined by robustness, adaptability and scale. In this dissertation, we address the scene estimation problem under uncertainty, especially in unstructured environments. We enable robots to build a reliable object-oriented representation that describes the objects present in the environment as well as inter-object spatial relations. Specifically, we focus on the following challenges for reliable scene estimation: 1) robust perception under uncertainty that results from noisy sensors, objects in clutter and perceptual aliasing; 2) adaptable perception in adverse conditions through combined deep learning and probabilistic generative methods; 3) scalable perception as the number of objects grows and the structure of objects becomes more complex (e.g., objects in dense clutter). Towards realizing robust perception, our objective is to ground raw sensor observations into scene states while dealing with uncertainty from sensor measurements and actuator control. Scene states are represented as scene graphs, which denote parameterized axiomatic statements that assert relationships between objects and their poses. To deal with the uncertainty, we present a purely generative approach, Axiomatic Scene Estimation (AxScEs). AxScEs estimates a probability distribution over plausible scene graph hypotheses describing the configuration of objects.
By maintaining a diverse set of possible states, the proposed approach demonstrates robustness to local minima in the scene graph state space and effectiveness for manipulation-quality perception based on edit distance on scene graphs. To scale up to more unstructured scenarios and adapt to adversarial scenarios, we present Sequential Scene Understanding and Manipulation (SUM), which estimates the scene as a collection of objects in cluttered environments. SUM is a two-stage method that combines the accuracy and efficiency of convolutional neural networks (CNNs) with probabilistic inference methods. Despite their strengths, CNNs are opaque about how their decisions are made and generalize poorly beyond overfit training samples in adverse conditions (e.g., changes in illumination). The probabilistic generative method compensates for these weaknesses and provides an avenue for adaptable perception. To scale up to densely cluttered environments where objects are physically touching with severe occlusions, we present GeoFusion, which fuses noisy observations from multiple frames by exploring geometric consistency at the object level. Geometric consistency characterizes geometric compatibility between objects and geometric similarity between observations and objects. It reasons about geometry at the object level, offering a fast and reliable way to be robust to semantic perceptual aliasing. The proposed approach demonstrates greater robustness and accuracy than the state-of-the-art pose estimation approach.

PHD
Computer Science & Engineering
University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/163060/1/zsui_1.pd
