6,301 research outputs found

    Semantic Robot Programming for Goal-Directed Manipulation in Cluttered Scenes

    Full text link
    We present the Semantic Robot Programming (SRP) paradigm as a convergence of robot programming by demonstration and semantic mapping. In SRP, a user can directly program a robot manipulator by demonstrating a snapshot of their intended goal scene in workspace. The robot then parses this goal as a scene graph comprised of object poses and inter-object relations, assuming known object geometries. Task and motion planning is then used to realize the user's goal from an arbitrary initial scene configuration. Even when faced with different initial scene configurations, SRP enables the robot to seamlessly adapt to reach the user's demonstrated goal. For scene perception, we propose the Discriminatively-Informed Generative Estimation of Scenes and Transforms (DIGEST) method to infer the initial and goal states of the world from RGBD images. The efficacy of SRP with DIGEST perception is demonstrated for the task of tray-setting with a Michigan Progress Fetch robot. Scene perception and task execution are evaluated with a public household occlusion dataset and our cluttered scene dataset.Comment: published in ICRA 201

    Data-Driven Grasp Synthesis - A Survey

    Full text link
    We review the work on data-driven grasp synthesis and the methodologies for sampling and ranking candidate grasps. We divide the approaches into three groups based on whether they synthesize grasps for known, familiar or unknown objects. This structure allows us to identify common object representations and perceptual processes that facilitate the employed data-driven grasp synthesis technique. In the case of known objects, we concentrate on the approaches that are based on object recognition and pose estimation. In the case of familiar objects, the techniques use some form of a similarity matching to a set of previously encountered objects. Finally for the approaches dealing with unknown objects, the core part is the extraction of specific features that are indicative of good grasps. Our survey provides an overview of the different methodologies and discusses open problems in the area of robot grasping. We also draw a parallel to the classical approaches that rely on analytic formulations.Comment: 20 pages, 30 Figures, submitted to IEEE Transactions on Robotic

    Synthesizing Training Data for Object Detection in Indoor Scenes

    Full text link
    Detection of objects in cluttered indoor environments is one of the key enabling functionalities for service robots. The best performing object detection approaches in computer vision exploit deep Convolutional Neural Networks (CNN) to simultaneously detect and categorize the objects of interest in cluttered scenes. Training of such models typically requires large amounts of annotated training data which is time consuming and costly to obtain. In this work we explore the ability of using synthetically generated composite images for training state-of-the-art object detectors, especially for object instance detection. We superimpose 2D images of textured object models into images of real environments at variety of locations and scales. Our experiments evaluate different superimposition strategies ranging from purely image-based blending all the way to depth and semantics informed positioning of the object models into real scenes. We demonstrate the effectiveness of these object detector training strategies on two publicly available datasets, the GMU-Kitchens and the Washington RGB-D Scenes v2. As one observation, augmenting some hand-labeled training data with synthetic examples carefully composed onto scenes yields object detectors with comparable performance to using much more hand-labeled data. Broadly, this work charts new opportunities for training detectors for new objects by exploiting existing object model repositories in either a purely automatic fashion or with only a very small number of human-annotated examples.Comment: Added more experiments and link to project webpag

    Robust Scene Estimation for Goal-directed Robotic Manipulation in Unstructured Environments

    Full text link
    To make autonomous robots "taskable" so that they function properly and interact fluently with human partners, they must be able to perceive and understand the semantic aspects of their environments. More specifically, they must know what objects exist and where they are in the unstructured human world. Progresses in robot perception, especially in deep learning, have greatly improved for detecting and localizing objects. However, it still remains a challenge for robots to perform a highly reliable scene estimation in unstructured environments that is determined by robustness, adaptability and scale. In this dissertation, we address the scene estimation problem under uncertainty, especially in unstructured environments. We enable robots to build a reliable object-oriented representation that describes objects present in the environment, as well as inter-object spatial relations. Specifically, we focus on addressing following challenges for reliable scene estimation: 1) robust perception under uncertainty results from noisy sensors, objects in clutter and perceptual aliasing, 2) adaptable perception in adverse conditions by combined deep learning and probabilistic generative methods, 3) scalable perception as the number of objects grows and the structure of objects becomes more complex (e.g. objects in dense clutter). Towards realizing robust perception, our objective is to ground raw sensor observations into scene states while dealing with uncertainty from sensor measurements and actuator control . Scene states are represented as scene graphs, where scene graphs denote parameterized axiomatic statements that assert relationships between objects and their poses. To deal with the uncertainty, we present a pure generative approach, Axiomatic Scene Estimation (AxScEs). AxScEs estimates a probabilistic distribution across plausible scene graph hypotheses describing the configuration of objects. By maintaining a diverse set of possible states, the proposed approach demonstrates the robustness to the local minimum in the scene graph state space and effectiveness for manipulation-quality perception based on edit distance on scene graphs. To scale up to more unstructured scenarios and be adaptable to adversarial scenarios, we present Sequential Scene Understanding and Manipulation (SUM), which estimates the scene as a collection of objects in cluttered environments. SUM is a two-stage method that leverages the accuracy and efficiency from convolutional neural networks (CNNs) with probabilistic inference methods. Despite the strength from CNNs, they are opaque in understanding how the decisions are made and fragile for generalizing beyond overfit training samples in adverse conditions (e.g., changes in illumination). The probabilistic generative method complements these weaknesses and provides an avenue for adaptable perception. To scale up to densely cluttered environments where objects are physically touching with severe occlusions, we present GeoFusion, which fuses noisy observations from multiple frames by exploring geometric consistency at object level. Geometric consistency characterizes geometric compatibility between objects and geometric similarity between observations and objects. It reasons about geometry at the object-level, offering a fast and reliable way to be robust to semantic perceptual aliasing. The proposed approach demonstrates greater robustness and accuracy than the state-of-the-art pose estimation approach.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/163060/1/zsui_1.pd

    Efficient Belief Propagation for Perception and Manipulation in Clutter

    Full text link
    Autonomous service robots are required to perform tasks in common human indoor environments. To achieve goals associated with these tasks, the robot should continually perceive, reason its environment, and plan to manipulate objects, which we term as goal-directed manipulation. Perception remains the most challenging aspect of all stages, as common indoor environments typically pose problems in recognizing objects under inherent occlusions with physical interactions among themselves. Despite recent progress in the field of robot perception, accommodating perceptual uncertainty due to partial observations remains challenging and needs to be addressed to achieve the desired autonomy. In this dissertation, we address the problem of perception under uncertainty for robot manipulation in cluttered environments using generative inference methods. Specifically, we aim to enable robots to perceive partially observable environments by maintaining an approximate probability distribution as a belief over possible scene hypotheses. This belief representation captures uncertainty resulting from inter-object occlusions and physical interactions, which are inherently present in clutterred indoor environments. The research efforts presented in this thesis are towards developing appropriate state representations and inference techniques to generate and maintain such belief over contextually plausible scene states. We focus on providing the following features to generative inference while addressing the challenges due to occlusions: 1) generating and maintaining plausible scene hypotheses, 2) reducing the inference search space that typically grows exponentially with respect to the number of objects in a scene, 3) preserving scene hypotheses over continual observations. To generate and maintain plausible scene hypotheses, we propose physics informed scene estimation methods that combine a Newtonian physics engine within a particle based generative inference framework. The proposed variants of our method with and without a Monte Carlo step showed promising results on generating and maintaining plausible hypotheses under complete occlusions. We show that estimating such scenarios would not be possible by the commonly adopted 3D registration methods without the notion of a physical context that our method provides. To scale up the context informed inference to accommodate a larger number of objects, we describe a factorization of scene state into object and object-parts to perform collaborative particle-based inference. This resulted in the Pull Message Passing for Nonparametric Belief Propagation (PMPNBP) algorithm that caters to the demands of the high-dimensional multimodal nature of cluttered scenes while being computationally tractable. We demonstrate that PMPNBP is orders of magnitude faster than the state-of-the-art Nonparametric Belief Propagation method. Additionally, we show that PMPNBP successfully estimates poses of articulated objects under various simulated occlusion scenarios. To extend our PMPNBP algorithm for tracking object states over continuous observations, we explore ways to propose and preserve hypotheses effectively over time. This resulted in an augmentation-selection method, where hypotheses are drawn from various proposals followed by the selection of a subset using PMPNBP that explained the current state of the objects. We discuss and analyze our augmentation-selection method with its counterparts in belief propagation literature. Furthermore, we develop an inference pipeline for pose estimation and tracking of articulated objects in clutter. In this pipeline, the message passing module with the augmentation-selection method is informed by segmentation heatmaps from a trained neural network. In our experiments, we show that our proposed pipeline can effectively maintain belief and track articulated objects over a sequence of observations under occlusion.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/163159/1/kdesingh_1.pd

    GG-LLM: Geometrically Grounding Large Language Models for Zero-shot Human Activity Forecasting in Human-Aware Task Planning

    Full text link
    A robot in a human-centric environment needs to account for the human's intent and future motion in its task and motion planning to ensure safe and effective operation. This requires symbolic reasoning about probable future actions and the ability to tie these actions to specific locations in the physical environment. While one can train behavioral models capable of predicting human motion from past activities, this approach requires large amounts of data to achieve acceptable long-horizon predictions. More importantly, the resulting models are constrained to specific data formats and modalities. Moreover, connecting predictions from such models to the environment at hand to ensure the applicability of these predictions is an unsolved problem. We present a system that utilizes a Large Language Model (LLM) to infer a human's next actions from a range of modalities without fine-tuning. A novel aspect of our system that is critical to robotics applications is that it links the predicted actions to specific locations in a semantic map of the environment. Our method leverages the fact that LLMs, trained on a vast corpus of text describing typical human behaviors, encode substantial world knowledge, including probable sequences of human actions and activities. We demonstrate how these localized activity predictions can be incorporated in a human-aware task planner for an assistive robot to reduce the occurrences of undesirable human-robot interactions by 29.2% on average
    • …
    corecore