7 research outputs found

    INVIGORATE: Interactive Visual Grounding and Grasping in Clutter

    Full text link
    This paper presents INVIGORATE, a robot system that interacts with humans through natural language and grasps a specified object in clutter. The objects may occlude, obstruct, or even stack on top of one another. INVIGORATE embodies several challenges: (i) inferring the target object among other occluding objects from input language expressions and RGB images, (ii) inferring object blocking relationships (OBRs) from the images, and (iii) synthesizing a multi-step plan to ask questions that disambiguate the target object and to grasp it successfully. We train separate neural networks for object detection, visual grounding, question generation, and OBR detection and grasping. They allow for unrestricted object categories and language expressions, subject to the training datasets. However, errors in visual perception and ambiguity in human language are inevitable and negatively impact the robot's performance. To overcome these uncertainties, we build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules. Through approximate POMDP planning, the robot tracks the history of observations and asks disambiguation questions in order to achieve a near-optimal sequence of actions that identify and grasp the target object. INVIGORATE combines the benefits of model-based POMDP planning and data-driven deep learning. Preliminary experiments with INVIGORATE on a Fetch robot show significant benefits of this integrated approach to object grasping in clutter with natural language interactions. A demonstration video is available at https://youtu.be/zYakh80SGcU. Comment: 12 pages, full version.
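    The decision loop described in the abstract can be pictured with a toy example. The sketch below is not INVIGORATE's implementation; it assumes hypothetical grounding scores and answer likelihoods in place of the paper's learned modules, and only shows how a belief over candidate target objects might be updated after a disambiguation question until the robot is confident enough to grasp.

```python
# Minimal sketch of a POMDP-style ask-or-grasp loop (illustrative values only;
# the grounding scores and answer likelihoods stand in for learned modules).

def normalize(belief):
    total = sum(belief.values())
    return {obj: p / total for obj, p in belief.items()}

def update_belief(belief, likelihoods):
    """Bayesian belief update: multiply the prior by the observation likelihood."""
    posterior = {obj: belief[obj] * likelihoods.get(obj, 1e-6) for obj in belief}
    return normalize(posterior)

def choose_action(belief, grasp_threshold=0.8):
    """Grasp when confident enough, otherwise ask about the most likely candidate."""
    best_obj, best_p = max(belief.items(), key=lambda kv: kv[1])
    if best_p >= grasp_threshold:
        return ("grasp", best_obj)
    return ("ask", best_obj)  # e.g. "Do you mean the red cup?"

# Hypothetical grounding scores for three detected objects given "the red cup".
belief = normalize({"red_cup": 0.55, "red_bowl": 0.30, "blue_cup": 0.15})

while True:
    action, target = choose_action(belief)
    if action == "grasp":
        print(f"grasping {target}")
        break
    print(f"asking about {target}")
    # Simulated "yes" answer: likelihood of that answer under each candidate
    # being the true target.
    belief = update_belief(belief, {"red_cup": 0.9, "red_bowl": 0.1, "blue_cup": 0.1})
```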


    Get PDF
    Department of Electrical Engineering
    Rotation invariance has been an important topic in computer vision tasks such as face detection [1], texture classification [2], and character recognition [3], to name a few. The importance of rotation-invariant properties remains for recent DNN-based approaches. In general, DNNs often require many more parameters, together with rotation-based data augmentation, to yield rotation-invariant outputs. Max pooling helps alleviate this issue, but since it is usually 2×2 [4], it only covers images rotated by very small angles. Recently, there have been works on rotation-invariant neural networks, such as rotating weights [5, 6], enlarged receptive fields using dilated convolutional neural networks (CNNs) [7] or a pyramid pooling layer [8], rotation region proposals for recognizing arbitrarily placed text [9], and polar transform networks to extract rotation-invariant features [10]. Applications of DNN-based object and grasp detection could be expanded significantly when the network output is processed by high-level reasoning over the relationships among objects. Recently, robotic grasp detection and object detection with reasoning have been investigated using deep neural networks (DNNs). There have been efforts to combine these tasks using separate networks so that robots can grasp specific target objects in cluttered, stacked, complex piles of novel objects from a single RGB-D camera. We propose a single multi-task DNN that yields accurate detection of objects, grasp positions, and relationship reasoning among objects. Our proposed method yields state-of-the-art performance, with accuracies of 98.6% and 74.2% and computation speeds of 33 and 62 frames per second on the VMRD and Cornell datasets, respectively. Our method also yielded a 95.3% grasp success rate for novel object grasping tasks with a 4-axis robot arm and an 86.7% grasp success rate on cluttered novel objects with a humanoid robot.
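    For illustration, here is a minimal sketch (in PyTorch, with hypothetical layer sizes and class counts, not the architecture from the abstract) of the shared-backbone, multi-head layout that a single multi-task detection/grasp/reasoning network implies:

```python
# Hypothetical multi-task layout: one shared feature extractor feeding separate
# heads for object classification, grasp prediction, and pairwise relationship
# reasoning. Sizes and head designs are illustrative, not the paper's.
import torch
import torch.nn as nn

class MultiTaskGraspNet(nn.Module):
    def __init__(self, num_classes=31, num_relations=3):
        super().__init__()
        # Shared convolutional backbone (stand-in for the real feature extractor).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Task-specific heads sharing the same features.
        self.det_head = nn.Linear(64, num_classes)        # object class logits
        self.grasp_head = nn.Linear(64, 5)                # (x, y, w, h, theta) grasp box
        self.rel_head = nn.Linear(64 * 2, num_relations)  # e.g. parent / child / no relation

    def forward(self, image, paired_object_feats=None):
        feats = self.backbone(image)
        cls_logits = self.det_head(feats)
        grasp = self.grasp_head(feats)
        rel_logits = None
        if paired_object_feats is not None:
            rel_logits = self.rel_head(paired_object_feats)  # relation for one object pair
        return cls_logits, grasp, rel_logits

# Usage on a dummy RGB image and a dummy pooled feature pair.
net = MultiTaskGraspNet()
cls_logits, grasp, rel_logits = net(torch.randn(1, 3, 224, 224), torch.randn(1, 128))
```

    A shared backbone with light task heads is what allows the three outputs to be produced in a single forward pass, which is how such a network can run at real-time frame rates.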

    Semantic Scene Understanding for Prediction of Action Effects in Humanoid Robot Manipulation Tasks

    Get PDF

    Investigating Scene Understanding for Robotic Grasping: From Pose Estimation to Explainable AI

    Get PDF
    In the rapidly evolving field of robotics, the ability to accurately grasp and manipulate objects, known as robotic grasping, is a cornerstone of autonomous operation. This capability is pivotal across a multitude of applications, from industrial manufacturing automation to supply chain management, and is a key determinant of a robot's ability to interact effectively with its environment. Central to this capability is scene understanding, a complex task that involves interpreting the robot's environment to facilitate decision-making and action planning. This thesis presents a comprehensive exploration of scene understanding for robotic grasping, with a particular emphasis on pose estimation. Pose estimation, the process of determining the position and orientation of objects within the robot's environment, is a crucial component of robotic grasping: it provides the robot with the spatial information about the objects in the scene needed to plan and execute grasping actions effectively. However, many current pose estimation methods express pose relative to a reference 3D model, which is not descriptive on its own without that model. This thesis explores the use of keypoints and superquadrics as more general and descriptive representations of an object's pose. These approaches address the limitations of traditional methods and significantly enhance the generalizability and descriptiveness of pose estimation, thereby improving the overall effectiveness of robotic grasping. In addition to pose estimation, this thesis briefly touches upon the importance of uncertainty estimation and explainable AI in the context of robotic grasping. It introduces the concept of multimodal consistency for uncertainty estimation, providing a reliable measure of uncertainty that can enhance decision-making in human-in-the-loop situations. Furthermore, it explores the realm of explainable AI, presenting a method for gaining deeper insights into deep learning models, thereby enhancing their transparency and interpretability. In summary, this thesis presents a comprehensive approach to scene understanding for robotic grasping, with a particular emphasis on pose estimation. It addresses key challenges and advances the state of the art in this critical area of robotics research. The research is structured around five published papers, each contributing to a unique aspect of the overall study.
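    As a concrete example of the compact shape-and-pose representations the thesis advocates, the snippet below evaluates the standard superquadric inside-outside function. The size and shape parameters are illustrative, and points are assumed to already be expressed in the superquadric's local frame (i.e. the estimated pose has been applied).

```python
# Standard superquadric inside-outside function F: F < 1 inside the shape,
# F == 1 on the surface, F > 1 outside. Parameters here are illustrative.
import numpy as np

def superquadric_io(points, a=(1.0, 1.0, 1.0), eps=(1.0, 1.0)):
    """Evaluate F for points given sizes a=(a1,a2,a3) and shape exponents eps=(e1,e2)."""
    x = points[..., 0] / a[0]
    y = points[..., 1] / a[1]
    z = points[..., 2] / a[2]
    e1, e2 = eps
    xy = (np.abs(x) ** (2.0 / e2) + np.abs(y) ** (2.0 / e2)) ** (e2 / e1)
    return xy + np.abs(z) ** (2.0 / e1)

# A unit sphere is the special case a=(1,1,1), eps=(1,1).
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(superquadric_io(pts))  # -> [0., 1., 4.]  (inside, on the surface, outside)
```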