
    Time-Contrastive Networks: Self-Supervised Learning from Video

    We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that captures the relationships between end-effectors (hands or robot grippers) and the environment, object attributes, and body pose. We train our representations using a metric learning loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. In other words, the model simultaneously learns to recognize what is common between different-looking images, and what is different between similar-looking images. This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm. While representations are learned from an unlabeled collection of task-related videos, robot behaviors such as pouring are learned by watching a single 3rd-person demonstration by a human. Reward functions obtained by following the human demonstrations under the learned representation enable efficient reinforcement learning that is practical for real-world robotic systems. Video results, open-source code and dataset are available at https://sermanet.github.io/imitat
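The metric learning signal described in this abstract can be illustrated with a triplet-style hinge loss. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function names, the margin value, the squared Euclidean distance, and the toy 2-D embeddings are all hypothetical choices made for illustration. The anchor and positive stand for two simultaneous viewpoints of the same moment, and the negative stands for a temporal neighbor from the same viewpoint.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def time_contrastive_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss (assumed form, margin value is illustrative).

    anchor:   embedding of a frame from viewpoint 1 at time t
    positive: embedding of the simultaneous frame from viewpoint 2
    negative: embedding of a temporally nearby frame from viewpoint 1
    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings, for illustration only.
anchor = [0.0, 1.0]         # viewpoint 1, time t
positive = [0.1, 0.9]       # viewpoint 2, time t
far_negative = [0.0, 0.5]   # temporal neighbor, already well separated
near_negative = [0.0, 0.8]  # temporal neighbor, still too close

loss_far = time_contrastive_triplet_loss(anchor, positive, far_negative)
loss_near = time_contrastive_triplet_loss(anchor, positive, near_negative)
print(loss_far, loss_near)
```

In this toy case the well-separated negative contributes no loss, while the nearby negative is penalized, which is the behavior the abstract describes: co-occurring views are attracted and visually similar temporal neighbors are repelled.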

    Sim2Real View Invariant Visual Servoing by Recurrent Control

    Humans are remarkably proficient at controlling their limbs and tools from a wide range of viewpoints and angles, even in the presence of optical distortions. In robotics, this ability is referred to as visual servoing: moving a tool or end-point to a desired location using primarily visual feedback. In this paper, we study how viewpoint-invariant visual servoing skills can be learned automatically in a robotic manipulation scenario. To this end, we train a deep recurrent controller that can automatically determine which actions move the end-point of a robotic arm to a desired object. The problem that must be solved by this controller is fundamentally ambiguous: under severe variation in viewpoint, it may be impossible to determine the actions in a single feedforward operation. Instead, our visual servoing system must use its memory of past movements to understand how the actions affect the robot motion from the current viewpoint, correcting mistakes and gradually moving closer to the target. This ability is in stark contrast to most visual servoing methods, which either assume known dynamics or require a calibration phase. We show how we can learn this recurrent controller using simulated data and a reinforcement learning objective. We then describe how the resulting model can be transferred to a real-world robot by disentangling perception from control and only adapting the visual layers. The adapted model can servo to previously unseen objects from novel viewpoints on a real-world Kuka IIWA robotic arm. For supplementary videos, see: https://fsadeghi.github.io/Sim2RealViewInvariantServo

    Pseudo-Dolly-In Video Generation Combining 3D Modeling and Image Reconstruction

    This paper proposes a pseudo-dolly-in video generation method that reproduces motion parallax by applying image reconstruction processing to multi-view videos. Since dolly-in video is taken by moving a camera forward to reproduce motion parallax, we can present a sense of immersion. However, at a sporting event in a large-scale space, moving a camera is difficult. Our research generates dolly-in video from multi-view images captured by fixed cameras. By applying the Image-Based Modeling technique, dolly-in video can be generated. Unfortunately, the video quality is often damaged by the 3D estimation error. On the other hand, Bullet-Time realizes high-quality video observation. However, moving the virtual-viewpoint from the capturing positions is difficult. To solve these problems, we propose a method to generate a pseudo-dolly-in image by installing 3D estimation and image reconstruction techniques into Bullet-Time and show its effectiveness by applying it to multi-view videos captured at an actual soccer stadium. In the experiment, we compared the proposed method with digital zoom images and with the dolly-in video generated from the Image-Based Modeling and Rendering method. Published in: 2017 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct). Date of Conference: 9-13 Oct. 2017. Conference Location: Nantes, France.

    Multi-Task Domain Adaptation for Deep Learning of Instance Grasping from Simulation

    Learning-based approaches to robotic manipulation are limited by the scalability of data collection and accessibility of labels. In this paper, we present a multi-task domain adaptation framework for instance grasping in cluttered scenes by utilizing simulated robot experiments. Our neural network takes monocular RGB images and the instance segmentation mask of a specified target object as inputs, and predicts the probability of successfully grasping the specified object for each candidate motor command. The proposed transfer learning framework trains a model for instance grasping in simulation and uses a domain-adversarial loss to transfer the trained model to real robots using indiscriminate grasping data, which is available both in simulation and the real world. We evaluate our model in real-world robot experiments, comparing it with alternative model architectures as well as an indiscriminate grasping baseline. Comment: ICRA 201
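The abstract mentions a domain-adversarial loss without giving its form. As a rough illustration only, the sketch below uses the commonly cited domain-adversarial formulation, in which a domain classifier tries to tell simulated from real features while the feature extractor is trained to make that classifier fail. This is an assumption, not the paper's stated objective: the function names, the weight `adv_weight`, and the example numbers are all hypothetical.

```python
import math


def binary_cross_entropy(prob, label):
    """Cross-entropy for one prediction; prob is P(domain = real)."""
    eps = 1e-12  # clamp to avoid log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def domain_classifier_loss(domain_probs, domain_labels):
    """Mean cross-entropy of the domain classifier (0 = sim, 1 = real)."""
    losses = [binary_cross_entropy(p, y)
              for p, y in zip(domain_probs, domain_labels)]
    return sum(losses) / len(losses)


def feature_extractor_objective(grasp_loss, domain_loss, adv_weight=0.1):
    """Assumed adversarial objective for the shared features.

    The features minimize the grasp prediction loss while maximizing the
    domain classifier's loss, hence the minus sign. adv_weight is a
    hypothetical trade-off coefficient.
    """
    return grasp_loss - adv_weight * domain_loss


# Hypothetical values: a classifier that cannot separate the domains
# outputs 0.5 for every sample, giving a loss of ln(2) per sample.
domain_loss = domain_classifier_loss([0.5, 0.5], [0, 1])
objective = feature_extractor_objective(grasp_loss=0.4, domain_loss=domain_loss)
print(domain_loss, objective)
```

When the classifier is at chance, its loss sits at ln 2, which is the regime the adversarial term pushes the shared features toward.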