32,428 research outputs found

    KOVIS: Keypoint-based Visual Servoing with Zero-Shot Sim-to-Real Transfer for Robotics Manipulation

    Full text link
    We present KOVIS, a novel learning-based, calibration-free visual servoing method for fine robotic manipulation tasks with eye-in-hand stereo camera system. We train the deep neural network only in the simulated environment; and the trained model could be directly used for real-world visual servoing tasks. KOVIS consists of two networks. The first keypoint network learns the keypoint representation from the image using with an autoencoder. Then the visual servoing network learns the motion based on keypoints extracted from the camera image. The two networks are trained end-to-end in the simulated environment by self-supervised learning without manual data labeling. After training with data augmentation, domain randomization, and adversarial examples, we are able to achieve zero-shot sim-to-real transfer to real-world robotic manipulation tasks. We demonstrate the effectiveness of the proposed method in both simulated environment and real-world experiment with different robotic manipulation tasks, including grasping, peg-in-hole insertion with 4mm clearance, and M13 screw insertion. The demo video is available at http://youtu.be/gfBJBR2tDzAComment: Accepted by IROS 202

    Manifestation of Image Contrast in Deep Networks

    Full text link
    Contrast is subject to dramatic changes across the visual field, depending on the source of light and scene configurations. Hence, the human visual system has evolved to be more sensitive to contrast than absolute luminance. This feature is equally desired for machine vision: the ability to recognise patterns even when aspects of them are transformed due to variation in local and global contrast. In this work, we thoroughly investigate the impact of image contrast on prominent deep convolutional networks, both during the training and testing phase. The results of conducted experiments testify to an evident deterioration in the accuracy of all state-of-the-art networks at low-contrast images. We demonstrate that "contrast-augmentation" is a sufficient condition to endow a network with invariance to contrast. This practice shows no negative side effects, quite the contrary, it might allow a model to refrain from other illuminance related over-fittings. This ability can also be achieved by a short fine-tuning procedure, which opens new lines of investigation on mechanisms involved in two networks whose weights are over 99.9% correlated, yet astonishingly produce utterly different outcomes. Our further analysis suggests that the optimisation algorithm is an influential factor, however with a significantly lower effect; and while the choice of an architecture manifests a negligible impact on this phenomenon, the first layers appear to be more critical

    A Causal View on Robustness of Neural Networks

    Full text link
    We present a causal view on the robustness of neural networks against input manipulations, which applies not only to traditional classification tasks but also to general measurement data. Based on this view, we design a deep causal manipulation augmented model (deep CAMA) which explicitly models possible manipulations on certain causes leading to changes in the observed effect. We further develop data augmentation and test-time fine-tuning methods to improve deep CAMA's robustness. When compared with discriminative deep neural networks, our proposed model shows superior robustness against unseen manipulations. As a by-product, our model achieves disentangled representation which separates the representation of manipulations from those of other latent causes.Comment: NeurIPS 202

    Fine-Tuning VGG Neural Network For Fine-grained State Recognition of Food Images

    Full text link
    State recognition of food images can be considered as one of the promising applications of object recognition and fine-grained image classification in computer vision. In this paper, evidence is provided for the power of convolutional neural network (CNN) for food state recognition, even with a small data set. In this study, we fine-tuned a CNN initially trained on a large natural image recognition dataset (Imagenet ILSVRC) and transferred the learned feature representations to the food state recognition task. A small-scale dataset consisting of 5978 images of seven categories was constructed and annotated manually. Data augmentation was applied to increase the size of the data.Comment: 5 pages, 7 figure

    RGB-D Object Detection and Semantic Segmentation for Autonomous Manipulation in Clutter

    Full text link
    Autonomous robotic manipulation in clutter is challenging. A large variety of objects must be perceived in complex scenes, where they are partially occluded and embedded among many distractors, often in restricted spaces. To tackle these challenges, we developed a deep-learning approach that combines object detection and semantic segmentation. The manipulation scenes are captured with RGB-D cameras, for which we developed a depth fusion method. Employing pretrained features makes learning from small annotated robotic data sets possible. We evaluate our approach on two challenging data sets: one captured for the Amazon Picking Challenge 2016, where our team NimbRo came in second in the Stowing and third in the Picking task, and one captured in disaster-response scenarios. The experiments show that object detection and semantic segmentation complement each other and can be combined to yield reliable object perception

    Deep Multimodal Embedding: Manipulating Novel Objects with Point-clouds, Language and Trajectories

    Full text link
    A robot operating in a real-world environment needs to perform reasoning over a variety of sensor modalities such as vision, language and motion trajectories. However, it is extremely challenging to manually design features relating such disparate modalities. In this work, we introduce an algorithm that learns to embed point-cloud, natural language, and manipulation trajectory data into a shared embedding space with a deep neural network. To learn semantically meaningful spaces throughout our network, we use a loss-based margin to bring embeddings of relevant pairs closer together while driving less-relevant cases from different modalities further apart. We use this both to pre-train its lower layers and fine-tune our final embedding space, leading to a more robust representation. We test our algorithm on the task of manipulating novel objects and appliances based on prior experience with other objects. On a large dataset, we achieve significant improvements in both accuracy and inference time over the previous state of the art. We also perform end-to-end experiments on a PR2 robot utilizing our learned embedding space.Comment: IEEE International Conference on Robotics and Automation (ICRA), 201

    Bayesian Grasp: Robotic visual stable grasp based on prior tactile knowledge

    Full text link
    Robotic grasp detection is a fundamental capability for intelligent manipulation in unstructured environments. Previous work mainly employed visual and tactile fusion to achieve stable grasp, while, the whole process depending heavily on regrasping, which wastes much time to regulate and evaluate. We propose a novel way to improve robotic grasping: by using learned tactile knowledge, a robot can achieve a stable grasp from an image. First, we construct a prior tactile knowledge learning framework with novel grasp quality metric which is determined by measuring its resistance to external perturbations. Second, we propose a multi-phases Bayesian Grasp architecture to generate stable grasp configurations through a single RGB image based on prior tactile knowledge. Results show that this framework can classify the outcome of grasps with an average accuracy of 86% on known objects and 79% on novel objects. The prior tactile knowledge improves the successful rate of 55% over traditional vision-based strategies.Comment: submitted to ICRA202

    Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

    Full text link
    Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.Comment: Examples available at https://goo.gl/magenta/maestro-example

    Robobarista: Learning to Manipulate Novel Objects via Deep Multimodal Embedding

    Full text link
    There is a large variety of objects and appliances in human environments, such as stoves, coffee dispensers, juice extractors, and so on. It is challenging for a roboticist to program a robot for each of these object types and for each of their instantiations. In this work, we present a novel approach to manipulation planning based on the idea that many household objects share similarly-operated object parts. We formulate the manipulation planning as a structured prediction problem and learn to transfer manipulation strategy across different objects by embedding point-cloud, natural language, and manipulation trajectory data into a shared embedding space using a deep neural network. In order to learn semantically meaningful spaces throughout our network, we introduce a method for pre-training its lower layers for multimodal feature embedding and a method for fine-tuning this embedding space using a loss-based margin. In order to collect a large number of manipulation demonstrations for different objects, we develop a new crowd-sourcing platform called Robobarista. We test our model on our dataset consisting of 116 objects and appliances with 249 parts along with 250 language instructions, for which there are 1225 crowd-sourced manipulation demonstrations. We further show that our robot with our model can even prepare a cup of a latte with appliances it has never seen before.Comment: Journal Versio

    Decompose to manipulate: Manipulable Object Synthesis in 3D Medical Images with Structured Image Decomposition

    Full text link
    The performance of medical image analysis systems is constrained by the quantity of high-quality image annotations. Such systems require data to be annotated by experts with years of training, especially when diagnostic decisions are involved. Such datasets are thus hard to scale up. In this context, it is hard for supervised learning systems to generalize to the cases that are rare in the training set but would be present in real-world clinical practices. We believe that the synthetic image samples generated by a system trained on the real data can be useful for improving the supervised learning tasks in the medical image analysis applications. Allowing the image synthesis to be manipulable could help synthetic images provide complementary information to the training data rather than simply duplicating the real-data manifold. In this paper, we propose a framework for synthesizing 3D objects, such as pulmonary nodules, in 3D medical images with manipulable properties. The manipulation is enabled by decomposing of the object of interests into its segmentation mask and a 1D vector containing the residual information. The synthetic object is refined and blended into the image context with two adversarial discriminators. We evaluate the proposed framework on lung nodules in 3D chest CT images and show that the proposed framework could generate realistic nodules with manipulable shapes, textures and locations, etc. By sampling from both the synthetic nodules and the real nodules from 2800 3D CT volumes during the classifier training, we show the synthetic patches could improve the overall nodule detection performance by average 8.44% competition performance metric (CPM) score
    • …
    corecore