9,276 research outputs found

    Generic 3D Representation via Pose Estimation and Matching

    Full text link
    Though a large body of computer vision research has investigated developing generic semantic representations, efforts towards developing a similar representation for 3D has been limited. In this paper, we learn a generic 3D representation through solving a set of foundational proxy 3D tasks: object-centric camera pose estimation and wide baseline feature matching. Our method is based upon the premise that by providing supervision over a set of carefully selected foundational tasks, generalization to novel tasks and abstraction capabilities can be achieved. We empirically show that the internal representation of a multi-task ConvNet trained to solve the above core problems generalizes to novel 3D tasks (e.g., scene layout estimation, object pose estimation, surface normal estimation) without the need for fine-tuning and shows traits of abstraction abilities (e.g., cross-modality pose estimation). In the context of the core supervised tasks, we demonstrate our representation achieves state-of-the-art wide baseline feature matching results without requiring apriori rectification (unlike SIFT and the majority of learned features). We also show 6DOF camera pose estimation given a pair local image patches. The accuracy of both supervised tasks come comparable to humans. Finally, we contribute a large-scale dataset composed of object-centric street view scenes along with point correspondences and camera pose information, and conclude with a discussion on the learned representation and open research questions.Comment: Published in ECCV16. See the project website http://3drepresentation.stanford.edu/ and dataset website https://github.com/amir32002/3D_Street_Vie

    Deep Object-Centric Representations for Generalizable Robot Learning

    Full text link
    Robotic manipulation in complex open-world scenarios requires both reliable physical manipulation skills and effective and generalizable perception. In this paper, we propose a method where general purpose pretrained visual models serve as an object-centric prior for the perception system of a learned policy. We devise an object-level attentional mechanism that can be used to determine relevant objects from a few trajectories or demonstrations, and then immediately incorporate those objects into a learned policy. A task-independent meta-attention locates possible objects in the scene, and a task-specific attention identifies which objects are predictive of the trajectories. The scope of the task-specific attention is easily adjusted by showing demonstrations with distractor objects or with diverse relevant objects. Our results indicate that this approach exhibits good generalization across object instances using very few samples, and can be used to learn a variety of manipulation tasks using reinforcement learning

    Generalizable deep learning based medical image segmentation

    Get PDF
    Deep learning is revolutionizing medical image analysis and interpretation. However, its real-world deployment is often hindered by the poor generalization to unseen domains (new imaging modalities and protocols). This lack of generalization ability is further exacerbated by the scarcity of labeled datasets for training: Data collection and annotation can be prohibitively expensive in terms of labor and costs because label quality heavily dependents on the expertise of radiologists. Additionally, unreliable predictions caused by poor model generalization pose safety risks to clinical downstream applications. To mitigate labeling requirements, we investigate and develop a series of techniques to strengthen the generalization ability and the data efficiency of deep medical image computing models. We further improve model accountability and identify unreliable predictions made on out-of-domain data, by designing probability calibration techniques. In the first and the second part of thesis, we discuss two types of problems for handling unexpected domains: unsupervised domain adaptation and single-source domain generalization. For domain adaptation we present a data-efficient technique that adapts a segmentation model trained on a labeled source domain (e.g., MRI) to an unlabeled target domain (e.g., CT), using a small number of unlabeled training images from the target domain. For domain generalization, we focus on both image reconstruction and segmentation. For image reconstruction, we design a simple and effective domain generalization technique for cross-domain MRI reconstruction, by reusing image representations learned from natural image datasets. For image segmentation, we perform causal analysis of the challenging cross-domain image segmentation problem. Guided by this causal analysis we propose an effective data-augmentation-based generalization technique for single-source domains. The proposed method outperforms existing approaches on a large variety of cross-domain image segmentation scenarios. In the third part of the thesis, we present a novel self-supervised method for learning generic image representations that can be used to analyze unexpected objects of interest. The proposed method is designed together with a novel few-shot image segmentation framework that can segment unseen objects of interest by taking only a few labeled examples as references. Superior flexibility over conventional fully-supervised models is demonstrated by our few-shot framework: it does not require any fine-tuning on novel objects of interest. We further build a publicly available comprehensive evaluation environment for few-shot medical image segmentation. In the fourth part of the thesis, we present a novel probability calibration model. To ensure safety in clinical settings, a deep model is expected to be able to alert human radiologists if it has low confidence, especially when confronted with out-of-domain data. To this end we present a plug-and-play model to calibrate prediction probabilities on out-of-domain data. It aligns the prediction probability in line with the actual accuracy on the test data. We evaluate our method on both artifact-corrupted images and images from an unforeseen MRI scanning protocol. Our method demonstrates improved calibration accuracy compared with the state-of-the-art method. Finally, we summarize the major contributions and limitations of our works. We also suggest future research directions that will benefit from the works in this thesis.Open Acces

    A critical analysis of self-supervision, or what we can learn from a single image

    Full text link
    We look critically at popular self-supervision techniques for learning deep convolutional neural networks without manual labels. We show that three different and representative methods, BiGAN, RotNet and DeepCluster, can learn the first few layers of a convolutional network from a single image as well as using millions of images and manual labels, provided that strong data augmentation is used. However, for deeper layers the gap with manual supervision cannot be closed even if millions of unlabelled images are used for training. We conclude that: (1) the weights of the early layers of deep networks contain limited information about the statistics of natural images, that (2) such low-level statistics can be learned through self-supervision just as well as through strong supervision, and that (3) the low-level statistics can be captured via synthetic transformations instead of using a large image dataset.Comment: Accepted paper at the International Conference on Learning Representations (ICLR) 202

    Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation

    Full text link
    Manipulation of deformable objects, such as ropes and cloth, is an important but challenging problem in robotics. We present a learning-based system where a robot takes as input a sequence of images of a human manipulating a rope from an initial to goal configuration, and outputs a sequence of actions that can reproduce the human demonstration, using only monocular images as input. To perform this task, the robot learns a pixel-level inverse dynamics model of rope manipulation directly from images in a self-supervised manner, using about 60K interactions with the rope collected autonomously by the robot. The human demonstration provides a high-level plan of what to do and the low-level inverse model is used to execute the plan. We show that by combining the high and low-level plans, the robot can successfully manipulate a rope into a variety of target shapes using only a sequence of human-provided images for direction.Comment: 8 pages, accepted to International Conference on Robotics and Automation (ICRA) 201
    corecore