Generic 3D Representation via Pose Estimation and Matching
Though a large body of computer vision research has investigated developing
generic semantic representations, efforts towards developing a similar
representation for 3D have been limited. In this paper, we learn a generic 3D
representation through solving a set of foundational proxy 3D tasks:
object-centric camera pose estimation and wide baseline feature matching. Our
method is based upon the premise that by providing supervision over a set of
carefully selected foundational tasks, generalization to novel tasks and
abstraction capabilities can be achieved. We empirically show that the internal
representation of a multi-task ConvNet trained to solve the above core problems
generalizes to novel 3D tasks (e.g., scene layout estimation, object pose
estimation, surface normal estimation) without the need for fine-tuning and
shows traits of abstraction abilities (e.g., cross-modality pose estimation).
In the context of the core supervised tasks, we demonstrate our representation
achieves state-of-the-art wide baseline feature matching results without
requiring a priori rectification (unlike SIFT and the majority of learned
features). We also show 6DOF camera pose estimation given a pair of local image
patches. The accuracy on both supervised tasks is comparable to human performance.
Finally, we contribute a large-scale dataset composed of object-centric street
view scenes along with point correspondences and camera pose information, and
conclude with a discussion on the learned representation and open research
questions. Comment: Published in ECCV16. See the project website
http://3drepresentation.stanford.edu/ and dataset website
https://github.com/amir32002/3D_Street_Vie
Deep Object-Centric Representations for Generalizable Robot Learning
Robotic manipulation in complex open-world scenarios requires both reliable
physical manipulation skills and effective and generalizable perception. In
this paper, we propose a method where general purpose pretrained visual models
serve as an object-centric prior for the perception system of a learned policy.
We devise an object-level attentional mechanism that can be used to determine
relevant objects from a few trajectories or demonstrations, and then
immediately incorporate those objects into a learned policy. A task-independent
meta-attention locates possible objects in the scene, and a task-specific
attention identifies which objects are predictive of the trajectories. The
scope of the task-specific attention is easily adjusted by showing
demonstrations with distractor objects or with diverse relevant objects. Our
results indicate that this approach exhibits good generalization across object
instances using very few samples, and can be used to learn a variety of
manipulation tasks using reinforcement learning.
Generalizable deep learning based medical image segmentation
Deep learning is revolutionizing medical image analysis and interpretation. However, its real-world deployment is often hindered by poor generalization to unseen domains (new imaging modalities and protocols). This lack of generalization ability is further exacerbated by the scarcity of labeled datasets for training: data collection and annotation can be prohibitively expensive in terms of labor and cost because label quality depends heavily on the expertise of radiologists. Additionally, unreliable predictions caused by poor model generalization pose safety risks to downstream clinical applications.
To mitigate labeling requirements, we investigate and develop a series of techniques to strengthen the generalization ability and the data efficiency of deep medical image computing models. We further improve model accountability and identify unreliable predictions made on out-of-domain data, by designing probability calibration techniques.
In the first and second parts of the thesis, we discuss two types of problems for handling unexpected domains: unsupervised domain adaptation and single-source domain generalization. For domain adaptation, we present a data-efficient technique that adapts a segmentation model trained on a labeled source domain (e.g., MRI) to an unlabeled target domain (e.g., CT), using a small number of unlabeled training images from the target domain.
For domain generalization, we focus on both image reconstruction and segmentation. For image reconstruction, we design a simple and effective domain generalization technique for cross-domain MRI reconstruction, by reusing image representations learned from natural image datasets. For image segmentation, we perform a causal analysis of the challenging cross-domain image segmentation problem. Guided by this causal analysis, we propose an effective data-augmentation-based generalization technique for single-source domains. The proposed method outperforms existing approaches on a large variety of cross-domain image segmentation scenarios.
In the third part of the thesis, we present a novel self-supervised method for learning generic image representations that can be used to analyze unexpected objects of interest. The proposed method is designed together with a novel few-shot image segmentation framework that can segment unseen objects of interest by taking only a few labeled examples as references. Superior flexibility over conventional fully-supervised models is demonstrated by our few-shot framework: it does not require any fine-tuning on novel objects of interest. We further build a publicly available comprehensive evaluation environment for few-shot medical image segmentation.
In the fourth part of the thesis, we present a novel probability calibration model. To ensure safety in clinical settings, a deep model is expected to alert human radiologists when it has low confidence, especially when confronted with out-of-domain data. To this end, we present a plug-and-play model to calibrate prediction probabilities on out-of-domain data. It aligns the prediction probability with the actual accuracy on the test data. We evaluate our method on both artifact-corrupted images and images from an unforeseen MRI scanning protocol. Our method demonstrates improved calibration accuracy compared with the state-of-the-art method.
Finally, we summarize the major contributions and limitations of our work. We also suggest future research directions that will benefit from the work in this thesis. Open Acces
A critical analysis of self-supervision, or what we can learn from a single image
We look critically at popular self-supervision techniques for learning deep
convolutional neural networks without manual labels. We show that three
different and representative methods, BiGAN, RotNet and DeepCluster, can learn
the first few layers of a convolutional network from a single image as well as
using millions of images and manual labels, provided that strong data
augmentation is used. However, for deeper layers the gap with manual
supervision cannot be closed even if millions of unlabelled images are used for
training. We conclude that: (1) the weights of the early layers of deep
networks contain limited information about the statistics of natural images,
that (2) such low-level statistics can be learned through self-supervision just
as well as through strong supervision, and that (3) the low-level statistics
can be captured via synthetic transformations instead of using a large image
dataset. Comment: Accepted paper at the International Conference on Learning
Representations (ICLR) 202
Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation
Manipulation of deformable objects, such as ropes and cloth, is an important
but challenging problem in robotics. We present a learning-based system where a
robot takes as input a sequence of images of a human manipulating a rope from
an initial to a goal configuration, and outputs a sequence of actions that can
reproduce the human demonstration, using only monocular images as input. To
perform this task, the robot learns a pixel-level inverse dynamics model of
rope manipulation directly from images in a self-supervised manner, using about
60K interactions with the rope collected autonomously by the robot. The human
demonstration provides a high-level plan of what to do and the low-level
inverse model is used to execute the plan. We show that by combining the high-
and low-level plans, the robot can successfully manipulate a rope into a
variety of target shapes using only a sequence of human-provided images for
direction. Comment: 8 pages, accepted to International Conference on Robotics and
Automation (ICRA) 201