Real-Time Grasp Detection Using Convolutional Neural Networks
We present an accurate, real-time approach to robotic grasp detection based
on convolutional neural networks. Our network performs single-stage regression
to graspable bounding boxes without using standard sliding window or region
proposal techniques. The model outperforms state-of-the-art approaches by 14
percentage points and runs at 13 frames per second on a GPU. Our network can
simultaneously perform classification so that in a single step it recognizes
the object and finds a good grasp rectangle. A modification to this model
predicts multiple grasps per object by using a locally constrained prediction
mechanism. The locally constrained model performs significantly better,
especially on objects that can be grasped in a variety of ways.
Comment: Accepted to ICRA 2015
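As a concrete picture of the single-stage formulation, the model can be read
as a shared convolutional backbone with two small heads: one regressing the
grasp rectangle and one classifying the object, both produced in a single
forward pass. A minimal PyTorch sketch under that reading; the backbone,
layer sizes, and the five-parameter (x, y, w, h, theta) rectangle encoding
are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SingleStageGraspNet(nn.Module):
    """Sketch: one forward pass yields a grasp rectangle and a class label."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Illustrative backbone, not the paper's network.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Regression head: (x, y, w, h, theta) of a graspable rectangle.
        self.grasp_head = nn.Linear(64, 5)
        # Classification head: object category, predicted in the same step.
        self.class_head = nn.Linear(64, num_classes)

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)
        return self.grasp_head(feats), self.class_head(feats)

# Usage: a single forward pass, no sliding windows or region proposals.
net = SingleStageGraspNet(num_classes=10)
rect, logits = net(torch.randn(1, 3, 224, 224))
```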
Geometry-Based Next Frame Prediction from Monocular Video
We consider the problem of next frame prediction from video input. A
recurrent convolutional neural network is trained to predict depth from
monocular video input, which, along with the current video image and the camera
trajectory, can then be used to compute the next frame. Unlike prior next-frame
prediction approaches, we take advantage of the scene geometry and use the
predicted depth for generating the next frame prediction. Our approach can
produce rich next frame predictions which include depth information attached to
each pixel. Another novel aspect of our approach is that it predicts depth from
a sequence of images (e.g. in a video), rather than from a single still image.
We evaluate the proposed approach on the KITTI dataset, a standard dataset for
benchmarking tasks relevant to autonomous driving. The proposed method produces
results which are visually and numerically superior to existing methods that
directly predict the next frame. We show that the accuracy of depth prediction
improves as more prior frames are considered.
Comment: To appear in 2017 IEEE Intelligent Vehicles Symposium
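The geometric core of the approach — computing the next frame from the
current image, the predicted depth, and the camera trajectory — can be
sketched as back-projecting each pixel with its depth, transforming it by
the relative camera pose, and re-projecting it into the next view. A minimal
NumPy sketch assuming a pinhole camera with intrinsics K and a 4x4 relative
pose T; occlusion handling and proper resampling are omitted:

```python
import numpy as np

def predict_next_frame(frame, depth, K, T):
    """Forward-warp `frame` (h x w x 3) using per-pixel `depth` (h x w),
    intrinsics K (3x3), and relative camera pose T (4x4)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3xN

    # Back-project pixels to 3D camera coordinates using the depth map.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)

    # Move the points into the next camera's frame via the known trajectory.
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    pts_next = (T @ pts_h)[:3]

    # Re-project into the next image plane.
    proj = K @ pts_next
    z = proj[2]
    un = np.round(proj[0] / np.maximum(z, 1e-6)).astype(int)
    vn = np.round(proj[1] / np.maximum(z, 1e-6)).astype(int)

    # Scatter source pixels to their new locations (nearest-neighbor splat).
    out = np.zeros_like(frame)
    valid = (z > 0) & (un >= 0) & (un < w) & (vn >= 0) & (vn < h)
    out[vn[valid], un[valid]] = frame.reshape(-1, frame.shape[-1])[valid]
    return out
```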
Visual Prediction of Rover Slip: Learning Algorithms and Field Experiments
Perception of the surrounding environment is an essential tool for intelligent navigation in any autonomous vehicle. In the context of Mars exploration, there is a strong motivation to enhance the perception of the rovers beyond geometry-based obstacle avoidance, so as to be able to predict potential interactions with the terrain. In this thesis we propose to remotely predict the amount of slip, which reflects the mobility of the vehicle on future terrain. The method is based on learning from experience and uses visual information from stereo imagery as input. We test the algorithm on several robot platforms and in different terrains. We also demonstrate its usefulness in an integrated system, onboard a Mars prototype rover in the JPL Mars Yard.
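One way to picture the learned slip model: recognize the terrain type from
visual features, then apply a terrain-specific slip model, all trained on the
rover's own past traversals. A hedged sketch of that two-stage idea; the
feature representation, the nearest-neighbor terrain recognizer, and the
quadratic slip-versus-slope fit below are placeholders, not the thesis's
exact models:

```python
import numpy as np

class SlipPredictor:
    """Sketch of two-stage slip prediction: recognize the terrain type from
    visual features, then apply a terrain-specific slip-vs-slope model."""
    def __init__(self):
        self.train_feats = None   # visual terrain features (N x D)
        self.train_types = None   # terrain labels from past traversals
        self.slip_models = {}     # terrain type -> slope-to-slip fit

    def fit(self, feats, types, slopes, slips):
        self.train_feats, self.train_types = feats, types
        for t in np.unique(types):
            m = types == t
            # Illustrative choice: quadratic fit of slip against slope.
            self.slip_models[t] = np.polyfit(slopes[m], slips[m], deg=2)

    def predict(self, feat, slope):
        # Nearest-neighbor terrain recognition in feature space.
        i = np.argmin(np.linalg.norm(self.train_feats - feat, axis=1))
        coeffs = self.slip_models[self.train_types[i]]
        return float(np.polyval(coeffs, slope))
```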
Another desirable capability for an autonomous robot is to be able to learn about its interactions with the environment in a fully automatic fashion. We propose an algorithm which uses the robot's sensors as supervision for vision-based learning of different terrain types. This algorithm can work with noisy and ambiguous signals provided by onboard sensors. To cope with rich, high-dimensional visual representations, we propose a novel, nonlinear dimensionality reduction technique which exploits automatic supervision. The method is the first to consider supervised nonlinear dimensionality reduction in a probabilistic framework using supervision which can be noisy or ambiguous.
Finally, we consider the problem of learning to recognize different terrains under the time constraints of an onboard autonomous system. We propose a method which automatically learns a variable-length feature representation depending on the complexity of the classification task. The proposed approach achieves a good trade-off between reduced computation time and recognition performance.
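The variable-length feature representation can be read as an early-exit
cascade: cheap features are computed first, and costlier ones only when the
classifier remains uncertain. A minimal sketch under that reading, with an
assumed (feature_fn, classifier) stage interface and an illustrative
confidence threshold:

```python
import numpy as np

def cascade_classify(patch, stages, confidence=0.9):
    """Evaluate feature stages in order of cost; stop as soon as one
    stage's classifier is confident enough. `stages` is a non-empty list
    of (feature_fn, classifier) pairs, cheapest first (assumed API)."""
    feats = []
    for feature_fn, clf in stages:
        feats.append(feature_fn(patch))
        probs = clf(np.concatenate(feats))  # class posterior estimate
        if probs.max() >= confidence:
            break  # confident: skip the remaining, costlier features
    return int(np.argmax(probs))
```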
Joint Adaptive Representations for Image-Language Learning
Image-language learning has made unprecedented progress in visual
understanding. These developments have come at high costs, as contemporary
vision-language models require large model scales and amounts of data. We here
propose a much easier recipe for image-language learning, which produces
effective models, outperforming bigger and more expensive ones, often trained
on orders of magnitude larger datasets. Our key finding is the joint learning
of a compact vision and language representation, which adaptively and
iteratively fuses the multi-modal features. This makes image-language
learning more effective, greatly lowering FLOPs by combining and reducing
the number of tokens for both text and images: for example, a 33% reduction
in FLOPs is achieved compared to the baseline fusion techniques used by
popular image-language models, while performance improves. This also allows the model
to scale without a large increase in FLOPs or memory. In addition, we propose
adaptive pre-training data sampling which improves the data efficiency. The
proposed approach achieves competitive performance compared to much larger
models, and does so with significantly less data and FLOPs. With only 40M
training examples and 39 GFLOPs, our lightweight model outperforms much
larger state-of-the-art models with 2-20x more FLOPs, trained on bigger
datasets, some with close to 1B training examples.
Comment: T4V Workshop
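The token-reduction idea behind the FLOPs savings can be pictured as fusing
image and text tokens into a much shorter joint sequence that later layers
operate on. A minimal PyTorch sketch assuming learned query tokens and a
single cross-attention step; the paper's actual adaptive, iterative fusion
mechanism may differ:

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Sketch: compress image + text tokens into a short joint sequence
    with learned queries and cross-attention (token-count reduction)."""
    def __init__(self, dim: int, num_joint_tokens: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_joint_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image_tokens, text_tokens):
        # (B, N_img + N_txt, D) -> (B, num_joint_tokens, D)
        kv = torch.cat([image_tokens, text_tokens], dim=1)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.attn(q, kv, kv)
        return fused  # later layers operate on far fewer tokens

# Usage: 196 image + 64 text tokens are reduced to 32 joint tokens.
fusion = AdaptiveFusion(dim=256)
out = fusion(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
```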
Pruning training sets for learning of object categories
Training datasets for learning of object categories are often contaminated or imperfect. We explore an approach to automatically identify examples that are noisy or troublesome for learning and exclude them from the training set. The problem is relevant to learning in semi-supervised or unsupervised settings, as well as to learning when the training data is contaminated with wrongly labeled examples or when correctly labeled but hard-to-learn examples are present. We propose a fully automatic mechanism for noise cleaning, called 'data pruning', and demonstrate its success on learning of human faces. It is not assumed that the data or the noise can be modeled or that additional training examples are available. Our experiments show that data pruning can improve generalization performance for algorithms with varying robustness to noise. It outperforms methods with regularization properties and is superior to commonly applied aggregation methods, such as bagging.
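The data pruning loop can be sketched as: score each training example by how
poorly held-out models fit it, drop the worst-scoring fraction, and retrain
on the rest. A minimal sketch assuming cross-validated loss as the noise
signal; the pruning fraction and the caller-supplied train/loss functions
are placeholders for the paper's own criterion:

```python
import numpy as np

def prune_training_set(X, y, train_fn, loss_fn, prune_frac=0.1, n_folds=5):
    """Score each example by its held-out loss and drop the worst fraction.
    `train_fn(X, y)` returns a fitted model; `loss_fn(model, x, label)`
    returns a scalar loss (both supplied by the caller); X, y are arrays."""
    n = len(y)
    losses = np.empty(n)
    folds = np.array_split(np.random.permutation(n), n_folds)
    for fold in folds:
        mask = np.ones(n, dtype=bool)
        mask[fold] = False
        model = train_fn(X[mask], y[mask])   # fit on the other folds
        for i in fold:
            losses[i] = loss_fn(model, X[i], y[i])
    # Keep the examples the models find easiest to fit out-of-fold.
    keep = np.argsort(losses)[: int(n * (1 - prune_frac))]
    return X[keep], y[keep]
```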