64 research outputs found

    Real-Time Grasp Detection Using Convolutional Neural Networks

    We present an accurate, real-time approach to robotic grasp detection based on convolutional neural networks. Our network performs single-stage regression to graspable bounding boxes without using standard sliding window or region proposal techniques. The model outperforms state-of-the-art approaches by 14 percentage points and runs at 13 frames per second on a GPU. Our network can simultaneously perform classification so that in a single step it recognizes the object and finds a good grasp rectangle. A modification to this model predicts multiple grasps per object by using a locally constrained prediction mechanism. The locally constrained model performs significantly better, especially on objects that can be grasped in a variety of ways. Comment: Accepted to ICRA 201

    Geometry-Based Next Frame Prediction from Monocular Video

    We consider the problem of next frame prediction from video input. A recurrent convolutional neural network is trained to predict depth from monocular video input, which, along with the current video image and the camera trajectory, can then be used to compute the next frame. Unlike prior next-frame prediction approaches, we take advantage of the scene geometry and use the predicted depth for generating the next frame prediction. Our approach can produce rich next frame predictions which include depth information attached to each pixel. Another novel aspect of our approach is that it predicts depth from a sequence of images (e.g. in a video), rather than from a single still image. We evaluate the proposed approach on the KITTI dataset, a standard dataset for benchmarking tasks relevant to autonomous driving. The proposed method produces results which are visually and numerically superior to existing methods that directly predict the next frame. We show that the accuracy of depth prediction improves as more prior frames are considered. Comment: To appear in 2017 IEEE Intelligent Vehicles Symposium
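
    The geometric step described above (predicted depth plus camera motion giving the next frame) can be illustrated with a standard pinhole reprojection. This is a minimal sketch, not the authors' implementation: the abstract does not specify the warp, and the function name `reproject_pixels`, the intrinsics `K`, and the pose convention (`R`, `t` map current-camera coordinates into next-camera coordinates) are assumptions made for illustration.

```python
import numpy as np

def reproject_pixels(depth, K, R, t):
    """Map every pixel of the current frame to its position in the next frame.

    Assumed conventions (not from the abstract): a pinhole camera with
    intrinsics K, and a rigid motion (R, t) taking 3D points from the
    current camera frame to the next camera frame.
    Returns (uv_next, z_next): an (H, W, 2) array of pixel coordinates in
    the next frame and an (H, W) array of depths in the next frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Homogeneous pixel coordinates, shape 3 x N.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T.astype(float)
    # Back-project each pixel to a 3D point using its depth.
    points = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Move the points into the next camera frame.
    points_next = R @ points + np.asarray(t, dtype=float).reshape(3, 1)
    # Project back to the image plane.
    proj = K @ points_next
    uv_next = (proj[:2] / proj[2:3]).T.reshape(h, w, 2)
    z_next = points_next[2].reshape(h, w)
    return uv_next, z_next
```

    With an identity pose every pixel maps to itself. With a pure sideways translation and constant depth, every pixel shifts horizontally by the focal length times the translation divided by the depth, and each reprojected pixel carries a depth value, in line with the abstract's note that depth is attached to each predicted pixel.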

    Visual Prediction of Rover Slip: Learning Algorithms and Field Experiments

    Perception of the surrounding environment is an essential tool for intelligent navigation in any autonomous vehicle. In the context of Mars exploration, there is a strong motivation to enhance the perception of the rovers beyond geometry-based obstacle avoidance, so as to be able to predict potential interactions with the terrain. In this thesis we propose to remotely predict the amount of slip, which reflects the mobility of the vehicle on future terrain. The method is based on learning from experience and uses visual information from stereo imagery as input. We test the algorithm on several robot platforms and in different terrains. We also demonstrate its usefulness in an integrated system, onboard a Mars prototype rover in the JPL Mars Yard. Another desirable capability for an autonomous robot is to be able to learn about its interactions with the environment in a fully automatic fashion. We propose an algorithm which uses the robot's sensors as supervision for vision-based learning of different terrain types. This algorithm can work with noisy and ambiguous signals provided from onboard sensors. To be able to cope with rich, high-dimensional visual representations we propose a novel, nonlinear dimensionality reduction technique which exploits automatic supervision. The method is the first to consider supervised nonlinear dimensionality reduction in a probabilistic framework using supervision which can be noisy or ambiguous. Finally, we consider the problem of learning to recognize different terrains, which addresses the time constraints of an onboard autonomous system. We propose a method which automatically learns a variable-length feature representation depending on the complexity of the classification task. The proposed approach achieves a good trade-off between decrease in computational time and recognition performance.

    Joint Adaptive Representations for Image-Language Learning

    Image-language learning has made unprecedented progress in visual understanding. These developments have come at high costs, as contemporary vision-language models require large model scales and amounts of data. We here propose a much easier recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets. Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features. This results in more effective image-language learning, greatly lowering the FLOPs by combining and reducing the number of tokens for both text and images, e.g. a 33% reduction in FLOPs is achieved, compared to baseline fusion techniques used by popular image-language models, while improving performance. This also allows the model to scale without a large increase in FLOPs or memory. In addition, we propose adaptive pre-training data sampling which improves the data efficiency. The proposed approach achieves competitive performance compared to much larger models, and does so with significantly less data and FLOPs. With only 40M training examples and with 39 GFLOPs our lightweight model outperforms many times larger state-of-the-art models that use 2-20x more FLOPs and bigger datasets, some with close to 1B training examples. Comment: T4V Workshop

    Pruning training sets for learning of object categories

    Training datasets for learning of object categories are often contaminated or imperfect. We explore an approach to automatically identify examples that are noisy or troublesome for learning and exclude them from the training set. The problem is relevant to learning in semi-supervised or unsupervised settings, as well as to learning when the training data is contaminated with wrongly labeled examples or when correctly labeled but hard-to-learn examples are present. We propose a fully automatic mechanism for noise cleaning, called ’data pruning’, and demonstrate its success on learning of human faces. It is not assumed that the data or the noise can be modeled or that additional training examples are available. Our experiments show that data pruning can improve generalization performance for algorithms with varying robustness to noise. It outperforms methods with regularization properties and is superior to commonly applied aggregation methods, such as bagging.
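
    The general idea of excluding troublesome examples before training can be sketched as follows. The abstract does not describe how the paper's 'data pruning' mechanism decides which examples to drop, so this sketch swaps in a simple stand-in: leave-one-out nearest-neighbor label disagreement. The function name `prune_by_neighbor_disagreement`, the parameter `k`, and the 0.5 agreement threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def prune_by_neighbor_disagreement(X, y, k=3):
    """Return a boolean mask of training examples to keep.

    Stand-in heuristic (not the paper's method): an example is dropped
    when fewer than half of its k nearest neighbors, excluding itself,
    share its label.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances, shape n x n.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # An example must not count as its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest neighbors of each example.
    nearest = np.argsort(dist, axis=1)[:, :k]
    # Fraction of neighbors whose label matches the example's label.
    agreement = (y[nearest] == y[:, None]).mean(axis=1)
    return agreement >= 0.5
```

    A classifier would then be trained only on `X[keep]` and `y[keep]`. Like the setting in the abstract, this heuristic needs no noise model and no additional clean examples, but it is only one of many possible pruning criteria.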