62,924 research outputs found

    Deep Structured Models for Large Scale Object Co-detection and Segmentation

    Get PDF
    Structured decisions are often required for a large variety of image and scene understanding tasks in computer vision, with few of them being object detection, localization, semantic segmentation and many more. Structured prediction deals with learning inherent structure by incorporating contextual information from several images and multiple tasks. However, it is very challenging when dealing with large scale image datasets where performance is limited by high computational costs and expressive power of the underlying representation learning techniques. In this thesis, we present efficient and effective deep structured models for context-aware object detection, co-localization and instance-level semantic segmentation. First, we introduce a principled formulation for object co-detection using a fully-connected conditional random field (CRF). We build an explicit graph whose vertices represent object candidates (instead of pixel values) and edges encode the object similarity via simple, yet effective pairwise potentials. More specifically, we design a weighted mixture of Gaussian kernels for class-specific object similarity, and formulate kernel weights estimation as a least-squares regression problem. Its solution can therefore be obtained in closed-form. Furthermore, in contrast with traditional co-detection approaches, it has been shown that inference in such fully-connected CRFs can be performed efficiently using an approximate mean-field method with high-dimensional Gaussian filtering. This lets us effectively leverage information in multiple images. Next, we extend our class-specific co-detection framework to multiple object categories. We model object candidates with rich, high-dimensional features learned using a deep convolutional neural network. In particular, our max-margin and directloss structural boosting algorithms enable us to learn the most suitable features that best encode pairwise similarity relationships within our CRF framework. Furthermore, it guarantees that the time and space complexity is O(n t) where n is the total number of candidate boxes in the pool and t the number of mean-field iterations. Moreover, our experiments evidence the importance of learning rich similarity measures to account for the contextual relations across object classes and instances. However, all these methods are based on precomputed object candidates (or proposals), thus localization performance is limited by the quality of bounding-boxes. To address this, we present an efficient object proposal co-generation technique that leverages the collective power of multiple images. In particular, we design a deep neural network layer that takes unary and pairwise features as input, builds a fully-connected CRF and produces mean-field marginals as output. It also lets us backpropagate the gradient through entire network by unrolling the iterations of CRF inference. Furthermore, this layer simplifies the end-to-end learning, thus effectively benefiting from multiple candidates to co-generate high-quality object proposals. Finally, we develop a multi-task strategy to jointly learn object detection, localization and instance-level semantic segmentation in a single network. In particular, we introduce a novel representation based on the distance transform of the object masks. To this end, we design a new residual-deconvolution architecture that infers such a representation and decodes it into the final binary object mask. We show that the predicted masks can go beyond the scope of the bounding boxes and that the multiple tasks can benefit from each other. In summary, in this thesis, we exploit the joint power of multiple images as well as multiple tasks to improve generalization performance of structured learning. Our novel deep structured models, similarity learning techniques and residual-deconvolution architecture can be used to make accurate and reliable inference for key vision tasks. Furthermore, our quantitative and qualitative experiments on large scale challenging image datasets demonstrate the superiority of the proposed approaches over the state-of-the-art methods

    Fast, Exact and Multi-Scale Inference for Semantic Image Segmentation with Deep Gaussian CRFs

    Get PDF
    In this work we propose a structured prediction technique that combines the virtues of Gaussian Conditional Random Fields (G-CRF) with Deep Learning: (a) our structured prediction task has a unique global optimum that is obtained exactly from the solution of a linear system (b) the gradients of our model parameters are analytically computed using closed form expressions, in contrast to the memory-demanding contemporary deep structured prediction approaches that rely on back-propagation-through-time, (c) our pairwise terms do not have to be simple hand-crafted expressions, as in the line of works building on the DenseCRF, but can rather be `discovered' from data through deep architectures, and (d) out system can trained in an end-to-end manner. Building on standard tools from numerical analysis we develop very efficient algorithms for inference and learning, as well as a customized technique adapted to the semantic segmentation task. This efficiency allows us to explore more sophisticated architectures for structured prediction in deep learning: we introduce multi-resolution architectures to couple information across scales in a joint optimization framework, yielding systematic improvements. We demonstrate the utility of our approach on the challenging VOC PASCAL 2012 image segmentation benchmark, showing substantial improvements over strong baselines. We make all of our code and experiments available at {https://github.com/siddharthachandra/gcrf}Comment: Our code is available at https://github.com/siddharthachandra/gcr

    Exploring Context with Deep Structured models for Semantic Segmentation

    Full text link
    State-of-the-art semantic image segmentation methods are mostly based on training deep convolutional neural networks (CNNs). In this work, we proffer to improve semantic segmentation with the use of contextual information. In particular, we explore `patch-patch' context and `patch-background' context in deep CNNs. We formulate deep structured models by combining CNNs and Conditional Random Fields (CRFs) for learning the patch-patch context between image regions. Specifically, we formulate CNN-based pairwise potential functions to capture semantic correlations between neighboring patches. Efficient piecewise training of the proposed deep structured model is then applied in order to avoid repeated expensive CRF inference during the course of back propagation. For capturing the patch-background context, we show that a network design with traditional multi-scale image inputs and sliding pyramid pooling is very effective for improving performance. We perform comprehensive evaluation of the proposed method. We achieve new state-of-the-art performance on a number of challenging semantic segmentation datasets including NYUDv2NYUDv2, PASCALPASCAL-VOC2012VOC2012, CityscapesCityscapes, PASCALPASCAL-ContextContext, SUNSUN-RGBDRGBD, SIFTSIFT-flowflow, and KITTIKITTI datasets. Particularly, we report an intersection-over-union score of 77.877.8 on the PASCALPASCAL-VOC2012VOC2012 dataset.Comment: 16 pages. Accepted to IEEE T. Pattern Analysis & Machine Intelligence, 2017. Extended version of arXiv:1504.0101

    Deeply Learning the Messages in Message Passing Inference

    Full text link
    Deep structured output learning shows great promise in tasks like semantic image segmentation. We proffer a new, efficient deep structured model learning scheme, in which we show how deep Convolutional Neural Networks (CNNs) can be used to estimate the messages in message passing inference for structured prediction with Conditional Random Fields (CRFs). With such CNN message estimators, we obviate the need to learn or evaluate potential functions for message calculation. This confers significant efficiency for learning, since otherwise when performing structured learning for a CRF with CNN potentials it is necessary to undertake expensive inference for every stochastic gradient iteration. The network output dimension for message estimation is the same as the number of classes, in contrast to the network output for general CNN potential functions in CRFs, which is exponential in the order of the potentials. Hence CNN message learning has fewer network parameters and is more scalable for cases that a large number of classes are involved. We apply our method to semantic image segmentation on the PASCAL VOC 2012 dataset. We achieve an intersection-over-union score of 73.4 on its test set, which is the best reported result for methods using the VOC training images alone. This impressive performance demonstrates the effectiveness and usefulness of our CNN message learning method.Comment: 11 pages. Appearing in Proc. The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015, Montreal, Canad

    Discriminative Training of Deep Fully-connected Continuous CRF with Task-specific Loss

    Full text link
    Recent works on deep conditional random fields (CRF) have set new records on many vision tasks involving structured predictions. Here we propose a fully-connected deep continuous CRF model for both discrete and continuous labelling problems. We exemplify the usefulness of the proposed model on multi-class semantic labelling (discrete) and the robust depth estimation (continuous) problems. In our framework, we model both the unary and the pairwise potential functions as deep convolutional neural networks (CNN), which are jointly learned in an end-to-end fashion. The proposed method possesses the main advantage of continuously-valued CRF, which is a closed-form solution for the Maximum a posteriori (MAP) inference. To better adapt to different tasks, instead of using the commonly employed maximum likelihood CRF parameter learning protocol, we propose task-specific loss functions for learning the CRF parameters. It enables direct optimization of the quality of the MAP estimates during the course of learning. Specifically, we optimize the multi-class classification loss for the semantic labelling task and the Turkey's biweight loss for the robust depth estimation problem. Experimental results on the semantic labelling and robust depth estimation tasks demonstrate that the proposed method compare favorably against both baseline and state-of-the-art methods. In particular, we show that although the proposed deep CRF model is continuously valued, with the equipment of task-specific loss, it achieves impressive results even on discrete labelling tasks
    • …
    corecore