80 research outputs found

    Deep Learning for Semantic Part Segmentation with High-Level Guidance

    In this work we address the task of segmenting an object into its parts, or semantic part segmentation. We start by adapting a state-of-the-art semantic segmentation system to this task, and show that a fully-convolutional deep CNN coupled with Dense CRF labelling provides excellent results for a broad range of object categories. Still, this approach remains agnostic to high-level constraints between object parts. We introduce such prior information by means of a Restricted Boltzmann Machine, adapted to our task, and train our model in a discriminative fashion, as a hidden CRF, demonstrating that prior information can yield additional improvements. We also investigate the performance of our approach "in the wild", without information about the objects' bounding boxes, using an object detector to guide a multi-scale segmentation scheme. We evaluate our approach on the Penn-Fudan and LFW datasets for the tasks of pedestrian parsing and face labelling respectively, and show superior performance with respect to competitive methods that have been extensively engineered on these benchmarks, as well as realistic qualitative results on part segmentation, even for occluded or deformable objects. We also provide quantitative and extensive qualitative results on three classes from the PASCAL Parts dataset. Finally, we show that our multi-scale segmentation scheme can boost accuracy, recovering segmentations for finer parts.
    Comment: 11 pages (including references), 3 figures, 2 tables
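    The Dense CRF refinement step above can be sketched with the pydensecrf package, an open-source implementation of the fully-connected CRF of Krähenbühl and Koltun rather than the authors' own code; the image and the CNN part probabilities below are random stand-ins, and the pairwise parameters are illustrative guesses:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

H, W, n_parts = 64, 64, 7
rng = np.random.default_rng(0)
image = rng.integers(0, 256, (H, W, 3), dtype=np.uint8)   # stand-in for the input image

# Stand-in for the CNN's softmax output over part labels, shape (n_parts, H, W).
probs = rng.random((n_parts, H, W)).astype(np.float32)
probs /= probs.sum(axis=0, keepdims=True)

crf = dcrf.DenseCRF2D(W, H, n_parts)
crf.setUnaryEnergy(unary_from_softmax(probs))             # CNN scores as unary terms
crf.addPairwiseGaussian(sxy=3, compat=3)                  # spatial smoothness
crf.addPairwiseBilateral(sxy=60, srgb=10,                 # appearance-sensitive term
                         rgbim=np.ascontiguousarray(image), compat=5)
Q = np.array(crf.inference(5))                            # 5 mean-field iterations
labels = Q.argmax(axis=0).reshape(H, W)                   # refined per-pixel part map
```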

    Multi-Class Semantic Segmentation of Faces

    In this paper the problem of multi-class face segmentation is introduced. Unlike previous works, which only consider a few classes - typically skin and hair - the label set is extended here to six categories: skin, hair, eyes, nose, mouth and background. A dataset with 70 images taken from the MIT-CBCL and FEI face databases is manually annotated and made publicly available. Three kinds of local features - accounting for color, shape and location - are extracted from uniformly sampled square patches. A discriminative model is built with random decision forests and used for classification. Many different combinations of features and parameters are explored to find the best possible model configuration. Our analysis shows that very good performance (~93% accuracy) can be achieved with a fairly simple model.
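    A minimal sketch of this patch-based pipeline using scikit-learn's random forest; the exact feature definitions and the random training data below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
LABELS = ["skin", "hair", "eyes", "nose", "mouth", "background"]

def patch_features(img, y, x, size=16):
    """Color, shape, and location features for one square patch."""
    patch = img[y:y + size, x:x + size].astype(float)
    color = patch.reshape(-1, 3).mean(axis=0)             # mean RGB (color)
    gy, gx = np.gradient(patch.mean(axis=2))              # grey-level gradients
    shape = np.histogram(np.arctan2(gy, gx), bins=8)[0]   # orientation histogram (shape)
    loc = np.array([y / img.shape[0], x / img.shape[1]])  # normalised position (location)
    return np.concatenate([color, shape, loc])

# Toy data: one random "image" with uniformly sampled, randomly labelled patches.
img = rng.integers(0, 256, (128, 128, 3))
coords = rng.integers(0, 112, (200, 2))
X = np.stack([patch_features(img, y, x) for y, x in coords])
y = rng.integers(0, len(LABELS), 200)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(LABELS[clf.predict(X[:1])[0]])                      # predicted class for one patch
```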

    Improving Deep Representation Learning with Complex and Multimodal Data.

    Representation learning has emerged as a way to learn meaningful representations from data and has enabled breakthroughs in many applications, including visual object recognition, speech recognition, and text understanding. However, learning representations from complex high-dimensional sensory data is challenging, since there exist many irrelevant factors of variation (e.g., data transformation, random noise). On the other hand, to build an end-to-end prediction system for structured output variables, one needs to incorporate probabilistic inference to properly model a mapping from a single input to possible configurations of output variables. This thesis addresses the limitations of current representation learning in two parts. The first part discusses efficient learning algorithms for invariant representations based on restricted Boltzmann machines (RBMs). Pointing out the difficulty of learning, we develop an efficient initialization method for sparse and convolutional RBMs. On top of that, we develop variants of the RBM that learn representations invariant to data transformations such as translation, rotation, or scale variation by pooling the filter responses of input data after a transformation, or invariant to irrelevant patterns such as random or structured noise, by jointly performing feature selection and feature learning. We demonstrate improved performance on visual object recognition and weakly supervised foreground object segmentation. The second part discusses conditional graphical models and learning frameworks for structured output variables using deep generative models as priors. For example, we combine the best properties of the CRF and the RBM to enforce both local and global (e.g., object shape) consistencies for visual object segmentation. Furthermore, we develop a deep conditional generative model of structured output variables, which is an end-to-end system trainable by backpropagation. We demonstrate the importance of a global prior and probabilistic inference for visual object segmentation. Second, we develop a novel multimodal learning framework by casting the problem as a structured output representation learning problem, where the output is one data modality to be predicted from the other modalities, and vice versa. We explain how our method can be more effective than maximum likelihood learning and demonstrate state-of-the-art performance on visual-text and visual-only recognition tasks.
    PhD thesis, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113549/1/kihyuks_1.pd
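    As a concrete reference for the RBM machinery the thesis builds on, here is a generic binary RBM trained with one-step contrastive divergence in NumPy; this is a textbook sketch, not the thesis's sparse/convolutional variants or its initialization scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

class RBM:
    """Binary restricted Boltzmann machine with CD-1 updates."""
    def __init__(self, n_vis, n_hid, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b = np.zeros(n_vis)   # visible bias
        self.c = np.zeros(n_hid)   # hidden bias
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1(self, v0):
        """One contrastive-divergence step on a batch v0 of shape (batch, n_vis)."""
        ph0 = self._sigmoid(v0 @ self.W + self.c)          # P(h|v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden units
        pv1 = self._sigmoid(h0 @ self.W.T + self.b)        # reconstruction P(v|h0)
        ph1 = self._sigmoid(pv1 @ self.W + self.c)         # P(h|v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b += self.lr * (v0 - pv1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)

# Toy usage: fit 64-dimensional binary patterns.
data = (rng.random((100, 64)) < 0.3).astype(float)
rbm = RBM(n_vis=64, n_hid=16)
for _ in range(50):
    rbm.cd1(data)
```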

    A CNN cascade for landmark guided semantic part segmentation

    This paper proposes a CNN cascade for semantic part segmentation guided by pose-specific information encoded in terms of a set of landmarks (or keypoints). There is a large amount of prior work on each of these tasks separately, yet, to the best of our knowledge, this is the first time in the literature that the interplay between pose estimation and semantic part segmentation has been investigated. To address this limitation of prior work, in this paper we propose a CNN cascade of tasks that first performs landmark localisation and then uses this information as input for guiding semantic part segmentation. We apply our architecture to the problem of facial part segmentation and report a large performance improvement over the standard unguided network on the most challenging face datasets. Testing code and models will be published online at http://cs.nott.ac.uk/~psxasj/
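    The cascade idea (first predict landmark heatmaps, then stack them with the image as extra input channels for the segmentation network) can be sketched in PyTorch; the tiny convolutional stacks below are placeholder assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class LandmarkNet(nn.Module):
    """Stage 1: predicts K landmark heatmaps from an RGB face image."""
    def __init__(self, k=68):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, k, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class GuidedSegNet(nn.Module):
    """Stage 2: segments facial parts from the image stacked with the heatmaps."""
    def __init__(self, k=68, n_parts=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + k, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, n_parts, 3, padding=1),
        )

    def forward(self, x, heatmaps):
        return self.net(torch.cat([x, heatmaps], dim=1))   # landmark-guided input

img = torch.randn(1, 3, 128, 128)
stage1, stage2 = LandmarkNet(), GuidedSegNet()
part_logits = stage2(img, stage1(img))                     # (1, n_parts, 128, 128)
```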

    Dynamic Face Video Segmentation via Reinforcement Learning

    For real-time semantic video segmentation, most recent works utilise a dynamic framework with a key scheduler to make online key/non-key decisions. Some works use a fixed key scheduling policy, while others propose adaptive key scheduling methods based on heuristic strategies, both of which may lead to suboptimal global performance. To overcome this limitation, we model the online key decision process in dynamic video segmentation as a deep reinforcement learning problem and learn an efficient and effective scheduling policy from expert information about decision history and from the process of maximising global return. Moreover, we study the application of dynamic video segmentation to face videos, a field that has not been investigated before. Evaluating on the 300VW dataset, we show that our reinforcement key scheduler outperforms various baselines in terms of both effective key selections and running speed. Further results on the Cityscapes dataset demonstrate that our proposed method can also generalise to other scenarios. To the best of our knowledge, this is the first work to use reinforcement learning for online key-frame decisions in dynamic video segmentation, and also the first work on its application to face videos.
    Comment: CVPR 2020. 300VW with segmentation labels is available at: https://github.com/mapleandfire/300VW-Mas
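    The online key/non-key decision loop can be sketched as a small stochastic policy acting on a frame-deviation feature; the feature extraction and policy architecture below are placeholder assumptions, and policy-gradient training against the global return is omitted:

```python
import torch
import torch.nn as nn

class KeyPolicy(nn.Module):
    """Maps a deviation feature to probabilities over {propagate, new key}."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, f):
        return torch.softmax(self.mlp(f), dim=-1)

policy = KeyPolicy()
key_feat = torch.randn(64)               # features cached at the last key frame
for t in range(10):                      # stream of incoming frames
    cur_feat = torch.randn(64)           # cheap per-frame feature (placeholder)
    probs = policy(cur_feat - key_feat)
    action = torch.multinomial(probs, 1).item()   # 0 = propagate, 1 = new key
    if action == 1:
        key_feat = cur_feat              # run the full segmentation network here
    # else: warp/reuse the key frame's segmentation for frame t
```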

    Reconstructive Sparse Code Transfer for Contour Detection and Semantic Labeling

    We frame the task of predicting a semantic labeling as a sparse reconstruction procedure that applies a target-specific learned transfer function to a generic deep sparse code representation of an image. This strategy partitions training into two distinct stages. First, in an unsupervised manner, we learn a set of generic dictionaries optimized for sparse coding of image patches. We train a multilayer representation via recursive sparse dictionary learning on pooled codes output by earlier layers. Second, we encode all training images with the generic dictionaries and learn a transfer function that optimizes reconstruction of patches extracted from annotated ground-truth given the sparse codes of their corresponding image patches. At test time, we encode a novel image using the generic dictionaries and then reconstruct using the transfer function. The output reconstruction is a semantic labeling of the test image. Applying this strategy to the task of contour detection, we demonstrate performance competitive with state-of-the-art systems. Unlike almost all prior work, our approach obviates the need for any form of hand-designed features or filters. To illustrate general applicability, we also show initial results on semantic part labeling of human faces. The effectiveness of our approach opens new avenues for research on deep sparse representations. Our classifiers utilize this representation in a novel manner. Rather than acting on nodes in the deepest layer, they attach to nodes along a slice through multiple layers of the network in order to make predictions about local patches. Our flexible combination of a generatively learned sparse representation with discriminatively trained transfer classifiers extends the notion of sparse reconstruction to encompass arbitrary semantic labeling tasks.
    Comment: to appear in Asian Conference on Computer Vision (ACCV), 201
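    A two-stage toy version of this pipeline in scikit-learn: unsupervised dictionary learning yields sparse codes, then a supervised transfer function maps codes to label patches. Ridge regression stands in for the paper's transfer classifiers, and all data below is random:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
patches = rng.standard_normal((500, 64))          # flattened 8x8 image patches

# Stage 1 (unsupervised): learn a generic dictionary for sparse coding.
dico = MiniBatchDictionaryLearning(n_components=32, alpha=1.0, random_state=0)
codes = dico.fit_transform(patches)               # sparse codes of the patches

# Stage 2 (supervised): learn a transfer function from sparse codes to
# ground-truth label patches (random targets stand in for annotations).
targets = rng.standard_normal((500, 64))
transfer = Ridge(alpha=1.0).fit(codes, targets)

# Test time: encode a novel patch, then reconstruct its semantic labeling.
new_code = dico.transform(rng.standard_normal((1, 64)))
label_patch = transfer.predict(new_code).reshape(8, 8)
```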