
    Spatiotemporal CNN for Video Object Segmentation

    In this paper, we present a unified, end-to-end trainable spatiotemporal CNN model for VOS, which consists of two branches, i.e., the temporal coherence branch and the spatial segmentation branch. Specifically, the temporal coherence branch, pretrained in an adversarial fashion on unlabeled video data, is designed to capture the dynamic appearance and motion cues of video sequences to guide object segmentation. The spatial segmentation branch focuses on segmenting objects accurately based on the learned appearance and motion cues. To obtain accurate segmentation results, we design a coarse-to-fine process that sequentially applies a designed attention module on multi-scale feature maps and concatenates them to produce the final prediction. In this way, the spatial segmentation branch is enforced to gradually concentrate on object regions. These two branches are jointly fine-tuned on video segmentation sequences in an end-to-end manner. Several experiments are carried out on three challenging datasets (i.e., DAVIS-2016, DAVIS-2017 and YouTube-Objects) to show that our method achieves favorable performance against state-of-the-art methods. Code is available at https://github.com/longyin880815/STCNN. Comment: 10 pages, 3 figures, 6 tables, CVPR 2019
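    As a rough illustration of the coarse-to-fine step the abstract describes, here is a minimal sketch (not the authors' released code; module names, channel sizes, and the sigmoid gating are assumptions) of applying a spatial attention module per scale, upsampling, and concatenating the attended maps for the final prediction:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # per-pixel attention logit

    def forward(self, feat):
        attn = torch.sigmoid(self.score(feat))              # (B, 1, H, W) in [0, 1]
        return feat * attn                                  # re-weight features spatially

class CoarseToFineHead(nn.Module):
    def __init__(self, channels_per_scale=(256, 128, 64)):
        super().__init__()
        self.attns = nn.ModuleList([SpatialAttention(c) for c in channels_per_scale])
        self.classifier = nn.Conv2d(sum(channels_per_scale), 1, kernel_size=1)

    def forward(self, multi_scale_feats):
        # multi_scale_feats: coarse -> fine; attend each scale, upsample to the finest
        target = multi_scale_feats[-1].shape[-2:]
        attended = [
            F.interpolate(attn(f), size=target, mode="bilinear", align_corners=False)
            for attn, f in zip(self.attns, multi_scale_feats)
        ]
        return self.classifier(torch.cat(attended, dim=1))  # final mask logits
```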

    DADA: Depth-aware Domain Adaptation in Semantic Segmentation

    Unsupervised domain adaptation (UDA) is important for applications where large-scale annotation of representative data is challenging. For semantic segmentation in particular, it makes it possible to deploy, on real "target domain" data, models that are trained on annotated images from a different "source domain", notably a virtual environment. To this end, most previous works consider semantic segmentation as the only mode of supervision for source-domain data, while ignoring other possibly available information such as depth. In this work, we aim at making the best use of such privileged information while training the UDA model. We propose a unified depth-aware UDA framework that leverages the knowledge of dense depth in the source domain in several complementary ways. As a result, the performance of the trained semantic segmentation model on the target domain is boosted. Our novel approach indeed achieves state-of-the-art performance on different challenging synthetic-2-real benchmarks. Comment: Accepted in ICCV'19
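    One way to read "leveraging dense depth in the source domain" concretely: a hedged sketch (assumed layer names and a simple sigmoid fusion, not the published DADA architecture) of an auxiliary depth head whose prediction modulates the segmentation features, with depth supervised only where source ground truth exists:

```python
import torch
import torch.nn as nn

class DepthAwareHead(nn.Module):
    def __init__(self, feat_channels, num_classes):
        super().__init__()
        self.depth = nn.Conv2d(feat_channels, 1, kernel_size=1)         # dense depth regression
        self.seg = nn.Conv2d(feat_channels, num_classes, kernel_size=1)

    def forward(self, feats):
        depth = self.depth(feats)
        fused = feats * torch.sigmoid(depth)   # one simple depth-feature fusion choice
        return self.seg(fused), depth

# Source batches get both losses; target batches would only feed the
# (not shown) adversarial alignment step:
#   seg_logits, depth_pred = head(backbone(src_images))
#   loss = ce(seg_logits, src_labels) + l1(depth_pred, src_depth)
```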

    An End-to-end 3D Convolutional Neural Network for Action Detection and Segmentation in Videos

    In this paper, we propose an end-to-end 3D CNN for action detection and segmentation in videos. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolution features. A video is first divided into equal-length clips, and then for each clip a set of tube proposals is generated based on 3D CNN features. Finally, the tube proposals of different clips are linked together and spatio-temporal action detection is performed using these linked video proposals. This top-down action detection approach explicitly relies on a set of good tube proposals to perform well, and training the bounding-box regression usually requires a large number of annotated samples. To remedy this, we further extend the 3D CNN to an encoder-decoder structure and formulate the localization problem as action segmentation. The foreground regions (i.e. action regions) for each frame are segmented first, and then the segmented foreground maps are used to generate the bounding boxes. This bottom-up approach effectively avoids tube proposal generation by leveraging the pixel-wise annotations of segmentation. The segmentation framework can also be readily applied to the general problem of video object segmentation. Extensive experiments on several video datasets demonstrate the superior performance of our approach for action detection and video object segmentation compared to state-of-the-art methods. Comment: arXiv admin note: substantial text overlap with arXiv:1703.1066
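    To make the linking step concrete, here is a hedged sketch (a greedy simplification; such systems often use dynamic programming instead, and the weight alpha is an assumption) that chains per-clip tube proposals by actionness score plus bounding-box overlap at clip boundaries:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_tubes(clips, alpha=1.0):
    """clips: list of per-clip proposal lists; each proposal is
    {'boxes': [box_first, ..., box_last], 'score': actionness}."""
    track = [max(clips[0], key=lambda p: p["score"])]
    for proposals in clips[1:]:
        prev = track[-1]
        # link to the proposal maximizing actionness + boundary overlap
        track.append(max(
            proposals,
            key=lambda p: p["score"] + alpha * iou(prev["boxes"][-1], p["boxes"][0]),
        ))
    return track
```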

    Dominant Sets for "Constrained" Image Segmentation

    Image segmentation has come a long way since the early days of computer vision, yet it remains a challenging task. Modern variations of the classical (purely bottom-up) approach involve, e.g., some form of user assistance (interactive segmentation) or ask for the simultaneous segmentation of two or more images (co-segmentation). At an abstract level, all these variants can be thought of as "constrained" versions of the original formulation, whereby the segmentation process is guided by some external source of information. In this paper, we propose a new approach to tackle this kind of problem in a unified way. Our work is based on some properties of a family of quadratic optimization problems related to dominant sets, a well-known graph-theoretic notion of a cluster which generalizes the concept of a maximal clique to edge-weighted graphs. In particular, we show that by properly controlling a regularization parameter which determines the structure and the scale of the underlying problem, we are in a position to extract groups of dominant-set clusters that are constrained to contain predefined elements. We focus in particular on interactive segmentation and co-segmentation (in both the unsupervised and the interactive versions). The proposed algorithm can deal naturally with several types of constraints and input modalities, including scribbles, sloppy contours, and bounding boxes, and is able to robustly handle noisy annotations on the part of the user. Experiments on standard benchmark datasets show the effectiveness of our approach as compared to state-of-the-art algorithms on a variety of natural images under several input conditions and constraints. Comment: arXiv admin note: text overlap with arXiv:1608.0064
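    A minimal sketch of the regularized quadratic program this describes, under the following assumptions: the program is max x'(A - alpha*I_off)x over the simplex, with the penalty alpha placed on diagonal entries *outside* the constraint set S so any extracted cluster must overlap S, and it is solved here with plain replicator dynamics (the paper's exact formulation and solver may differ):

```python
import numpy as np

def constrained_dominant_set(A, S, alpha=1.0, iters=1000, tol=1e-8):
    """A: symmetric nonnegative affinity matrix; S: indices the extracted
    cluster must overlap (e.g. user scribbles). Returns a simplex point
    whose support approximates the constrained dominant set."""
    n = A.shape[0]
    B = A.astype(float).copy()
    off = np.setdiff1d(np.arange(n), np.asarray(S))
    B[off, off] -= alpha        # penalize diagonal entries outside S
    B -= B.min()                # constant shift: same optima on the simplex, nonnegative payoffs
    x = np.full(n, 1.0 / n)     # start from the simplex barycenter
    for _ in range(iters):
        y = x * (B @ x)         # replicator dynamics update
        s = y.sum()
        if s <= tol:
            break
        y /= s
        if np.abs(y - x).sum() < tol:
            return y
        x = y
    return x
```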

    A Unified Framework for Mutual Improvement of SLAM and Semantic Segmentation

    This paper presents a novel framework for simultaneously performing localization and segmentation, two of the most important vision-based tasks for robotics. While their goals and the techniques used for them were previously considered to be different, we show that by making use of the intermediate results of the two modules, their performance can be enhanced at the same time. Our framework is able to handle both the instantaneous motion and the long-term changes of instances in localization with the help of the segmentation result, which in turn benefits from the refined 3D pose information. We conduct experiments on various datasets, and show that our framework effectively improves the precision and robustness of the two tasks and outperforms existing localization and segmentation algorithms. Comment: 7 pages, 5 figures. This work has been accepted by ICRA 2019. The demo video can be found at https://youtu.be/Bkt53dAehj
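    One concrete (and hedged) example of segmentation helping localization, in the spirit of the coupling described above: dropping SLAM keypoints that land on pixels segmented as movable classes, so pose estimation relies on static structure. The class ids and function names here are illustrative, not taken from the paper:

```python
import numpy as np

DYNAMIC_CLASSES = {11, 12, 13}   # e.g. person, rider, car in some label map (assumed ids)

def filter_static_keypoints(keypoints, seg_mask):
    """keypoints: (N, 2) array of (x, y) pixel coords; seg_mask: (H, W) class ids.
    Keep only keypoints lying on pixels not segmented as dynamic classes."""
    xs = keypoints[:, 0].astype(int)
    ys = keypoints[:, 1].astype(int)
    labels = seg_mask[ys, xs]
    keep = ~np.isin(labels, list(DYNAMIC_CLASSES))
    return keypoints[keep]
```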

    YUVMultiNet: Real-time YUV multi-task CNN for autonomous driving

    In this paper, we propose a multi-task convolutional neural network (CNN) architecture optimized for a low-power automotive-grade SoC. We introduce a network based on a unified architecture where the encoder is shared between the two tasks, namely detection and segmentation. The proposed network runs at 25 FPS for 1280x800 resolution. We briefly discuss the methods used to optimize the network architecture, such as using native YUV images directly, optimization of layers & feature maps, and applying quantization. We also focus on memory bandwidth in our design, as convolutions are data intensive and most SoCs are bandwidth-bottlenecked. We then demonstrate the efficiency of our proposed network on a dedicated CNN accelerator, presenting the key performance indicators (KPI) for the detection and segmentation tasks obtained from the hardware execution and the corresponding run-time. Comment: This paper is accepted for CVPR workshop demo
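    A bare-bones sketch of the shared-encoder layout the abstract describes (layer sizes, anchor count, and head designs are placeholders, not the production network): the encoder runs once and both task heads consume the same features, which is what saves compute and memory bandwidth:

```python
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, num_classes=4, num_anchors=9):
        super().__init__()
        self.encoder = nn.Sequential(           # stand-in for the shared YUV encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.seg_head = nn.Conv2d(64, num_classes, 1)      # per-pixel class logits
        self.det_head = nn.Conv2d(64, num_anchors * 5, 1)  # box deltas + objectness

    def forward(self, x):
        feats = self.encoder(x)   # computed once, shared by both tasks
        return self.det_head(feats), self.seg_head(feats)
```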

    Unified Multi-Criteria Chinese Word Segmentation with BERT

    Multi-Criteria Chinese Word Segmentation (MCCWS) aims at finding word boundaries in a Chinese sentence composed of continuous characters while multiple segmentation criteria exist. The unified framework has been widely used in MCCWS and has shown its effectiveness. Besides, the pre-trained BERT language model has also been introduced into the MCCWS task in a multi-task learning framework. In this paper, we combine the strengths of the unified framework and the pretrained language model, and propose a unified MCCWS model based on BERT. Moreover, we augment the unified BERT-based MCCWS model with bigram features and an auxiliary criterion-classification task. Experiments on eight datasets with diverse criteria demonstrate that our method achieves new state-of-the-art results for MCCWS.
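    In unified MCCWS frameworks, a common mechanism is a criterion-specific marker token prepended to the input so one model can segment under any of the criteria. A hedged sketch (the marker names are invented and the paper's exact input scheme may differ), using a HuggingFace-style tokenizer interface:

```python
# One marker token per corpus/criterion; register them once with
# tokenizer.add_tokens(CRITERIA) so they map to dedicated ids.
CRITERIA = ["<pku>", "<msr>", "<ctb>", "<as>"]

def build_input(sentence, criterion, tokenizer):
    """Prepend the criterion marker so the shared BERT encoder knows which
    segmentation standard to follow for this sentence."""
    assert criterion in CRITERIA
    tokens = [criterion] + tokenizer.tokenize(sentence)
    return tokenizer.convert_tokens_to_ids(tokens)
```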

    SG-One: Similarity Guidance Network for One-Shot Semantic Segmentation

    One-shot image semantic segmentation poses the challenging task of recognizing object regions from unseen categories with only one annotated example as supervision. In this paper, we propose a simple yet effective Similarity Guidance network to tackle the One-shot (SG-One) segmentation problem. We aim at predicting the segmentation mask of a query image with reference to one densely labeled support image of the same category. To obtain a robust representative feature of the support image, we first adopt a masked average pooling strategy to produce the guidance features, taking only the pixels belonging to the support object into account. We then leverage the cosine similarity to build the relationship between the guidance features and the features of pixels from the query image. In this way, the produced similarity maps can be used to guide the process of segmenting objects. Furthermore, SG-One is a unified framework which can efficiently process both support and query images within one network and be learned in an end-to-end manner. We conduct extensive experiments on Pascal VOC 2012. In particular, our SG-One achieves an mIoU score of 46.3%, surpassing the baseline methods.
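    The two operations the abstract names are simple to render in code. A minimal sketch (shapes and interpolation choices are illustrative, not the authors' implementation): masked average pooling over the support features, then cosine similarity against every query pixel to form the guidance map:

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(support_feats, support_mask):
    """support_feats: (B, C, H, W); support_mask: (B, 1, H, W) binary.
    Returns a (B, C) object prototype averaged over masked pixels only."""
    mask = F.interpolate(support_mask, size=support_feats.shape[-2:], mode="nearest")
    pooled = (support_feats * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)
    return pooled

def similarity_guidance(query_feats, prototype):
    """query_feats: (B, C, H, W); prototype: (B, C) -> (B, 1, H, W) cosine map."""
    proto = prototype[:, :, None, None]    # broadcast the prototype over space
    return F.cosine_similarity(query_feats, proto, dim=1).unsqueeze(1)
```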

    Joint Iris Segmentation and Localization Using Deep Multi-task Learning Framework

    Iris segmentation and localization in non-cooperative environments is challenging due to illumination variations, long distances, moving subjects, limited user cooperation, etc. Traditional methods often suffer from poor performance when confronted with iris images captured in these conditions. Recent studies have shown that deep learning methods can achieve impressive performance on the iris segmentation task. In addition, as the iris is defined as an annular region between the pupil and the sclera, geometric constraints can be imposed to help locate the iris more accurately and improve the segmentation results. In this paper, we propose a deep multi-task learning framework, named IrisParseNet, to exploit the inherent correlations between pupil, iris and sclera to boost the performance of iris segmentation and localization in a unified model. In particular, IrisParseNet first applies a Fully Convolutional Encoder-Decoder Attention Network to simultaneously estimate the pupil center, the iris segmentation mask and the iris inner/outer boundary. Then, an effective post-processing method is adopted for iris inner/outer circle localization. To train and evaluate the proposed method, we manually label three challenging iris datasets, namely CASIA-Iris-Distance, UBIRIS.v2, and MICHE-I, which cover various types of noise. Extensive experiments are conducted on these newly annotated datasets, and the results show that our method outperforms state-of-the-art methods on various benchmarks. All the ground-truth annotations, annotation codes and evaluation protocols are publicly available at https://github.com/xiamenwcy/IrisParseNet. Comment: 13 pages
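    The abstract does not spell out the post-processing; one standard way to turn a predicted inner/outer boundary map into circle parameters is an algebraic least-squares (Kasa) circle fit, sketched below purely as an illustration of such a geometric localization step, not as the paper's method:

```python
import numpy as np

def fit_circle(points):
    """points: (N, 2) array of (x, y) boundary pixels -> (cx, cy, r).
    Solves min ||x^2 + y^2 - 2*cx*x - 2*cy*y - c|| in the least-squares
    sense, with r^2 = c + cx^2 + cy^2."""
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([2 * x, 2 * y, np.ones(len(x))])
    b = x ** 2 + y ** 2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    r = np.sqrt(c + cx ** 2 + cy ** 2)
    return cx, cy, r
```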

    Dynamic-structured Semantic Propagation Network

    The semantic concept hierarchy is still under-explored for semantic segmentation due to the inefficiency and complicated optimization of incorporating structural inference into dense prediction. This lack of modeling of semantic correlations also forces prior works to tune highly specialized models for each task due to the label discrepancy across datasets. It severely limits the generalization capability of segmentation models for an open-set concept vocabulary and for annotation utilization. In this paper, we propose a Dynamic-Structured Semantic Propagation Network (DSSPN) that builds a semantic neuron graph by explicitly incorporating the semantic concept hierarchy into the network construction. Each neuron represents an instantiated module for recognizing a specific type of entity, such as a super-class (e.g. food) or a specific concept (e.g. pizza). During training, DSSPN builds the dynamic-structured neuron computation graph by activating only a sub-graph of neurons for each image in a principled way. A dense semantic-enhanced neural block is proposed to propagate the learned knowledge of all ancestor neurons into each fine-grained child neuron for feature evolving. Another merit of such a semantically explainable structure is the ability to learn a unified model concurrently on diverse datasets by selectively activating different neuron sub-graphs for each annotation at each step. Extensive experiments on four public semantic segmentation datasets (i.e. ADE20K, COCO-Stuff, Cityscapes and Mapillary) demonstrate the superiority of our DSSPN over state-of-the-art segmentation models. Moreover, we demonstrate that a universal segmentation model jointly trained on diverse datasets can surpass the performance of the common fine-tuning scheme for exploiting multiple-domain knowledge. Comment: CVPR 2018
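    A toy sketch of the sub-graph activation idea (the hierarchy contents and string representation are invented for illustration; DSSPN's actual neurons are neural modules): only concepts on root-to-label paths are activated for a given image, so each training step touches a small slice of the full graph:

```python
# Child -> parent links for a tiny example concept hierarchy.
PARENT = {"pizza": "food", "cake": "food", "food": "entity",
          "car": "vehicle", "vehicle": "entity", "entity": None}

def active_subgraph(image_labels):
    """Return the set of concept neurons to activate for this image:
    every label present plus all of its ancestors up to the root."""
    active = set()
    for label in image_labels:
        node = label
        while node is not None:
            active.add(node)
            node = PARENT[node]
    return active

# active_subgraph({"pizza", "car"})
# -> {"pizza", "food", "car", "vehicle", "entity"}
```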