Spatiotemporal CNN for Video Object Segmentation
In this paper, we present a unified, end-to-end trainable spatiotemporal CNN model for VOS, which consists of two branches, i.e., the temporal coherence branch and the spatial segmentation branch. Specifically, the temporal coherence branch, pretrained in an adversarial fashion on unlabeled video data, is designed to capture the dynamic appearance and motion cues of video sequences to guide object segmentation. The spatial segmentation branch focuses on segmenting objects accurately based on the learned appearance and motion cues. To obtain accurate segmentation results, we design a coarse-to-fine process that sequentially applies an attention module to multi-scale feature maps and concatenates them to produce the final prediction (a minimal version is sketched below). In this way, the spatial segmentation branch is forced to gradually concentrate on object regions. The two branches are jointly fine-tuned on video segmentation sequences in an end-to-end manner. Experiments on three challenging datasets (i.e., DAVIS-2016, DAVIS-2017 and YouTube-Objects) show that our method achieves favorable performance against state-of-the-art methods. Code is available at https://github.com/longyin880815/STCNN.
Comment: 10 pages, 3 figures, 6 tables, CVPR 2019
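For concreteness, here is a minimal, hypothetical PyTorch sketch of a coarse-to-fine attention scheme over multi-scale feature maps. The module names, channel sizes, and single-map attention design are illustrative assumptions, not the authors' STCNN implementation.

# Sketch: attend to each scale, upsample to the finest resolution,
# concatenate, and predict mask logits. All shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):
        # Per-pixel attention map in [0, 1] that re-weights the features.
        attn = torch.sigmoid(self.conv(feat))
        return feat * attn

class CoarseToFineHead(nn.Module):
    def __init__(self, channels_per_scale=(256, 256, 256)):
        super().__init__()
        self.attns = nn.ModuleList(SpatialAttention(c) for c in channels_per_scale)
        self.predict = nn.Conv2d(sum(channels_per_scale), 1, kernel_size=1)

    def forward(self, feats):
        # feats: list of multi-scale feature maps, coarsest first.
        out_size = feats[-1].shape[-2:]
        attended = [F.interpolate(attn(f), size=out_size, mode='bilinear',
                                  align_corners=False)
                    for attn, f in zip(self.attns, feats)]
        # Concatenate attended scales and predict the final mask logits.
        return self.predict(torch.cat(attended, dim=1))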
DADA: Depth-aware Domain Adaptation in Semantic Segmentation
Unsupervised domain adaptation (UDA) is important for applications where large-scale annotation of representative data is challenging. For semantic segmentation in particular, it helps deploy, on real "target domain" data, models that are trained on annotated images from a different "source domain", notably a virtual environment. To this end, most previous works consider semantic segmentation as the only mode of supervision for source-domain data, ignoring other, possibly available, information such as depth. In this work, we aim to make the best use of such privileged information while training the UDA model. We propose a unified depth-aware UDA framework that leverages knowledge of dense depth in the source domain in several complementary ways (a minimal form of such auxiliary supervision is sketched below). As a result, the performance of the trained semantic segmentation model on the target domain is boosted. Our approach achieves state-of-the-art performance on several challenging synthetic-2-real benchmarks.
Comment: Accepted in ICCV'19
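As an illustration of the general idea of treating source-domain depth as privileged supervision, the following is a hedged PyTorch sketch of a combined source-domain loss. The plain L1 depth term and the weighting are assumptions for illustration; DADA's actual fusion and adaptation mechanisms are more elaborate.

# Sketch: segmentation cross-entropy plus an auxiliary depth term on
# labeled source images. Loss choice and weight are assumptions.
import torch
import torch.nn.functional as F

def source_domain_loss(seg_logits, depth_pred, seg_labels, depth_gt,
                       depth_weight=0.001):
    # Standard cross-entropy for semantic segmentation on source images.
    seg_loss = F.cross_entropy(seg_logits, seg_labels, ignore_index=255)
    # Simple L1 depth regression as the privileged-information term.
    depth_loss = F.l1_loss(depth_pred, depth_gt)
    return seg_loss + depth_weight * depth_loss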
An End-to-end 3D Convolutional Neural Network for Action Detection and Segmentation in Videos
In this paper, we propose an end-to-end 3D CNN for action detection and segmentation in videos. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolution features. A video is first divided into equal-length clips, and for each clip a set of tube proposals is generated based on 3D CNN features. Finally, the tube proposals of different clips are linked together, and spatio-temporal action detection is performed using these linked video proposals. This top-down action detection approach explicitly relies on a set of good tube proposals to perform well, and training the bounding-box regression usually requires a large number of annotated samples. To remedy this, we further extend the 3D CNN to an encoder-decoder structure and formulate the localization problem as action segmentation. The foreground regions (i.e., action regions) for each frame are segmented first; the segmented foreground maps are then used to generate the bounding boxes, as sketched below. This bottom-up approach effectively avoids tube-proposal generation by leveraging the pixel-wise annotations of segmentation. The segmentation framework can also be readily applied to the general problem of video object segmentation. Extensive experiments on several video datasets demonstrate the superior performance of our approach for action detection and video object segmentation compared to state-of-the-art methods.
Comment: arXiv admin note: substantial text overlap with arXiv:1703.1066
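A minimal sketch of the bottom-up step described above: derive per-frame bounding boxes from a binary foreground mask via connected components. SciPy's labeling is used here as a stand-in, and the area threshold is an illustrative assumption.

# Sketch: label connected foreground regions and emit their boxes.
import numpy as np
from scipy import ndimage

def boxes_from_foreground(mask, min_area=50):
    # mask: (H, W) binary array where 1 marks predicted action pixels.
    labeled, num = ndimage.label(mask)
    boxes = []
    for region in ndimage.find_objects(labeled):
        ys, xs = region
        # Drop tiny regions by bounding-box area.
        if (ys.stop - ys.start) * (xs.stop - xs.start) >= min_area:
            boxes.append((xs.start, ys.start, xs.stop, ys.stop))  # x1, y1, x2, y2
    return boxes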
Dominant Sets for "Constrained" Image Segmentation
Image segmentation has come a long way since the early days of computer vision, yet it remains a challenging task. Modern variations of the classical (purely bottom-up) approach involve, e.g., some form of user assistance (interactive segmentation) or ask for the simultaneous segmentation of two or more images (co-segmentation). At an abstract level, all these variants can be thought of as "constrained" versions of the original formulation, whereby the segmentation process is guided by some external source of information. In this paper, we propose a new approach to tackle this kind of problem in a unified way. Our work is based on some properties of a family of quadratic optimization problems related to dominant sets, a well-known graph-theoretic notion of a cluster which generalizes the concept of a maximal clique to edge-weighted graphs (see the program sketched below). In particular, we show that by properly controlling a regularization parameter which determines the structure and the scale of the underlying problem, we can extract groups of dominant-set clusters that are constrained to contain predefined elements. We focus in particular on interactive segmentation and co-segmentation (in both the unsupervised and the interactive versions). The proposed algorithm can deal naturally with several types of constraints and input modalities, including scribbles, sloppy contours, and bounding boxes, and is able to robustly handle noisy annotations on the part of the user. Experiments on standard benchmark datasets show the effectiveness of our approach as compared to state-of-the-art algorithms on a variety of natural images under several input conditions and constraints.
Comment: arXiv admin note: text overlap with arXiv:1608.0064
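For concreteness, the family of quadratic programs alluded to above can be written, following the standard constrained dominant-sets formulation as we understand it (treat this as a sketch, not a verbatim reproduction of the paper's notation):

\[
\max_{x \in \Delta} \; f_S^{\alpha}(x) = x^{\top}\bigl(A - \alpha \hat{I}_S\bigr)\, x,
\qquad
\Delta = \bigl\{ x \in \mathbb{R}^{n} : x \geq 0,\ \textstyle\sum_{i=1}^{n} x_i = 1 \bigr\},
\]

where A is the edge-weight (affinity) matrix of the graph, S is the set of constraint elements (e.g., user-scribbled pixels), \hat{I}_S is the diagonal matrix with ones at the entries corresponding to vertices outside S and zeros elsewhere, and \alpha is the regularization parameter mentioned in the abstract: for sufficiently large \alpha, local maximizers are guaranteed to have support containing elements of S.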
A Unified Framework for Mutual Improvement of SLAM and Semantic Segmentation
This paper presents a novel framework for simultaneously performing localization and segmentation, two of the most important vision-based tasks for robotics. While their goals and techniques were previously considered to be distinct, we show that by making use of the intermediate results of the two modules, their performance can be enhanced at the same time. Our framework is able to handle both the instantaneous motion and long-term changes of instances in localization with the help of the segmentation result, which in turn benefits from the refined 3D pose information. We conduct experiments on various datasets, and show that our framework effectively improves the precision and robustness of the two tasks and outperforms existing localization and segmentation algorithms.
Comment: 7 pages, 5 figures. This work has been accepted by ICRA 2019. The demo video can be found at https://youtu.be/Bkt53dAehj
YUVMultiNet: Real-time YUV multi-task CNN for autonomous driving
In this paper, we propose a multi-task convolutional neural network (CNN) architecture optimized for a low-power automotive-grade SoC. We introduce a network based on a unified architecture where the encoder is shared between the two tasks, namely detection and segmentation (see the layout sketched below). The proposed network runs at 25 FPS for 1280x800 resolution. We briefly discuss the methods used to optimize the network architecture, such as using native YUV images directly, optimizing layers and feature maps, and applying quantization. We also focus on memory bandwidth in our design, as convolutions are data-intensive and most SoCs are bandwidth-bottlenecked. We then demonstrate the efficiency of our proposed network on a dedicated CNN accelerator, presenting the key performance indicators (KPIs) for the detection and segmentation tasks obtained from the hardware execution and the corresponding run-time.
Comment: This paper is accepted for a CVPR workshop demo
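A toy PyTorch sketch of the shared-encoder, two-head layout the abstract describes. The layer sizes, three-plane YUV input, and head designs are illustrative assumptions and ignore the SoC-specific optimizations (quantization, YUV 4:2:0 chroma subsampling) the paper discusses.

# Sketch: one shared encoder feeding a segmentation head and a
# detection head. All dimensions are assumptions.
import torch
import torch.nn as nn

class TwoTaskNet(nn.Module):
    def __init__(self, num_classes=4, num_det_outputs=6):
        super().__init__()
        # Shared encoder consumes YUV planes directly (3 input channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Task-specific heads: dense segmentation and per-cell detection.
        self.seg_head = nn.Conv2d(64, num_classes, 1)
        self.det_head = nn.Conv2d(64, num_det_outputs, 1)

    def forward(self, yuv):
        shared = self.encoder(yuv)
        return self.seg_head(shared), self.det_head(shared)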
Unified Multi-Criteria Chinese Word Segmentation with BERT
Multi-Criteria Chinese Word Segmentation (MCCWS) aims at finding word boundaries in a Chinese sentence composed of continuous characters when multiple segmentation criteria exist. The unified framework has been widely used in MCCWS and has shown its effectiveness. Besides, the pre-trained BERT language model has also been introduced into the MCCWS task in a multi-task learning framework. In this paper, we combine the strengths of the unified framework and the pretrained language model, and propose a unified MCCWS model based on BERT (one common conditioning mechanism is sketched below). Moreover, we augment the unified BERT-based MCCWS model with bigram features and an auxiliary criterion-classification task. Experiments on eight datasets with diverse criteria demonstrate that our method achieves new state-of-the-art results for MCCWS.
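One common way a unified MCCWS model conditions a single network on the target criterion is to prepend a criterion-specific special token to the input before BERT encoding. The sketch below, using the Hugging Face transformers API, illustrates that mechanism; the token names are hypothetical, and the paper's exact conditioning, bigram-feature, and auxiliary-task details may differ.

# Sketch: add one special token per segmentation criterion and prepend
# it to each sentence. Token names are hypothetical.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
criterion_tokens = ["[PKU]", "[MSR]", "[AS]", "[CITYU]"]
tokenizer.add_special_tokens({"additional_special_tokens": criterion_tokens})

sentence = "共同创造美好的新世纪"
inputs = tokenizer("[PKU] " + sentence, return_tensors="pt")
# The input ids now start with [CLS], the criterion token, then the
# characters; a token-level classifier on top predicts B/M/E/S labels.

If these added tokens are fed to a model, its embedding matrix must be resized accordingly, e.g. model.resize_token_embeddings(len(tokenizer)).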
SG-One: Similarity Guidance Network for One-Shot Semantic Segmentation
One-shot image semantic segmentation poses the challenging task of recognizing object regions from unseen categories with only one annotated example as supervision. In this paper, we propose a simple yet effective Similarity Guidance network to tackle the One-shot (SG-One) segmentation problem. We aim to predict the segmentation mask of a query image with reference to one densely labeled support image of the same category. To obtain a robust representative feature of the support image, we first adopt a masked average pooling strategy to produce the guidance features by taking only the pixels belonging to the support image into account. We then leverage cosine similarity to build the relationship between the guidance features and the features of pixels from the query image (both operations are sketched below). In this way, the possibilities embedded in the produced similarity maps can be adapted to guide the process of segmenting objects. Furthermore, our SG-One is a unified framework which can efficiently process both support and query images within one network and be learned in an end-to-end manner. We conduct extensive experiments on Pascal VOC 2012. In particular, our SG-One achieves an mIoU score of 46.3%, surpassing the baseline methods.
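The two core operations the abstract names, masked average pooling and cosine-similarity guidance, can be sketched in a few lines of PyTorch; the shapes and the interpolation choice are illustrative assumptions.

# Sketch: pool support features over annotated pixels, then compare the
# resulting guidance vector with every query-pixel feature.
import torch
import torch.nn.functional as F

def masked_average_pooling(support_feat, support_mask):
    # support_feat: (B, C, H, W); support_mask: (B, 1, H, W) in {0, 1}.
    mask = F.interpolate(support_mask, size=support_feat.shape[-2:],
                         mode='bilinear', align_corners=False)
    # Average features over annotated pixels only.
    return (support_feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)

def similarity_guidance(query_feat, guidance):
    # query_feat: (B, C, H, W); guidance: (B, C) from the support image.
    guidance = guidance[:, :, None, None]
    # Cosine-similarity map, one value per query pixel, in [-1, 1].
    return F.cosine_similarity(query_feat, guidance, dim=1)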
Joint Iris Segmentation and Localization Using Deep Multi-task Learning Framework
Iris segmentation and localization in non-cooperative environments is challenging due to illumination variations, long distances, moving subjects, limited user cooperation, etc. Traditional methods often suffer from poor performance when confronted with iris images captured under these conditions. Recent studies have shown that deep learning methods can achieve impressive performance on the iris segmentation task. In addition, as the iris is defined as an annular region between the pupil and the sclera, geometric constraints can be imposed to help locate the iris more accurately and improve the segmentation results. In this paper, we propose a deep multi-task learning framework, named IrisParseNet, to exploit the inherent correlations between pupil, iris and sclera to boost the performance of iris segmentation and localization in a unified model. In particular, IrisParseNet first applies a Fully Convolutional Encoder-Decoder Attention Network to simultaneously estimate the pupil center, the iris segmentation mask and the iris inner/outer boundary. Then, an effective post-processing method is adopted for iris inner/outer circle localization (a simple variant is sketched below). To train and evaluate the proposed method, we manually label three challenging iris datasets, namely CASIA-Iris-Distance, UBIRIS.v2, and MICHE-I, which cover various types of noise. Extensive experiments are conducted on these newly annotated datasets, and the results show that our method outperforms state-of-the-art methods on various benchmarks. All the ground-truth annotations, annotation codes and evaluation protocols are publicly available at https://github.com/xiamenwcy/IrisParseNet.
Comment: 13 pages
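As an illustration of circle-localization post-processing of the kind described, here is a hedged NumPy sketch that fits a circle to predicted boundary pixels by linear least squares; the paper's actual procedure (which also uses the predicted pupil center) may be more involved.

# Sketch: algebraic circle fit x^2 + y^2 + D*x + E*y + F = 0.
import numpy as np

def fit_circle(boundary_mask):
    # boundary_mask: (H, W) binary map of predicted boundary pixels.
    ys, xs = np.nonzero(boundary_mask)
    # Solve for D, E, F in the least-squares sense.
    A = np.column_stack([xs, ys, np.ones_like(xs)])
    b = -(xs.astype(np.float64) ** 2 + ys.astype(np.float64) ** 2)
    (D, E, Fc), *_ = np.linalg.lstsq(A, b, rcond=None)
    # Recover center and radius from the algebraic coefficients.
    cx, cy = -D / 2.0, -E / 2.0
    r = np.sqrt(cx ** 2 + cy ** 2 - Fc)
    return cx, cy, r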
Dynamic-structured Semantic Propagation Network
The semantic concept hierarchy is still under-explored for semantic segmentation due to the inefficiency and complicated optimization of incorporating structural inference into dense prediction. This lack of modeling of semantic correlations also forces prior works to tune highly specialized models for each task due to the label discrepancy across datasets, which severely limits the generalization capability of segmentation models to open-set concept vocabularies and limits annotation utilization. In this paper, we propose a Dynamic-Structured Semantic Propagation Network (DSSPN) that builds a semantic neuron graph by explicitly incorporating the semantic concept hierarchy into the network construction. Each neuron represents the instantiated module for recognizing a specific type of entity, such as a super-class (e.g., food) or a specific concept (e.g., pizza). During training, DSSPN performs dynamic-structured neuron computation by activating only a sub-graph of neurons for each image in a principled way (a toy version of this activation rule is sketched below). A dense semantic-enhanced neural block is proposed to propagate the learned knowledge of all ancestor neurons into each fine-grained child neuron for feature evolving. Another merit of such a semantically explainable structure is the ability to learn a unified model concurrently on diverse datasets by selectively activating different neuron sub-graphs for each annotation at each step. Extensive experiments on four public semantic segmentation datasets (i.e., ADE20K, COCO-Stuff, Cityscapes and Mapillary) demonstrate the superiority of our DSSPN over state-of-the-art segmentation models. Moreover, we demonstrate that a universal segmentation model jointly trained on diverse datasets can surpass the performance of the common fine-tuning scheme for exploiting multiple-domain knowledge.
Comment: CVPR 2018
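A toy Python sketch of the dynamic activation idea: given a concept hierarchy, only the neuron modules on the paths from the root to the labels present in an image are evaluated. The hierarchy below is an illustrative assumption, not the paper's concept graph.

# Sketch: collect each image label together with all of its ancestors;
# only these concept modules would be run and trained for that image.
parent = {"pizza": "food", "cake": "food", "food": "entity",
          "car": "vehicle", "vehicle": "entity", "entity": None}

def active_neurons(labels_in_image):
    active = set()
    for label in labels_in_image:
        while label is not None:
            active.add(label)
            label = parent[label]
    return active

print(sorted(active_neurons({"pizza", "car"})))
# -> ['car', 'entity', 'food', 'pizza', 'vehicle']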