Prototype Refinement Network for Few-Shot Segmentation
Few-shot segmentation aims to segment new classes given only a few annotated
images. It is more challenging than traditional semantic segmentation, which
segments known classes with abundant annotated images. In this paper, we
propose a Prototype Refinement Network (PRNet) to address the challenge of
few-shot segmentation. It first learns to bidirectionally extract prototypes
from both support and query images of the known classes. Furthermore, to
extract representative prototypes of the new classes, we refine the prototypes
by adaptation and fusion. The adaptation step, implemented directly by
retraining, lets the model learn the new concepts. Prototype fusion, which we
are the first to propose, fuses support prototypes with query prototypes,
incorporating the knowledge from both sides; it refines the prototypes
effectively without introducing extra learnable parameters. In this way, the
prototypes become more discriminative in low-data regimes. Experiments on
PASCAL-5i and COCO-20i demonstrate the superiority of our method. On COCO-20i
in particular, PRNet significantly outperforms existing methods by a large
margin of 13.1% in the 1-shot setting.
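The abstract leaves the prototype operations abstract; the sketch below shows one standard reading, with prototypes pooled under masks and fused by a parameter-free convex combination. The function names and the fixed weight alpha are our illustrative assumptions (in practice the query prototype would come from the model's own predicted query mask), not the authors' code.

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(features, mask):
    """Pool a class prototype from features under a binary mask.

    features: (B, C, H, W) backbone features
    mask:     (B, 1, h, w) binary foreground mask
    """
    mask = F.interpolate(mask, size=features.shape[-2:],
                         mode="bilinear", align_corners=False)
    # Average over spatial positions -> one C-dim prototype per image.
    return (features * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)

def fuse_prototypes(support_proto, query_proto, alpha=0.5):
    # A convex combination fuses both sides without adding parameters.
    return alpha * support_proto + (1.0 - alpha) * query_proto

def segment_by_similarity(query_features, proto):
    # Cosine similarity between each query location and the prototype
    # gives a dense foreground score map of shape (B, H, W).
    return F.cosine_similarity(query_features, proto[:, :, None, None], dim=1)
```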
SimPropNet: Improved Similarity Propagation for Few-shot Image Segmentation
Few-shot segmentation (FSS) methods perform image segmentation for a
particular object class in a target (query) image, using a small set of
(support) image-mask pairs. Recent deep neural network based FSS methods
leverage high-dimensional feature similarity between the foreground features of
the support images and the query image features. In this work, we demonstrate
gaps in the utilization of this similarity information in existing methods, and
present a framework, SimPropNet, to bridge these gaps. We propose to jointly
predict the support and query masks to force the support features to share
characteristics with the query features. We also propose to utilize
similarities in the background regions of the query and support images using a
novel foreground-background attentive fusion mechanism. Our method achieves
state-of-the-art results for one-shot and five-shot segmentation on the
PASCAL-5i dataset. The paper includes detailed analysis and ablation studies
for the proposed improvements and quantitative comparisons with contemporary
methods.
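As a companion to the abstract above, here is a hedged sketch of exploiting background as well as foreground similarity. Pooling one prototype per region and fusing the two similarity maps with a two-way softmax is our simplification; SimPropNet's attentive fusion module is more elaborate.

```python
import torch
import torch.nn.functional as F

def fg_bg_query_mask(query_feats, support_feats, support_mask):
    """query_feats, support_feats: (B, C, H, W); support_mask: (B, 1, H, W)."""
    fg = support_mask
    bg = 1.0 - support_mask
    # One prototype for the support foreground, one for its background.
    proto_fg = (support_feats * fg).sum((2, 3)) / (fg.sum((2, 3)) + 1e-6)
    proto_bg = (support_feats * bg).sum((2, 3)) / (bg.sum((2, 3)) + 1e-6)
    sim_fg = F.cosine_similarity(query_feats, proto_fg[:, :, None, None], dim=1)
    sim_bg = F.cosine_similarity(query_feats, proto_bg[:, :, None, None], dim=1)
    # A two-way softmax fuses the maps into a soft foreground mask.
    logits = torch.stack([sim_bg, sim_fg], dim=1)  # (B, 2, H, W)
    return logits.softmax(dim=1)[:, 1]
```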
Few-Shot Segmentation Propagation with Guided Networks
Learning-based methods for visual segmentation have made progress on
particular types of segmentation tasks, but are limited by the necessary
supervision, the narrow definitions of fixed tasks, and the lack of control
during inference for correcting errors. To remedy the rigidity and annotation
burden of standard approaches, we address the problem of few-shot segmentation:
given only a few images and a few pixels of supervision, segment any images
accordingly. We
propose guided networks, which extract a latent task representation from any
amount of supervision, and optimize our architecture end-to-end for fast,
accurate few-shot segmentation. Our method can switch tasks without further
optimization and quickly update when given more guidance. We report the first
results for segmentation from one pixel per concept and show real-time
interactive video segmentation. Our unified approach propagates pixel
annotations across space for interactive segmentation, across time for video
segmentation, and across scenes for semantic segmentation. Our guided segmentor
is state-of-the-art in accuracy for the amount of annotation and time. See
http://github.com/shelhamer/revolver for code, models, and more details.
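A minimal sketch of the guidance idea: pool however much support supervision is given (a single click or a full mask) into a latent task vector, then score query features against it. The channel-wise modulation and the class name here are our assumptions; the paper's guided architecture is richer.

```python
import torch
import torch.nn as nn

class GuidedSegmentor(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.head = nn.Conv2d(channels, 1, kernel_size=1)  # per-pixel logits

    def task_representation(self, support_feats, sparse_ann):
        """sparse_ann: (B, 1, H, W), 1 at annotated positive pixels.
        Works unchanged for one labeled pixel or a dense mask."""
        z = (support_feats * sparse_ann).sum((2, 3))
        return z / (sparse_ann.sum((2, 3)) + 1e-6)  # (B, C) task vector

    def forward(self, query_feats, z):
        # Guide the query by channel-wise modulation with the task vector;
        # switching tasks or adding clicks only recomputes z, with no
        # further optimization.
        return self.head(query_feats * z[:, :, None, None])
```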
Semantic Instance Meets Salient Object: Study on Video Semantic Salient Instance Segmentation
Focusing only on semantic instances that are salient in a scene benefits robot
navigation and self-driving cars more than looking at all objects in the whole
scene. This paper pushes the envelope on salient regions in a
video to decompose them into semantically meaningful components, namely,
semantic salient instances. We provide a baseline for the new task of video
semantic salient instance segmentation (VSSIS): the Semantic Instance -
Salient Object (SISO) framework. The SISO framework is simple yet efficient,
leveraging the advantages of two different segmentation tasks, i.e., semantic
instance segmentation and salient object segmentation, whose outputs it
eventually fuses for the final result. In SISO, we introduce sequential
fusion, which looks at the overlapping pixels between semantic instances and
salient regions to obtain non-overlapping salient instances one by one. We
also introduce recurrent instance propagation to refine the shapes and
semantic meanings of instances, and identity tracking to maintain both the
identity and the semantic meaning of instances over the entire video.
Experimental results demonstrate the
effectiveness of our SISO baseline, which can handle occlusions in videos. In
addition, to tackle the task of VSSIS, we augment the DAVIS-2017 benchmark
dataset by assigning semantic ground-truth for salient instance labels,
obtaining SEmantic Salient Instance Video (SESIV) dataset. Our SESIV dataset
consists of 84 high-quality video sequences with pixel-wise per-frame
ground-truth labels.
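The sequential-fusion step can be sketched as a greedy pass: rank instances by how salient they are, then let each claim only pixels that are salient and still unclaimed, so the surviving instances never overlap. The ranking rule and the min_overlap threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def sequential_fusion(instance_masks, saliency_mask, min_overlap=0.5):
    """instance_masks: list of (H, W) boolean masks from an instance
    segmenter; saliency_mask: (H, W) boolean salient-object mask."""
    def salient_fraction(m):
        return (m & saliency_mask).sum() / max(m.sum(), 1)

    claimed = np.zeros_like(saliency_mask, dtype=bool)
    salient_instances = []
    # Most-salient instances claim their pixels first.
    for m in sorted(instance_masks, key=salient_fraction, reverse=True):
        if salient_fraction(m) < min_overlap:
            break  # remaining instances are not salient enough
        m = m & saliency_mask & ~claimed  # only unclaimed salient pixels
        if m.any():
            salient_instances.append(m)
            claimed |= m
    return salient_instances
```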
Fast Semantic Segmentation on Video Using Block Motion-Based Feature Interpolation
Convolutional networks optimized for accuracy on challenging, dense
prediction tasks are prohibitively slow to run on each frame in a video. The
spatial similarity of nearby video frames, however, suggests opportunity to
reuse computation. Existing work has explored basic feature reuse and feature
warping based on optical flow, but has encountered limits to the speedup
attainable with these techniques. In this paper, we present a new, two-part
approach to accelerating inference on video. First, we propose a fast feature
propagation technique that utilizes the block motion vectors present in
compressed video (e.g. H.264 codecs) to cheaply propagate features from frame
to frame. Second, we develop a novel feature estimation scheme, termed feature
interpolation, that fuses features propagated from enclosing keyframes to
render accurate feature estimates, even at sparse keyframe frequencies. We
evaluate our system on the Cityscapes and CamVid datasets, comparing to both a
frame-by-frame baseline and related work. We find that we are able to
substantially accelerate segmentation on video, achieving near real-time frame
rates (20.1 frames per second) on large images (960 x 720 pixels), while
maintaining competitive accuracy. This represents an improvement of almost 6x
over the single-frame baseline and 2.5x over the fastest prior work.
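The two ingredients can be sketched as follows: warp keyframe features along motion vectors recovered from the compressed bitstream, then blend the warps from the two enclosing keyframes by temporal distance. Treating the (upsampled) block motion vectors as a dense per-pixel flow, and the linear blend, are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def warp_features(feats, flow):
    """feats: (B, C, H, W) keyframe features; flow: (B, 2, H, W) motion
    vectors in pixels, e.g. H.264 macroblock vectors upsampled to H x W."""
    B, _, H, W = feats.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().to(feats)   # (H, W, 2)
    grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)      # follow motion
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(feats, grid, align_corners=True)

def interpolate_features(prev_warp, next_warp, t, t_prev, t_next):
    # Feature interpolation: the nearer keyframe contributes more.
    w = (t_next - t) / float(t_next - t_prev)
    return w * prev_warp + (1.0 - w) * next_warp
```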
Data augmentation using learned transformations for one-shot medical image segmentation
Image segmentation is an important task in many medical applications. Methods
based on convolutional neural networks attain state-of-the-art accuracy;
however, they typically rely on supervised training with large labeled
datasets. Labeling medical images requires significant expertise and time, and
typical hand-tuned approaches for data augmentation fail to capture the complex
variations in such images.
We present an automated data augmentation method for synthesizing labeled
medical images. We demonstrate our method on the task of segmenting magnetic
resonance imaging (MRI) brain scans. Our method requires only a single
segmented scan, and leverages other unlabeled scans in a semi-supervised
approach. We learn a model of transformations from the images, and use the
model along with the labeled example to synthesize additional labeled examples.
Each transformation comprises a spatial deformation field and an
intensity change, enabling the synthesis of complex effects such as variations
in anatomy and image acquisition procedures. We show that training a supervised
segmenter with these new examples provides significant improvements over
state-of-the-art methods for one-shot biomedical image segmentation. Our code
is available at https://github.com/xamyzhao/brainstorm.
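A minimal sketch of the synthesis step described above: apply a sampled spatial deformation to the labeled scan and its segmentation together, and an intensity change to the scan alone, yielding a new labeled example. It is written in 2D for brevity (the paper operates on 3D MRI volumes), and the function signature is our assumption.

```python
import torch
import torch.nn.functional as F

def synthesize_example(atlas_img, atlas_labels, flow, intensity_delta):
    """atlas_img: (1, 1, H, W); atlas_labels: (1, K, H, W) one-hot;
    flow: (1, 2, H, W) sampled deformation field, in pixels;
    intensity_delta: (1, 1, H, W) sampled intensity change."""
    _, _, H, W = atlas_img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float()[None].to(atlas_img)
    grid = base + flow.permute(0, 2, 3, 1)
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    # The same deformation warps image and labels, so they stay aligned;
    # nearest-neighbor sampling keeps the label maps discrete.
    img = F.grid_sample(atlas_img, grid, align_corners=True)
    labels = F.grid_sample(atlas_labels, grid, mode="nearest",
                           align_corners=True)
    # The intensity change applies to the image only, never to the labels.
    return img + intensity_delta, labels
```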
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper gives the futuristic challenges discussed in the cvpaper.challenge.
In 2015 and 2016, we thoroughly studied 1,600+ papers in several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
Boundary-Aware Network for Fast and High-Accuracy Portrait Segmentation
Compared with other semantic segmentation tasks, portrait segmentation
requires both higher precision and faster inference speed. However, this
problem has not been well studied in previous works. In this paper, we propose
a lightweight network architecture, called Boundary-Aware Network (BANet),
which selectively extracts detail information in the boundary area to produce
high-quality segmentation output at real-time (>25 FPS) speed. In addition, we
design a new loss function, called refine loss, which supervises the network
with image-level gradient information. Our model is able to produce finer
segmentation results with richer details than the annotations.
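The abstract only names the refine loss; one plausible reading, sketched below, compares Sobel gradients of the predicted and ground-truth masks so supervision concentrates on boundaries. The exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
_SOBEL_Y = _SOBEL_X.t()

def spatial_gradient(x):
    """x: (B, 1, H, W) soft mask -> gradient magnitude of the same shape."""
    kx = _SOBEL_X.view(1, 1, 3, 3).to(x)
    ky = _SOBEL_Y.view(1, 1, 3, 3).to(x)
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def refine_loss(pred_mask, gt_mask):
    # An L1 distance between edge maps penalizes blurry or misplaced
    # boundaries while flat regions contribute almost nothing.
    return (spatial_gradient(pred_mask) - spatial_gradient(gt_mask)).abs().mean()
```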
Using Mise-En-Scène Visual Features based on MPEG-7 and Deep Learning for Movie Recommendation
Item features play an important role in movie recommender systems, where
recommendations can be generated by using explicit or implicit preferences of
users on traditional features (attributes) such as tag, genre, and cast.
Typically, movie features are human-generated, either editorially (e.g., genre
and cast) or by leveraging the wisdom of the crowd (e.g., tag), and as such,
they are prone to noise and are expensive to collect. Moreover, these features
are often rare or absent for new items, making it difficult or even impossible
to provide good quality recommendations.
In this paper, we show that users' preferences on movies can be better
described in terms of mise-en-scène features, i.e., the visual aspects of a
movie that characterize design, aesthetics and style (e.g., colors, textures).
We use both MPEG-7 visual descriptors and Deep Learning hidden layers as
examples of mise-en-scène features that can visually describe movies.
Interestingly, mise-en-scène features can be computed automatically
from video files or even from trailers, offering more flexibility in handling
new items, avoiding the need for costly and error-prone human-based tagging,
and providing good scalability.
We have conducted a set of experiments on a large catalogue of 4K movies.
Results show that recommendations based on mise-en-scène features
consistently outperform those based on richer sets of more traditional
features, such as genre and tag.
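As a sketch of computing such features automatically, the snippet below embeds sampled trailer frames with a pretrained CNN's hidden layer and averages them into one item vector; cosine similarity between vectors then drives content-based recommendation. The choice of ResNet-18 and all names here are our stand-ins, not the paper's exact pipeline, and the MPEG-7 descriptors would need a dedicated extractor.

```python
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
encoder.eval()

@torch.no_grad()
def movie_embedding(frames):
    """frames: (N, 3, 224, 224) normalized frames sampled from a trailer."""
    feats = encoder(frames).flatten(1)  # (N, 512) hidden-layer features
    return feats.mean(dim=0)            # one mise-en-scene vector per movie

def content_similarity(item_a, item_b):
    # Rank candidate movies for a user by similarity to movies they liked.
    return torch.nn.functional.cosine_similarity(item_a, item_b, dim=0)
```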
Class label autoencoder for zero-shot learning
Existing zero-shot learning (ZSL) methods usually learn a projection function
between a feature space and a semantic embedding space (a text or attribute
space) on the training seen classes or the testing unseen classes. However,
such a projection function cannot be used between the feature space and
multiple semantic embedding spaces, which differ in how they describe the
semantic information of the same class. To deal with this issue, we present a
novel method for ZSL based on learning a class label autoencoder (CLA). CLA
not only builds a uniform framework that adapts to multiple semantic embedding
spaces, but also constructs an encoder-decoder mechanism that constrains the
bidirectional projection between the feature space and the class label space.
Moreover, CLA can jointly consider the relationship of the feature classes and
the relevance of the semantic classes to improve zero-shot classification. The
CLA solution provides both unseen class labels and the relations among the
different class representations (feature or semantic information), which can
encode the intrinsic structure of the classes. Extensive experiments
demonstrate that CLA outperforms state-of-the-art methods on four benchmark
datasets: AwA, CUB, Dogs and ImNet-2.
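The encoder-decoder constraint between the feature space and the class label space can be written compactly; the sketch below uses tied weights, in the style of a semantic autoencoder, which is our reading of the abstract rather than the paper's exact objective, and lam is an illustrative hyperparameter.

```python
import torch

def cla_loss(W, X, Y, lam=0.1):
    """W: (k, d) projection; X: (d, n) features; Y: (k, n) one-hot labels.

    The encoder term ties features to class labels; the decoder term
    forces the projection to be reversible, constraining it in both
    directions.
    """
    encode = W @ X           # feature space -> class label space
    decode = W.t() @ encode  # class label space -> feature space
    return ((encode - Y) ** 2).sum() + lam * ((decode - X) ** 2).sum()

def predict(W, x, class_vectors):
    # At test time, project a feature into label space and return the
    # index of the nearest class representation (unseen classes included).
    scores = class_vectors @ (W @ x)  # (num_classes,) dot-product scores
    return int(scores.argmax())
```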