Spatiotemporal CNN for Video Object Segmentation
In this paper, we present a unified, end-to-end trainable spatiotemporal CNN
model for VOS, which consists of two branches, i.e., the temporal coherence
branch and the spatial segmentation branch. Specifically, the temporal
coherence branch, pretrained in an adversarial fashion on unlabeled video data, is designed to capture the dynamic appearance and motion cues of video
sequences to guide object segmentation. The spatial segmentation branch focuses
on segmenting objects accurately based on the learned appearance and motion
cues. To obtain accurate segmentation results, we design a coarse-to-fine
process that sequentially applies a specially designed attention module to multi-scale feature maps and concatenates them to produce the final prediction. In this way, the spatial segmentation branch is forced to gradually concentrate on
object regions. These two branches are jointly fine-tuned on video segmentation
sequences in an end-to-end manner. Several experiments are carried out on three
challenging datasets (i.e., DAVIS-2016, DAVIS-2017 and Youtube-Objects) to show that our method achieves favorable performance against state-of-the-art methods.
Code is available at https://github.com/longyin880815/STCNN.
Comment: 10 pages, 3 figures, 6 tables, CVPR 2019
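As a rough illustration of the coarse-to-fine attention idea summarized above, the sketch below applies an attention map at each of several feature scales, upsamples, concatenates, and predicts a mask. It is not the authors' STCNN implementation: the module names, channel sizes and number of scales are assumptions made for illustration.

# Minimal sketch of a coarse-to-fine attention head over multi-scale features.
# All module names, channel sizes and the number of scales are illustrative
# assumptions; see the paper and repository for the actual STCNN architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Predicts a per-pixel attention map and reweights the feature map."""
    def __init__(self, channels):
        super().__init__()
        self.att = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):
        weight = torch.sigmoid(self.att(feat))      # (B, 1, H, W) in [0, 1]
        return feat * weight                        # emphasize object regions

class CoarseToFineHead(nn.Module):
    """Attends at each scale, upsamples, concatenates, and predicts a mask."""
    def __init__(self, channels=(256, 128, 64)):
        super().__init__()
        self.attentions = nn.ModuleList([SpatialAttention(c) for c in channels])
        self.classifier = nn.Conv2d(sum(channels), 1, kernel_size=1)

    def forward(self, feats):
        # feats: list of coarse-to-fine feature maps, e.g. strides 16, 8, 4
        target_size = feats[-1].shape[-2:]
        refined = []
        for att, f in zip(self.attentions, feats):
            f = att(f)                                              # attend at this scale
            f = F.interpolate(f, size=target_size, mode="bilinear",
                              align_corners=False)                  # bring to finest scale
            refined.append(f)
        fused = torch.cat(refined, dim=1)                           # concatenate all scales
        return torch.sigmoid(self.classifier(fused))                # final soft mask

if __name__ == "__main__":
    feats = [torch.randn(1, 256, 16, 16),
             torch.randn(1, 128, 32, 32),
             torch.randn(1, 64, 64, 64)]
    mask = CoarseToFineHead()(feats)
    print(mask.shape)  # torch.Size([1, 1, 64, 64])
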
MAIN: Multi-Attention Instance Network for Video Segmentation
Instance-level video segmentation requires a solid integration of spatial and
temporal information. However, current methods rely mostly on domain-specific
information (online learning) to produce accurate instance-level segmentations.
We propose a novel approach that relies exclusively on the integration of
generic spatio-temporal attention cues. Our strategy, named Multi-Attention
Instance Network (MAIN), overcomes challenging segmentation scenarios over
arbitrary videos without modelling sequence- or instance-specific knowledge. We
design MAIN to segment multiple instances in a single forward pass, and
optimize it with a novel loss function that favors class agnostic predictions
and assigns instance-specific penalties. We achieve state-of-the-art
performance on the challenging Youtube-VOS dataset and benchmark, improving the
unseen Jaccard and F-Metric by 6.8% and 12.7% respectively, while operating at
real-time speed (30.3 FPS).
Detecting Temporally Consistent Objects in Videos through Object Class Label Propagation
Object proposals for detecting moving or static video objects need to address
issues such as speed, memory complexity and temporal consistency. We propose an
efficient Video Object Proposal (VOP) generation method and show its efficacy
in learning a better video object detector. A deep-learning based video object
detector learned using the proposed VOP achieves state-of-the-art detection
performance on the Youtube-Objects dataset. We further propose a clustering of
VOPs which can efficiently be used for detecting objects in video in a
streaming fashion. As opposed to applying per-frame convolutional neural
network (CNN) based object detection, our proposed method called Objects in
Video Enabler thRough LAbel Propagation (OVERLAP) needs to classify only a
small fraction of all candidate proposals in every video frame through
streaming clustering of object proposals and class-label propagation. Source
code will be made available soon.Comment: Accepted for publication in WACV 201
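The classify-few, propagate-many idea can be sketched as follows. This is only a hedged approximation of the described pipeline: the greedy IoU-based streaming clustering, the threshold and the classify_fn interface are illustrative assumptions, not the paper's exact method.

# Hedged sketch of the classify-few / propagate-many idea: cluster proposals
# across frames in a streaming fashion and run the expensive classifier only
# when a proposal cannot be matched to an existing cluster.
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def stream_cluster_and_propagate(frames, classify_fn, iou_thr=0.5):
    """frames: list of lists of boxes; classify_fn(box) -> class label.

    A proposal joins the cluster of the most overlapping proposal from the
    previous frame and inherits its label; otherwise it starts a new cluster,
    and only then is the classifier invoked."""
    clusters = []                      # each cluster: {"box": last box, "label": class}
    labels_per_frame = []
    for boxes in frames:
        labels, new_clusters = [], []
        for box in boxes:
            best, best_iou = None, iou_thr
            for c in clusters:
                o = iou(box, c["box"])
                if o > best_iou:
                    best, best_iou = c, o
            if best is None:                      # unseen object: classify once
                best = {"label": classify_fn(box)}
            best = {"box": box, "label": best["label"]}
            new_clusters.append(best)
            labels.append(best["label"])          # label propagated, no new CNN call
        clusters = new_clusters
        labels_per_frame.append(labels)
    return labels_per_frame

if __name__ == "__main__":
    frames = [[(10, 10, 50, 50)], [(12, 11, 52, 49)], [(15, 12, 55, 50)]]
    print(stream_cluster_and_propagate(frames, classify_fn=lambda b: "car"))
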
Video Object Segmentation with Language Referring Expressions
Most state-of-the-art semi-supervised video object segmentation methods rely
on a pixel-accurate mask of a target object provided for the first frame of a
video. However, obtaining a detailed segmentation mask is expensive and
time-consuming. In this work we explore an alternative way of identifying a
target object, namely by employing language referring expressions. Besides
being a more practical and natural way of pointing out a target object, using
language specifications can help to avoid drift as well as make the system more
robust to complex dynamics and appearance variations. Leveraging recent
advances of language grounding models designed for images, we propose an
approach to extend them to video data, ensuring temporally coherent
predictions. To evaluate our method we augment the popular video object
segmentation benchmarks, DAVIS'16 and DAVIS'17 with language descriptions of
target objects. We show that our language-supervised approach performs on par with methods that have access to a pixel-level mask of the target object on DAVIS'16 and is competitive with methods using scribbles on the challenging
DAVIS'17 dataset.
Comment: ACCV 2018: 14th Asian Conference on Computer Vision
MaskRNN: Instance Level Video Object Segmentation
Instance level video object segmentation is an important technique for video
editing and compression. To capture the temporal coherence, in this paper, we
develop MaskRNN, a recurrent neural net approach which fuses in each frame the
output of two deep nets for each object instance -- a binary segmentation net
providing a mask and a localization net providing a bounding box. Due to the
recurrent component and the localization component, our method is able to take
advantage of long-term temporal structures of the video data as well as to reject outliers. We validate the proposed algorithm on three challenging
benchmark datasets, the DAVIS-2016 dataset, the DAVIS-2017 dataset, and the
Segtrack v2 dataset, achieving state-of-the-art performance on all of them.
Comment: Accepted to NIPS 2017
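To make the per-frame fusion of the two nets' outputs concrete, the sketch below suppresses mask probabilities outside the predicted bounding box, which is one simple way localization can reject outliers. It is an illustrative assumption, not the paper's exact fusion rule, and it omits the recurrent component.

# Hedged sketch of fusing a per-frame soft segmentation mask with a predicted
# bounding box. Restricting the mask to the box is an illustrative choice.
import numpy as np

def fuse_mask_and_box(mask_prob, box, outside_penalty=0.0):
    """mask_prob: (H, W) array of foreground probabilities.
    box: (x1, y1, x2, y2) in pixel coordinates.
    Pixels outside the box are suppressed, discarding spurious responses."""
    fused = np.full_like(mask_prob, outside_penalty)
    x1, y1, x2, y2 = box
    fused[y1:y2, x1:x2] = mask_prob[y1:y2, x1:x2]
    return fused

def segment_video(frame_masks, frame_boxes, threshold=0.5):
    """Fuse mask and box predictions frame by frame and binarize."""
    return [(fuse_mask_and_box(m, b) > threshold).astype(np.uint8)
            for m, b in zip(frame_masks, frame_boxes)]

if __name__ == "__main__":
    mask = np.random.rand(100, 100)
    out = segment_video([mask], [(20, 20, 60, 60)])
    print(out[0].sum(), "foreground pixels inside the box")
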
Tukey-Inspired Video Object Segmentation
We investigate the problem of strictly unsupervised video object
segmentation, i.e., the separation of a primary object from background in video
without a user-provided object mask or any training on an annotated dataset. We
find foreground objects in low-level vision data using a John Tukey-inspired
measure of "outlierness". This Tukey-inspired measure also estimates the
reliability of each data source as video characteristics change (e.g., a camera
starts moving). The proposed method achieves state-of-the-art results for
strictly unsupervised video object segmentation on the challenging DAVIS
dataset. Finally, we use a variant of the Tukey-inspired measure to combine the
output of multiple segmentation methods, including those using supervision
during training, runtime, or both. This collectively more robust method of
segmentation improves the Jaccard measure of its constituent methods by as much
as 28%.
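A standard Tukey-style outlierness score, based on quartiles and the interquartile range, conveys the flavor of the measure described above; the paper's exact formulation and its use for weighting data sources are not reproduced here.

# Hedged sketch of a Tukey-style "outlierness" score: how far a value lies
# beyond Tukey's fences, in units of the interquartile range (IQR).
import numpy as np

def tukey_outlierness(values):
    """Return a non-negative score per value; 0 means inside Tukey's fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = max(q3 - q1, 1e-9)
    above = np.maximum(values - (q3 + 1.5 * iqr), 0.0)
    below = np.maximum((q1 - 1.5 * iqr) - values, 0.0)
    return (above + below) / iqr

if __name__ == "__main__":
    # e.g. per-pixel motion magnitudes: the large value stands out as foreground
    motion = np.array([0.1, 0.2, 0.15, 0.12, 3.5, 0.18])
    print(tukey_outlierness(motion).round(2))
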
Fast and Accurate Online Video Object Segmentation via Tracking Parts
Online video object segmentation is a challenging task, as it requires processing the image sequence in a timely and accurate manner. To segment a target object
through the video, numerous CNN-based methods have been developed that heavily fine-tune on the object mask in the first frame, which is time-consuming for
online applications. In this paper, we propose a fast and accurate video object
segmentation algorithm that can immediately start the segmentation process once
it receives the images. We first utilize a part-based tracking method to deal
with challenging factors such as large deformation, occlusion, and cluttered
background. Based on the tracked bounding boxes of parts, we construct a
region-of-interest segmentation network to generate part masks. Finally, a
similarity-based scoring function is adopted to refine these object parts by
comparing them to the visual information in the first frame. Our method
performs favorably against state-of-the-art algorithms in accuracy on the DAVIS
benchmark dataset, while achieving much faster runtime performance.
Comment: Accepted in CVPR'18 as Spotlight. Code and model are available at https://github.com/JingchunCheng/FAVO
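The similarity-based scoring step can be illustrated with the sketch below, which weights each tracked part's mask by the similarity of its appearance to the corresponding part in the first frame. Using mean-color features and cosine similarity is a simplifying assumption, not the method's actual feature representation.

# Hedged sketch of similarity-based scoring of tracked parts against the
# first frame; the feature choice (mean color) is purely illustrative.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def part_feature(image, mask):
    """Mean color of the masked region as a crude appearance descriptor."""
    pixels = image[mask > 0]
    return pixels.mean(axis=0) if len(pixels) else np.zeros(image.shape[-1])

def refine_parts(image, part_masks, first_frame_feats):
    """Combine part masks, each weighted by similarity to its first-frame part."""
    combined = np.zeros(image.shape[:2], dtype=np.float32)
    for mask, ref_feat in zip(part_masks, first_frame_feats):
        score = cosine_similarity(part_feature(image, mask), ref_feat)
        combined = np.maximum(combined, score * (mask > 0))
    return combined

if __name__ == "__main__":
    img = np.random.rand(64, 64, 3)
    masks = [np.zeros((64, 64)), np.zeros((64, 64))]
    masks[0][10:30, 10:30] = 1
    masks[1][35:55, 35:55] = 1
    refs = [part_feature(img, m) for m in masks]   # pretend these come from frame 1
    print(refine_parts(img, masks, refs).max())    # close to 1.0: parts match well
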
SegFlow: Joint Learning for Video Object Segmentation and Optical Flow
This paper proposes an end-to-end trainable network, SegFlow, for
simultaneously predicting pixel-wise object segmentation and optical flow in
videos. The proposed SegFlow has two branches where useful information of
object segmentation and optical flow is propagated bidirectionally in a unified
framework. The segmentation branch is based on a fully convolutional network,
which has proved effective in image segmentation tasks, and the optical
flow branch takes advantage of the FlowNet model. The unified framework is
trained iteratively offline to learn a generic notion, and fine-tuned online
for specific objects. Extensive experiments on both the video object
segmentation and optical flow datasets demonstrate that introducing optical flow improves segmentation performance and vice versa, and that our method performs favorably against state-of-the-art algorithms.
Comment: Accepted in ICCV'17. Code is available at https://sites.google.com/site/yihsuantsai/research/iccv17-segflo
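A minimal sketch of bidirectional information sharing between a segmentation branch and a flow branch is given below. The layer choices and the point at which features are exchanged are assumptions for illustration and do not reproduce the actual SegFlow architecture.

# Hedged sketch: each branch's head consumes features from both branches,
# so segmentation and optical flow can inform each other.
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.seg_encoder = nn.Conv2d(3, ch, 3, padding=1)
        self.flow_encoder = nn.Conv2d(6, ch, 3, padding=1)   # two stacked frames
        self.seg_head = nn.Conv2d(2 * ch, 1, 1)              # sees both branches
        self.flow_head = nn.Conv2d(2 * ch, 2, 1)             # sees both branches

    def forward(self, frame_t, frame_t1):
        seg_feat = torch.relu(self.seg_encoder(frame_t))
        flow_feat = torch.relu(self.flow_encoder(torch.cat([frame_t, frame_t1], dim=1)))
        # Bidirectional propagation: concatenate features from both branches.
        joint = torch.cat([seg_feat, flow_feat], dim=1)
        mask = torch.sigmoid(self.seg_head(joint))   # (B, 1, H, W) object mask
        flow = self.flow_head(joint)                 # (B, 2, H, W) flow field
        return mask, flow

if __name__ == "__main__":
    f0, f1 = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
    mask, flow = TwoBranchFusion()(f0, f1)
    print(mask.shape, flow.shape)
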
Background Subtraction in Real Applications: Challenges, Current Models and Future Directions
Computer vision applications based on videos often require the detection of
moving objects as their first step. Background subtraction is then applied in order to separate the background from the foreground. In the literature, background subtraction is among the most investigated fields in computer vision, with a large number of publications. Most of them concern the application of mathematical and machine learning models to make the methods more robust to the challenges met in videos. However, the ultimate goal is for the background subtraction methods developed in research to be employed in real applications such as traffic surveillance. Looking at the literature, however, we note that there is often a gap between the methods currently used in real applications and the current methods in fundamental research. In addition, the videos in large-scale evaluation datasets are not exhaustive, in that they cover only part of the full spectrum of challenges met in real applications. In this context, we attempt to provide as exhaustive a survey as possible of real applications that use background subtraction, in order to identify the real challenges met in practice and the background models currently in use, and to provide future directions. Challenges are investigated in terms of camera, foreground objects and environments. In addition, we identify the background models that are actually used in these applications in order to find recent background models that are potentially usable in terms of robustness, time and memory requirements.
Comment: Submitted to Computer Science Review
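For readers unfamiliar with the technique being surveyed, a minimal classical background-subtraction model (a per-pixel running average with thresholding) is sketched below; it illustrates the basic idea only and is far simpler than the models discussed in the survey.

# Minimal classical background-subtraction sketch: per-pixel running-average
# background model with simple thresholding.
import numpy as np

def background_subtraction(frames, alpha=0.05, threshold=30.0):
    """frames: iterable of grayscale images (H, W) as float arrays.
    Yields a binary foreground mask per frame."""
    background = None
    for frame in frames:
        frame = frame.astype(np.float32)
        if background is None:
            background = frame.copy()            # bootstrap with the first frame
        foreground = np.abs(frame - background) > threshold
        # Update the model slowly so moving objects are not absorbed at once.
        background = (1 - alpha) * background + alpha * frame
        yield foreground.astype(np.uint8)

if __name__ == "__main__":
    h, w = 60, 80
    static = 100.0 * np.ones((h, w))
    moving = static.copy()
    moving[20:40, 30:50] = 220.0                 # a bright object enters the scene
    masks = list(background_subtraction([static, static, moving]))
    print(masks[-1].sum(), "foreground pixels detected")
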
Multi-Channel CNN-based Object Detection for Enhanced Situation Awareness
Object detection is critical for automatic military operations. However, the performance of current object detection algorithms falls short of the requirements of military scenarios. This is mainly because objects are hard to detect due to their indistinguishable appearance and the dramatic changes in object size determined by the distance to the detection sensors.
Recent advances in deep learning have achieved promising results in many
challenging tasks. The state-of-the-art in object detection is represented by
convolutional neural networks (CNNs), such as the Fast R-CNN algorithm. These
CNN-based methods improve the detection performance significantly on several
public generic object detection datasets. However, their performance on
detecting small objects or undistinguishable objects in visible spectrum images
is still insufficient. In this study, we propose a novel detection algorithm
for military objects by fusing multi-channel CNNs. We combine spatial, temporal
and thermal information by generating a three-channel image, and these channels are fused as CNN feature maps in an unsupervised manner. The backbone of our object detection framework is the Fast R-CNN algorithm, and we utilize a cross-domain transfer learning technique to fine-tune the CNN model on
generated multi-channel images. In the experiments, we validated the proposed
method with images from the SENSIAC (Military Sensing Information Analysis Centre) database and compared it with the state-of-the-art. The experimental results demonstrated the effectiveness of the proposed method in terms of both accuracy and computational efficiency.
Comment: Published at the Sensors & Electronics Technology (SET) panel Symposium SET-241 on 9th NATO Military Sensing Symposium
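The three-channel construction can be sketched as follows: a grayscale visible frame provides the spatial channel, a frame difference provides the temporal channel, and the thermal image provides the third channel. How the channels are actually computed and normalized in the paper is an assumption here.

# Hedged sketch of building a three-channel input that combines spatial,
# temporal and thermal information for a CNN detector.
import numpy as np

def normalize(channel):
    """Scale a channel to [0, 255] so all modalities share a comparable range."""
    channel = channel.astype(np.float32)
    lo, hi = channel.min(), channel.max()
    return np.zeros_like(channel) if hi == lo else 255.0 * (channel - lo) / (hi - lo)

def make_multichannel_image(gray_t, gray_t_prev, thermal_t):
    """Stack spatial (current grayscale), temporal (frame difference) and
    thermal channels into an (H, W, 3) image."""
    spatial = normalize(gray_t)
    temporal = normalize(np.abs(gray_t.astype(np.float32) - gray_t_prev))
    thermal = normalize(thermal_t)
    return np.stack([spatial, temporal, thermal], axis=-1).astype(np.uint8)

if __name__ == "__main__":
    h, w = 120, 160
    prev = np.random.randint(0, 256, (h, w))
    curr = prev.copy()
    curr[40:60, 70:90] += 50                      # simulated moving, warm target
    heat = np.random.randint(0, 256, (h, w))
    print(make_multichannel_image(curr, prev, heat).shape)  # (120, 160, 3)
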