Semantic Video CNNs through Representation Warping
In this work, we propose a technique to convert CNN models for semantic
segmentation of static images into CNNs for video data. We describe a warping
method that can be used to augment existing architectures with very little
extra computational cost. This module is called NetWarp and we demonstrate its
use for a range of network architectures. The main design principle is to use
optical flow of adjacent frames for warping internal network representations
across time. A key insight of this work is that fast optical flow methods can
be combined with many different CNN architectures for improved performance and
end-to-end training. Experiments validate that the proposed approach incurs
only little extra computational cost, while improving performance, when video
streams are available. We achieve new state-of-the-art results on the CamVid
and Cityscapes benchmark datasets and show consistent improvements over
different baseline networks. Our code and models will be available at
http://segmentation.is.tue.mpg.de
Comment: ICCV 201
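The warping idea in the abstract above can be illustrated with a minimal sketch: features from the previous frame are resampled at positions given by the optical flow, then merged with the current frame's features. This is not the authors' NetWarp module. The function names, the flow convention (each current-frame pixel stores an offset into the previous frame), the border clamping, and the fixed combination weights are all assumptions made for illustration.

```python
import math


def warp_features(prev_feat, flow):
    """Bilinearly sample one channel of previous-frame features along the flow.

    prev_feat: H x W nested list of floats.
    flow: H x W nested list of (dx, dy) offsets; pixel (x, y) in the current
          frame is assumed to correspond to (x + dx, y + dy) in the previous one.
    """
    h, w = len(prev_feat), len(prev_feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Clamp the source position to the feature-map borders (assumed policy).
            sx = min(max(x + dx, 0.0), w - 1.0)
            sy = min(max(y + dy, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            ax, ay = sx - x0, sy - y0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = (1 - ax) * prev_feat[y0][x0] + ax * prev_feat[y0][x1]
            bottom = (1 - ax) * prev_feat[y1][x0] + ax * prev_feat[y1][x1]
            out[y][x] = (1 - ay) * top + ay * bottom
    return out


def combine(curr_feat, warped_prev, w_curr=0.5, w_prev=0.5):
    """Weighted sum of current and warped features (weights are hypothetical)."""
    return [[w_curr * c + w_prev * p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(curr_feat, warped_prev)]


prev = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
shift_right = [[(1.0, 0.0)] * 3 for _ in range(2)]
warped = warp_features(prev, shift_right)
```

In a trainable network the combination weights would typically be learned rather than fixed, and the interpolation would be applied per channel on tensors; the nested-list version only shows the resampling logic.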
Video Propagation Networks
We propose a technique that propagates information forward through video
data. The method is conceptually simple and can be applied to tasks that
require the propagation of structured information, such as semantic labels,
based on video content. We propose a 'Video Propagation Network' that processes
video frames in an adaptive manner. The model is applied online: it propagates
information forward without the need to access future frames. In particular we
combine two components, a temporal bilateral network for dense and video
adaptive filtering, followed by a spatial network to refine features and
increase flexibility. We present experiments on video object segmentation and
semantic video segmentation and show increased performance compared to the
best previous task-specific methods, while having favorable runtime.
Additionally we demonstrate our approach on an example regression task of color
propagation in a grayscale video.
Comment: Appearing in Computer Vision and Pattern Recognition, 2017 (CVPR'17)
Temporally Consistent Multi-Class Video-Object Segmentation with the Video Graph-Shifts Algorithm
We present the Video Graph-Shifts (VGS) approach for efficiently incorporating temporal consistency into MRF energy minimization for multi-class video object segmentation. In contrast to previous methods, our dynamic temporal links avoid the computational overhead of a fully connected spatiotemporal MRF, while still handling uncertainty in the exact inter-frame pixel correspondences. The dynamic temporal links are initialized flexibly to balance speed and accuracy, and are automatically revised whenever a label change (shift) occurs during the energy minimization process. On the benchmark CamVid database and our own wintry driving dataset, we show that VGS effectively reduces temporally inconsistent segmentation, with improvements of up to 5% to 10% for semantic classes with high intra-class variance. Furthermore, VGS processes each frame at pixel resolution in about one second, which provides a practical way of modeling complex probabilistic relationships in videos and solving them in near real-time.
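To make the role of temporal links in the energy concrete, here is a minimal sketch of an MRF energy with unary costs, spatial smoothness terms, and temporal consistency terms. It is not the VGS algorithm: the Potts-style penalties, the weights, and all names are assumptions, and the dynamic revision of temporal links during minimization is not modeled.

```python
def mrf_energy(labels, unary, spatial_edges, temporal_links,
               w_spatial=1.0, w_temporal=1.0):
    """Energy of a labeling under an assumed Potts-style spatiotemporal MRF.

    labels: dict mapping node -> label.
    unary: dict mapping node -> {label: cost}.
    spatial_edges: list of (u, v) node pairs within the same frame.
    temporal_links: list of (u, v) node pairs across adjacent frames.
    """
    # Data term: cost of assigning each node its current label.
    energy = sum(unary[node][label] for node, label in labels.items())
    # Spatial term: penalize neighbouring nodes with different labels.
    energy += w_spatial * sum(labels[u] != labels[v] for u, v in spatial_edges)
    # Temporal term: penalize linked nodes whose labels disagree across frames.
    energy += w_temporal * sum(labels[u] != labels[v] for u, v in temporal_links)
    return energy


# Two nodes in frame 0 and one node in frame 1 (toy data).
unary = {
    ("f0", 0): {"road": 0.1, "car": 0.9},
    ("f0", 1): {"road": 0.2, "car": 0.8},
    ("f1", 0): {"road": 0.6, "car": 0.4},
}
spatial_edges = [(("f0", 0), ("f0", 1))]
temporal_links = [(("f0", 0), ("f1", 0))]

consistent = {("f0", 0): "road", ("f0", 1): "road", ("f1", 0): "road"}
flicker = {("f0", 0): "road", ("f0", 1): "road", ("f1", 0): "car"}

e_consistent = mrf_energy(consistent, unary, spatial_edges, temporal_links)
e_flicker = mrf_energy(flicker, unary, spatial_edges, temporal_links)
```

In this toy case the frame-1 node prefers "car" on its unary cost alone, but the temporal link makes the temporally consistent labeling cheaper overall, which is the effect temporal terms are meant to have.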
Scale-Adaptive Video Understanding.
The recent rise of large-scale, diverse video data has ushered in a new era of high-level video understanding. It is increasingly critical for intelligent systems to extract semantics from videos. In this dissertation, we explore the use of supervoxel hierarchies as a type of video representation for high-level video understanding. The supervoxel hierarchies contain rich multiscale decompositions of video content, where various structures can be found at various levels. However, no single level of scale contains all the desired structures we need. It is essential to adaptively choose the scales for subsequent video analysis. Thus, we present a set of tools to manipulate scales in supervoxel hierarchies including both scale generation and scale selection methods.
In our scale generation work, we evaluate a set of seven supervoxel methods in the context of what we consider to be a good supervoxel for video representation. We address a key limitation that has traditionally prevented supervoxel scale generation on long videos. We do so by proposing an approximation framework for streaming hierarchical scale generation that is able to generate multiscale decompositions for arbitrarily-long videos using constant memory.
Subsequently, we present two scale selection methods that are able to adaptively choose the scales according to application needs. The first method flattens the entire supervoxel hierarchy into a single segmentation that overcomes the limitation induced by trivial selection of a single scale. We show that the selection can be driven by various post hoc feature criteria. The second scale selection method combines the supervoxel hierarchy with a conditional random field for the task of labeling actors and actions in videos. We formulate the scale selection problem and the video labeling problem in a joint framework. Experiments on a novel large-scale video dataset demonstrate the effectiveness of the explicit consideration of scale selection in video understanding.
Aside from the computational methods, we present a visual psychophysical study to quantify how well the actor and action semantics in high-level video understanding are retained in supervoxel hierarchies. The ultimate findings suggest that some semantics are well-retained in the supervoxel hierarchies and can be used for further video analysis.
PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/133202/1/cliangxu_1.pd
Efficient multi-level scene understanding in videos
Automatic video parsing is a key step towards human-level dynamic
scene understanding, and a fundamental problem in computer
vision.
A core issue in video understanding is to infer multiple scene
properties of a video in an efficient and consistent manner. This
thesis addresses the problem of holistic scene understanding from
monocular videos, jointly reasoning about semantic and
geometric scene properties at multiple levels, including
pixelwise annotation of video frames, object instance
segmentation in spatio-temporal domain, and/or scene-level
description in terms of scene categories and layouts.
We focus on four main issues in holistic video understanding:
1) what is the representation for consistent semantic and
geometric parsing of videos? 2) how do we integrate high-level
reasoning (e.g., objects) with pixel-wise video parsing? 3) how
can we do efficient inference for multi-level video
understanding? and 4) what is the representation learning
strategy for efficient/cost-aware scene parsing?
We discuss three multi-level video scene segmentation scenarios
based on different aspects of scene properties and efficiency
requirements. The first case addresses the problem of consistent
geometric and semantic video segmentation for outdoor scenes.
We propose a geometric scene layout representation, or a stage
scene model, to efficiently capture the dependency between the
semantic and geometric labels.
We build a unified conditional random field for joint modeling of
the semantic class, geometric label and the stage representation,
and design an alternating inference algorithm to minimize the
resulting energy function. The second case focuses on the problem
of simultaneous pixel-level and object-level segmentation in
videos. We propose to incorporate foreground object information
into pixel labeling by jointly reasoning about semantic labels of
supervoxels, object instance tracks and geometric relations
between objects. In order to model objects, we take an exemplar
approach based on a small set of object annotations to generate
a set of object proposals. We then design a conditional random
field framework that jointly models the supervoxel labels and
object instance segments. To scale up our method, we develop an
active inference strategy to improve the efficiency of
multi-level video parsing, which adaptively selects an
informative subset of object proposals and performs inference on
the resulting compact model.
The last case explores the problem of learning a flexible
representation for efficient scene labeling. We propose a dynamic
hierarchical model that allows us to achieve flexible trade-offs
between efficiency and accuracy. Our approach incorporates the
cost of feature computation and model inference, and optimizes
the model performance for any given test-time budget. We evaluate
all our methods on several publicly available video and image
semantic segmentation datasets, and demonstrate superior
performance in efficiency and accuracy.
Keywords: Semantic video segmentation, Multi-level scene
understanding, Efficient inference, Cost-aware scene parsin