Bringing Background into the Foreground: Making All Classes Equal in Weakly-supervised Video Semantic Segmentation
Pixel-level annotations are expensive and time-consuming to obtain. Hence,
weak supervision using only image tags could have a significant impact on
semantic segmentation. Recent years have seen great progress in
weakly-supervised semantic segmentation, whether from a single image or from
videos. However, most existing methods are designed to handle a single
background class. In practical applications, such as autonomous navigation, it
is often crucial to reason about multiple background classes. In this paper, we
introduce an approach to doing so by making use of classifier heatmaps. We then
develop a two-stream deep architecture that jointly leverages appearance and
motion, and design a loss based on our heatmaps to train it. Our experiments
demonstrate the benefits of our classifier heatmaps and of our two-stream
architecture on challenging urban scene datasets and on the YouTube-Objects
benchmark, where we obtain state-of-the-art results.
Comment: 11 pages, 4 figures, 7 tables. Accepted at ICCV 2017.
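The abstract does not spell out the architecture; purely as an illustration, the following PyTorch sketch (all module names and the toy backbones are hypothetical, not the authors' code) shows the general shape of a two-stream network that fuses appearance and motion features and is trained against classifier heatmaps used as soft targets:

```python
# Minimal sketch of a two-stream segmentation network supervised by
# classifier heatmaps used as soft pseudo-labels (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamSegNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        def branch(in_ch):  # tiny stand-in for a real backbone trunk
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            )
        self.appearance = branch(3)   # RGB frame stream
        self.motion = branch(2)       # 2-channel optical-flow stream
        self.classifier = nn.Conv2d(128, num_classes, 1)  # fuse + predict

    def forward(self, rgb, flow):
        feats = torch.cat([self.appearance(rgb), self.motion(flow)], dim=1)
        return self.classifier(feats)  # per-class score maps

def heatmap_loss(logits, heatmaps):
    """Cross-entropy against soft targets derived from classifier heatmaps."""
    log_probs = F.log_softmax(logits, dim=1)
    targets = heatmaps / heatmaps.sum(dim=1, keepdim=True).clamp(min=1e-6)
    return -(targets * log_probs).sum(dim=1).mean()

# One training step on dummy data (heatmaps would come from image classifiers).
net = TwoStreamSegNet(num_classes=5)
rgb, flow = torch.randn(1, 3, 64, 64), torch.randn(1, 2, 64, 64)
heatmaps = torch.rand(1, 5, 64, 64)
loss = heatmap_loss(net(rgb, flow), heatmaps)
loss.backward()
```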
Learning monocular 3D reconstruction of articulated categories from motion
Monocular 3D reconstruction of articulated object categories is challenging
due to the lack of training data and the inherent ill-posedness of the problem.
In this work we use video self-supervision, forcing the consistency of
consecutive 3D reconstructions by a motion-based cycle loss. This largely
improves both optimization-based and learning-based 3D mesh reconstruction. We
further introduce an interpretable model of 3D template deformations that
controls a 3D surface through the displacement of a small number of local,
learnable handles. We formulate this operation as a structured layer relying on
mesh-Laplacian regularization and show that it can be trained in an end-to-end
manner. Finally, we introduce a per-sample numerical optimization approach that
jointly optimizes over mesh displacements and cameras within a video, boosting
accuracy both during training and as test-time post-processing. While relying
exclusively on a small set of videos collected per category for supervision, we
obtain state-of-the-art reconstructions with diverse shapes, viewpoints and
textures for multiple articulated object categories.
Comment: For the project website, see https://fkokkinos.github.io/video_3d_reconstruction
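As a rough illustration of the handle-based deformation idea (an assumed simplification, not the paper's structured layer), the PyTorch sketch below propagates a few handle displacements to all mesh vertices through fixed per-vertex weights and applies a mesh-Laplacian smoothness penalty:

```python
# Illustrative handle-based deformation: a small set of handle displacements
# drives all vertices of a template mesh via fixed blending weights.
import torch
import torch.nn as nn

class HandleDeformation(nn.Module):
    def __init__(self, num_vertices: int, num_handles: int):
        super().__init__()
        # Per-vertex weights over handles (rows sum to 1); fixed here,
        # but they could equally be learned.
        w = torch.rand(num_vertices, num_handles)
        self.register_buffer("weights", w / w.sum(dim=1, keepdim=True))

    def forward(self, template, handle_disp):
        # template: (V, 3) rest shape; handle_disp: (H, 3) displacements.
        return template + self.weights @ handle_disp

def laplacian_smoothness(verts, laplacian):
    # laplacian: (V, V) graph Laplacian of the template mesh.
    return (laplacian @ verts).pow(2).sum(dim=1).mean()

V, H = 100, 8
layer = HandleDeformation(V, H)
template = torch.randn(V, 3)
disp = torch.zeros(H, 3, requires_grad=True)  # the quantity being optimized
deformed = layer(template, disp)              # (V, 3) deformed vertices
L = torch.eye(V)  # placeholder; a real mesh Laplacian would be used here
loss = laplacian_smoothness(deformed, L)
loss.backward()
```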
Frame-to-Frame Aggregation of Active Regions in Web Videos for Weakly Supervised Semantic Segmentation
When a deep neural network is trained on data with only image-level labels,
the regions activated in each image tend to cover only a small part of the
target object. We propose a method that uses videos automatically harvested
from the web to identify a larger region of the target object, exploiting
temporal information that is absent from static images. The temporal variations
in a video allow different regions of the target object to be activated. We
obtain an activated region in each frame of a video, and then aggregate the
regions from successive frames into a single image, using a warping technique
based on optical flow. The resulting localization maps cover more of the target
object, and can then be used as proxy ground-truth to train a segmentation
network. This simple approach outperforms existing methods under the same level
of supervision, and even approaches relying on extra annotations. Based on
VGG-16 and ResNet-101 backbones, our method achieves mIoU scores of 65.0 and
67.4, respectively, on the PASCAL VOC 2012 test set, which represents a new
state-of-the-art.
Comment: ICCV 2019.
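To make the aggregation step concrete, here is a minimal PyTorch sketch (function names and fusion by pixel-wise maximum are assumptions; the paper's exact warping details may differ) of warping neighboring frames' activation maps into a reference frame with optical flow and fusing them:

```python
# Warp per-frame activation maps into a reference frame using backward
# optical flow, then fuse them by a pixel-wise maximum (illustrative).
import torch
import torch.nn.functional as F

def warp_with_flow(cam, flow):
    """Warp a (1, C, H, W) map by a (1, 2, H, W) backward flow in pixels."""
    _, _, h, w = cam.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (w - 1) * 2 - 1  # normalize to [-1, 1]
    grid_y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)  # (1, H, W, 2)
    return F.grid_sample(cam, grid, align_corners=True)

def aggregate_cams(ref_cam, neighbor_cams, flows):
    """Fuse the reference CAM with flow-warped CAMs from nearby frames."""
    warped = [warp_with_flow(c, f) for c, f in zip(neighbor_cams, flows)]
    return torch.stack([ref_cam] + warped).max(dim=0).values

# Example on dummy data: two neighboring frames warped into the reference.
ref = torch.rand(1, 1, 32, 32)
nbrs = [torch.rand(1, 1, 32, 32) for _ in range(2)]
flows = [torch.randn(1, 2, 32, 32) for _ in range(2)]
agg = aggregate_cams(ref, nbrs, flows)  # (1, 1, 32, 32) aggregated map
```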
CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos
Weakly supervised video object localization (WSVOL) methods often rely on
visual and motion cues only, making them susceptible to inaccurate
localization. Recently, discriminative models have been explored using a
temporal class activation mapping (CAM) method. Although their results are
promising, these models assume that objects move little from frame to frame,
so performance degrades over longer-term dependencies. In
this paper, a novel CoLo-CAM method for WSVOL is proposed that leverages
spatiotemporal information in activation maps during training without making
assumptions about object position. Given a sequence of frames, localization is
learned jointly across the corresponding maps using color cues, under the
assumption that an object keeps a similar color across adjacent frames. CAM
activations are constrained to respond similarly over pixels with similar
colors, achieving co-localization. This joint learning creates direct
communication among pixels across all image locations and over all frames,
allowing for transfer, aggregation, and correction of learned localization,
leading to better localization performance. This is achieved by minimizing the
color term of a conditional random field (CRF) loss over a sequence of
frames/CAMs. Empirical experiments on two challenging YouTube-Objects datasets
of unconstrained videos show the merits of our method and its
robustness to long-term dependencies, leading to new state-of-the-art
performance for WSVOL.
Comment: 16 pages, 8 figures.
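As a simplified illustration of the color term (a pairwise sketch assuming Gaussian color affinities over randomly sampled pixel pairs, not the paper's exact CRF loss), the following PyTorch snippet pulls together CAM activations at similarly colored pixels across all frames of a clip:

```python
# CRF-style color term over a clip of CAMs: activations at pixel pairs are
# pulled together in proportion to the color similarity of those pixels,
# with pairs sampled across all locations and all frames (illustrative).
import torch

def crf_color_loss(cams, frames, num_pairs=4096, sigma=0.1):
    """cams: (T, H, W) activations in [0, 1]; frames: (T, 3, H, W) RGB."""
    acts = cams.reshape(-1)                       # all pixels, all frames
    cols = frames.permute(0, 2, 3, 1).reshape(-1, 3)
    i = torch.randint(0, acts.numel(), (num_pairs,))
    j = torch.randint(0, acts.numel(), (num_pairs,))
    # Gaussian color affinity: near 1 for similar colors, near 0 otherwise.
    affinity = torch.exp(-((cols[i] - cols[j]) ** 2).sum(1) / (2 * sigma**2))
    # Penalize activation differences only where colors agree.
    return (affinity * (acts[i] - acts[j]) ** 2).mean()

# Example on dummy data for a 4-frame clip.
cams = torch.rand(4, 32, 32, requires_grad=True)
frames = torch.rand(4, 3, 32, 32)  # matching RGB frames in [0, 1]
loss = crf_color_loss(cams, frames)
loss.backward()
```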