Co-attention Propagation Network for Zero-Shot Video Object Segmentation
Zero-shot video object segmentation (ZS-VOS) aims to segment foreground
objects in a video sequence without prior knowledge of these objects. However,
existing ZS-VOS methods often struggle to distinguish between foreground and
background or to keep track of the foreground in complex scenarios. The common
practice of introducing motion information, such as optical flow, can lead to
overreliance on optical flow estimation. To address these challenges, we
propose an encoder-decoder-based hierarchical co-attention propagation network
(HCPN) capable of tracking and segmenting objects. Specifically, our model is
built upon multiple collaborative evolutions of the parallel co-attention
module (PCM) and the cross co-attention module (CCM). PCM captures common
foreground regions among adjacent appearance and motion features, while CCM
further exploits and fuses cross-modal motion features returned by PCM. Our
method is progressively trained to achieve hierarchical spatio-temporal feature
propagation across the entire video. Experimental results demonstrate that our
HCPN outperforms all previous methods on public benchmarks, showcasing its
effectiveness for ZS-VOS.
Comment: Accepted by IEEE Transactions on Image Processing
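The abstract does not give the PCM's equations, but the parallel co-attention it describes, finding common regions between appearance and motion features, is commonly realized with a bilinear affinity matrix followed by row/column softmax attention. The sketch below is a minimal, hypothetical version of that generic mechanism (the function name, the learnable matrix `W`, and the feature shapes are assumptions, not the paper's actual module):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallel_co_attention(app, mot, W):
    """Bilinear co-attention between two feature maps (a sketch, not HCPN's PCM).

    app: (C, N) appearance features, mot: (C, N) motion features,
    where N = H*W flattened spatial locations and W is a learnable (C, C) matrix.
    """
    S = app.T @ W @ mot            # (N, N) affinity between all location pairs
    att_a = softmax(S, axis=1)     # for each appearance location, weights over motion
    att_m = softmax(S, axis=0)     # for each motion location, weights over appearance
    app_enh = mot @ att_a.T        # motion-aware appearance features, (C, N)
    mot_enh = app @ att_m          # appearance-aware motion features, (C, N)
    return app_enh, mot_enh

rng = np.random.default_rng(0)
C, N = 8, 16
app = rng.standard_normal((C, N))
mot = rng.standard_normal((C, N))
W = rng.standard_normal((C, C))
app_enh, mot_enh = parallel_co_attention(app, mot, W)
print(app_enh.shape, mot_enh.shape)  # (8, 16) (8, 16)
```

Regions that respond strongly in both modalities receive large affinity values, which is one way the "common foreground regions among adjacent appearance and motion features" the abstract mentions can be emphasized.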
Bringing Background into the Foreground: Making All Classes Equal in Weakly-supervised Video Semantic Segmentation
Pixel-level annotations are expensive and time-consuming to obtain. Hence,
weak supervision using only image tags could have a significant impact in
semantic segmentation. Recent years have seen great progress in
weakly-supervised semantic segmentation, whether from a single image or from
videos. However, most existing methods are designed to handle a single
background class. In practical applications, such as autonomous navigation, it
is often crucial to reason about multiple background classes. In this paper, we
introduce an approach to doing so by making use of classifier heatmaps. We then
develop a two-stream deep architecture that jointly leverages appearance and
motion, and design a loss based on our heatmaps to train it. Our experiments
demonstrate the benefits of our classifier heatmaps and of our two-stream
architecture on challenging urban scene datasets and on the YouTube-Objects
benchmark, where we obtain state-of-the-art results.
Comment: 11 pages, 4 figures, 7 tables, Accepted in ICCV 201
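The abstract does not specify the form of its heatmap-based loss, but one common way to train a segmentation network from classifier heatmaps is to treat the per-pixel normalized heatmaps as soft pseudo-labels and minimize a cross-entropy against the network's prediction. The sketch below illustrates that generic idea with a simple averaged late fusion of the two streams; the function names, the fusion rule, and all shapes are assumptions, not the paper's actual loss:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def heatmap_loss(logits, heatmaps):
    """Soft cross-entropy against classifier heatmaps (illustrative sketch).

    logits:   (K, H, W) per-class segmentation scores from the fused streams
    heatmaps: (K, H, W) non-negative classifier heatmaps, one per class
    """
    pred = softmax(logits, axis=0)
    target = heatmaps / heatmaps.sum(axis=0, keepdims=True)  # per-pixel distribution
    return -(target * np.log(pred + 1e-8)).mean()

rng = np.random.default_rng(1)
K, H, W = 4, 8, 8
app_logits = rng.standard_normal((K, H, W))  # appearance-stream scores
mot_logits = rng.standard_normal((K, H, W))  # motion-stream scores
fused = 0.5 * (app_logits + mot_logits)      # simple averaged fusion (an assumption)
heatmaps = rng.random((K, H, W))
loss = heatmap_loss(fused, heatmaps)
print(float(loss))
```

Because every class, including each background class, gets its own heatmap channel here, no class is reduced to a single catch-all background label, which is the point the title's "making all classes equal" emphasizes.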