Co-attention Propagation Network for Zero-Shot Video Object Segmentation
Zero-shot video object segmentation (ZS-VOS) aims to segment foreground
objects in a video sequence without prior knowledge of these objects. However,
existing ZS-VOS methods often struggle to distinguish between foreground and
background or to keep track of the foreground in complex scenarios. The common
practice of introducing motion information, such as optical flow, can lead to
overreliance on optical flow estimation. To address these challenges, we
propose an encoder-decoder-based hierarchical co-attention propagation network
(HCPN) capable of tracking and segmenting objects. Specifically, our model is
built upon multiple collaborative evolutions of the parallel co-attention
module (PCM) and the cross co-attention module (CCM). PCM captures common
foreground regions among adjacent appearance and motion features, while CCM
further exploits and fuses cross-modal motion features returned by PCM. Our
method is progressively trained to achieve hierarchical spatio-temporal feature
propagation across the entire video. Experimental results demonstrate that our
HCPN outperforms all previous methods on public benchmarks, showcasing its
effectiveness for ZS-VOS.
Comment: Accepted by IEEE Transactions on Image Processing
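The abstract does not give the PCM's equations, but the parallel co-attention it describes, finding common regions between appearance and motion features, is commonly realized with a bilinear affinity matrix followed by row/column softmax attention. The sketch below is a minimal, hypothetical version of that generic mechanism (the function name, the learnable matrix `W`, and the feature shapes are assumptions, not the paper's actual module):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallel_co_attention(app, mot, W):
    """Bilinear co-attention between two feature maps (a sketch, not HCPN's PCM).

    app: (C, N) appearance features, mot: (C, N) motion features,
    where N = H*W flattened spatial locations and W is a learnable (C, C) matrix.
    """
    S = app.T @ W @ mot            # (N, N) affinity between all location pairs
    att_a = softmax(S, axis=1)     # for each appearance location, weights over motion
    att_m = softmax(S, axis=0)     # for each motion location, weights over appearance
    app_enh = mot @ att_a.T        # motion-aware appearance features, (C, N)
    mot_enh = app @ att_m          # appearance-aware motion features, (C, N)
    return app_enh, mot_enh

rng = np.random.default_rng(0)
C, N = 8, 16
app = rng.standard_normal((C, N))
mot = rng.standard_normal((C, N))
W = rng.standard_normal((C, C))
app_enh, mot_enh = parallel_co_attention(app, mot, W)
print(app_enh.shape, mot_enh.shape)  # (8, 16) (8, 16)
```

Regions that respond strongly in both modalities receive large affinity values, which is one way the "common foreground regions among adjacent appearance and motion features" the abstract mentions can be emphasized.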
Bringing Background into the Foreground: Making All Classes Equal in Weakly-supervised Video Semantic Segmentation
Pixel-level annotations are expensive and time-consuming to obtain. Hence,
weak supervision using only image tags could have a significant impact in
semantic segmentation. Recent years have seen great progress in
weakly-supervised semantic segmentation, whether from a single image or from
videos. However, most existing methods are designed to handle a single
background class. In practical applications, such as autonomous navigation, it
is often crucial to reason about multiple background classes. In this paper, we
introduce an approach to doing so by making use of classifier heatmaps. We then
develop a two-stream deep architecture that jointly leverages appearance and
motion, and design a loss based on our heatmaps to train it. Our experiments
demonstrate the benefits of our classifier heatmaps and of our two-stream
architecture on challenging urban scene datasets and on the YouTube-Objects
benchmark, where we obtain state-of-the-art results.
Comment: 11 pages, 4 figures, 7 tables, Accepted in ICCV 201
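The abstract does not specify the form of its heatmap-based loss, but one common way to train a segmentation network from classifier heatmaps is to treat the per-pixel normalized heatmaps as soft pseudo-labels and minimize a cross-entropy against the network's prediction. The sketch below illustrates that generic idea with a simple averaged late fusion of the two streams; the function names, the fusion rule, and all shapes are assumptions, not the paper's actual loss:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def heatmap_loss(logits, heatmaps):
    """Soft cross-entropy against classifier heatmaps (illustrative sketch).

    logits:   (K, H, W) per-class segmentation scores from the fused streams
    heatmaps: (K, H, W) non-negative classifier heatmaps, one per class
    """
    pred = softmax(logits, axis=0)
    target = heatmaps / heatmaps.sum(axis=0, keepdims=True)  # per-pixel distribution
    return -(target * np.log(pred + 1e-8)).mean()

rng = np.random.default_rng(1)
K, H, W = 4, 8, 8
app_logits = rng.standard_normal((K, H, W))  # appearance-stream scores
mot_logits = rng.standard_normal((K, H, W))  # motion-stream scores
fused = 0.5 * (app_logits + mot_logits)      # simple averaged fusion (an assumption)
heatmaps = rng.random((K, H, W))
loss = heatmap_loss(fused, heatmaps)
print(float(loss))
```

Because every class, including each background class, gets its own heatmap channel here, no class is reduced to a single catch-all background label, which is the point the title's "making all classes equal" emphasizes.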