Adaptive Temporal Encoding Network for Video Instance-level Human Parsing
Beyond the existing single-person and multiple-person human parsing tasks in
static images, this paper makes the first attempt to investigate a more
realistic video instance-level human parsing that simultaneously segments out
each person instance and parses each instance into more fine-grained parts
(e.g., head, leg, dress). We introduce a novel Adaptive Temporal Encoding
Network (ATEN) that alternately performs temporal encoding among key frames
and flow-guided feature propagation for the consecutive frames between two
key frames. Specifically, ATEN first incorporates a Parsing-RCNN to produce the
instance-level parsing result for each key frame, which integrates both the
global human parsing and instance-level human segmentation into a unified
model. To balance accuracy and efficiency, flow-guided feature propagation
is used to directly parse consecutive frames according to their identified
temporal consistency with key frames. On the other hand, ATEN leverages
convolutional gated recurrent units (convGRU) to exploit temporal changes
over a series of key frames, which further facilitate the instance-level
parsing of each key frame. By alternating direct feature propagation between
consistent frames with temporal encoding among key frames, our ATEN achieves
a good balance between frame-level accuracy and time efficiency, a crucial
trade-off throughout video object segmentation
research. To demonstrate the superiority of our ATEN, extensive experiments are
conducted on the most popular video segmentation benchmark (DAVIS) and a newly
collected Video Instance-level Parsing (VIP) dataset, which is the first video
instance-level human parsing dataset, comprising 404 sequences and over 20k
frames with instance-level and pixel-wise annotations.
Comment: To appear in ACM MM 2018. Code link:
https://github.com/HCPLab-SYSU/ATEN. Dataset link: http://sysu-hcp.net/li
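A minimal sketch of the convGRU building block named above, assuming PyTorch; the channel widths, kernel size, and gating convention are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: a GRU whose gates are 2-D convolutions,
    so the hidden state keeps its spatial layout."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        # x: (N, C_in, H, W) key-frame features; h: (N, C_hid, H, W) state.
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde  # blend old state with candidate

# Run the cell over a short series of key-frame feature maps.
cell = ConvGRUCell(256, 256)
h = torch.zeros(1, 256, 32, 32)
for feat in torch.randn(5, 1, 256, 32, 32):  # 5 key frames
    h = cell(feat, h)  # h now encodes temporal change across key frames
```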
Geometry meets semantics for semi-supervised monocular depth estimation
Depth estimation from a single image represents a very exciting challenge in
computer vision. While other image-based depth sensing techniques leverage
the geometry between different viewpoints (e.g., stereo or structure from
motion), the lack of these cues within a single image renders the monocular
depth estimation task ill-posed. At inference time, state-of-the-art
encoder-decoder architectures for monocular depth estimation rely on effective
feature representations learned at training time. For unsupervised training of
these models, geometry has been effectively exploited through suitable image
warping losses computed from views acquired by a stereo rig or a moving camera.
In this paper, we take a further step forward, showing that learning semantic
information from images also effectively improves monocular depth
estimation. In particular, by leveraging semantically labeled images
together with unsupervised signals gained by geometry through an image warping
loss, we propose a deep learning approach aimed at joint semantic segmentation
and depth estimation. Our overall learning framework is semi-supervised, as we
deploy ground-truth data only in the semantic domain. At training time, our
network learns a common feature representation for both tasks, and we propose
a novel cross-task loss function. The experimental findings show how jointly
tackling depth prediction and semantic segmentation improves depth estimation
accuracy. In particular, on the KITTI dataset our network outperforms
state-of-the-art methods for monocular depth estimation.
Comment: 16 pages, Accepted to ACCV 2018
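A minimal sketch of such a semi-supervised objective, assuming PyTorch and a rectified stereo pair: a supervised cross-entropy term on semantic labels plus an unsupervised photometric warping term for disparity. Function names, shapes, and loss weights are assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def photometric_loss(left, right, disp):
    """Reconstruct the left view by sampling the right one along the predicted
    disparity (N, 1, H, W, in pixels), then penalize the L1 difference."""
    n, _, h, w = left.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=left.device, dtype=left.dtype),
        torch.arange(w, device=left.device, dtype=left.dtype),
        indexing="ij",
    )
    # Shift x coordinates by the disparity; normalize both axes to [-1, 1].
    grid_x = 2.0 * (xs.unsqueeze(0) - disp[:, 0]) / (w - 1) - 1.0
    grid_y = (2.0 * ys / (h - 1) - 1.0).unsqueeze(0).expand(n, -1, -1)
    warped = F.grid_sample(right, torch.stack((grid_x, grid_y), dim=-1),
                           align_corners=True)
    return (warped - left).abs().mean()

def joint_loss(sem_logits, sem_gt, left, right, disp, w_sem=1.0, w_depth=1.0):
    # Ground truth is deployed only in the semantic domain (semi-supervised):
    # depth is driven purely by the unsupervised image warping term.
    return (w_sem * F.cross_entropy(sem_logits, sem_gt)
            + w_depth * photometric_loss(left, right, disp))

# Example with random tensors standing in for network outputs and stereo pairs.
loss = joint_loss(torch.randn(2, 19, 64, 128),
                  torch.randint(0, 19, (2, 64, 128)),
                  torch.rand(2, 3, 64, 128), torch.rand(2, 3, 64, 128),
                  torch.rand(2, 1, 64, 128))
```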
Mask Propagation for Efficient Video Semantic Segmentation
Video Semantic Segmentation (VSS) involves assigning a semantic label to each
pixel in a video sequence. Prior work in this field has demonstrated promising
results by extending image semantic segmentation models to exploit temporal
relationships across video frames; however, these approaches often incur
significant computational costs. In this paper, we propose an efficient mask
propagation framework for VSS, called MPVSS. Our approach first employs a
strong query-based image segmentor on sparse key frames to generate accurate
binary masks and class predictions. We then design a flow estimation module
utilizing the learned queries to generate a set of segment-aware flow maps,
each associated with a mask prediction from the key frame. Finally, the
mask-flow pairs are warped to serve as the mask predictions for the non-key
frames. By reusing predictions from key frames, we circumvent the need to
process a large volume of video frames individually with resource-intensive
segmentors, alleviating temporal redundancy and significantly reducing
computational costs. Extensive experiments on VSPW and Cityscapes demonstrate
that our mask propagation framework achieves SOTA accuracy and efficiency
trade-offs. For instance, our best model with Swin-L backbone outperforms the
SOTA MRCFA using MiT-B5 by 4.0% mIoU, requiring only 26% FLOPs on the VSPW
dataset. Moreover, our framework reduces FLOPs by up to 4x compared to the
per-frame Mask2Former baseline, with only up to 2% mIoU degradation on the
Cityscapes validation set. Code is available at
https://github.com/ziplab/MPVSS.
Comment: NeurIPS 2023
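A minimal sketch of the mask-warping step, assuming PyTorch; the shapes, names, and bilinear-sampling details are assumptions rather than the released MPVSS code:

```python
import torch
import torch.nn.functional as F

def propagate_masks(key_masks, flows):
    """key_masks: (Q, H, W) per-query mask logits from a key frame.
    flows: (Q, 2, H, W) segment-aware flow maps, one per mask, in pixels.
    Returns the warped mask logits used as the non-key-frame prediction."""
    q, h, w = key_masks.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=flows.dtype),
        torch.arange(w, dtype=flows.dtype),
        indexing="ij",
    )
    # Displace the sampling grid by each mask's own flow; normalize to [-1, 1].
    grid_x = 2.0 * (xs + flows[:, 0]) / (w - 1) - 1.0
    grid_y = 2.0 * (ys + flows[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)          # (Q, H, W, 2)
    warped = F.grid_sample(key_masks.unsqueeze(1), grid,  # Q acts as batch
                           align_corners=True)
    return warped.squeeze(1)

# 100 query masks warped to a non-key frame without rerunning the segmentor.
masks = propagate_masks(torch.randn(100, 60, 108), torch.zeros(100, 2, 60, 108))
```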
Map-Guided Curriculum Domain Adaptation and Uncertainty-Aware Evaluation for Semantic Nighttime Image Segmentation
We address the problem of semantic nighttime image segmentation and improve
the state-of-the-art, by adapting daytime models to nighttime without using
nighttime annotations. Moreover, we design a new evaluation framework to
address the substantial uncertainty of semantics in nighttime images. Our
central contributions are: 1) a curriculum framework to gradually adapt
semantic segmentation models from day to night through progressively darker
times of day, exploiting cross-time-of-day correspondences between daytime
images from a reference map and dark images to guide the label inference in the
dark domains; 2) a novel uncertainty-aware annotation and evaluation framework
and metric for semantic segmentation, including image regions beyond human
recognition capability in the evaluation in a principled fashion; 3) the Dark
Zurich dataset, comprising 2416 unlabeled nighttime and 2920 unlabeled twilight
images with correspondences to their daytime counterparts plus a set of 201
nighttime images with fine pixel-level annotations created with our protocol,
which serves as a first benchmark for our novel evaluation. Experiments show
that our map-guided curriculum adaptation significantly outperforms
state-of-the-art methods on nighttime sets both for standard metrics and our
uncertainty-aware metric. Furthermore, our uncertainty-aware evaluation reveals
that selective invalidation of predictions can improve results on data with
ambiguous content such as our benchmark, and benefit safety-oriented
applications involving invalid inputs.
Comment: IEEE T-PAMI 2020
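A minimal sketch of such selective invalidation, assuming the max softmax probability is used as confidence; the threshold and the INVALID label value are assumptions, not the paper's protocol:

```python
import torch
import torch.nn.functional as F

INVALID = 255  # assumed "invalid" label, mirroring common ignore indices

def predict_with_invalidation(logits, threshold=0.5):
    """logits: (N, C, H, W). Returns per-pixel labels, marking pixels whose
    max softmax probability falls below the threshold as INVALID rather than
    forcing a semantic class on content beyond recognition."""
    conf, labels = F.softmax(logits, dim=1).max(dim=1)
    labels[conf < threshold] = INVALID
    return labels

labels = predict_with_invalidation(torch.randn(1, 19, 64, 128))
```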
Time-Space Transformers for Video Panoptic Segmentation
We propose a novel solution for the task of video panoptic segmentation that
simultaneously predicts pixel-level semantic and instance segmentation and
generates clip-level instance tracks. Our network, named VPS-Transformer, is a
hybrid architecture based on the state-of-the-art panoptic segmentation
network Panoptic-DeepLab: it combines a convolutional architecture for
single-frame panoptic segmentation with a novel video module based on an
instantiation of the pure Transformer block. The Transformer, equipped with
attention mechanisms, models spatio-temporal relations between backbone output
features of current and past frames for more accurate and consistent panoptic
estimates. As the pure Transformer block introduces a large computational
overhead when processing high-resolution images, we propose a few design
changes for more efficient computation. We study how to aggregate information more effectively
over the space-time volume and we compare several variants of the Transformer
block with different attention schemes. Extensive experiments on the
Cityscapes-VPS dataset demonstrate that our best model improves the temporal
consistency and video panoptic quality by a margin of 2.2%, with little extra
computation.
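A minimal sketch of a video module of this kind, assuming PyTorch: current-frame tokens attend over the concatenated current and past frame features. The residual layout, head count, and shapes are illustrative assumptions, not the paper's exact block:

```python
import torch
import torch.nn as nn

class TimeSpaceAttention(nn.Module):
    """Current-frame tokens attend over current and past backbone features."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur, past):
        # cur, past: (N, C, H, W) backbone features of current / past frame.
        n, c, h, w = cur.shape
        q = cur.flatten(2).transpose(1, 2)                 # (N, H*W, C)
        kv = torch.cat([q, past.flatten(2).transpose(1, 2)], dim=1)
        out, _ = self.attn(q, kv, kv)   # attention over the space-time volume
        out = self.norm(q + out)        # residual connection + layer norm
        return out.transpose(1, 2).reshape(n, c, h, w)

# Applied on downsampled backbone outputs to keep the attention cost moderate.
att = TimeSpaceAttention(256)
y = att(torch.randn(1, 256, 16, 32), torch.randn(1, 256, 16, 32))
```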
GPS-GLASS: Learning Nighttime Semantic Segmentation Using Daytime Video and GPS data
Semantic segmentation for autonomous driving should be robust against various
in-the-wild environments. Nighttime semantic segmentation is especially
challenging due to a lack of annotated nighttime images and a large domain gap
from daytime images with sufficient annotation. In this paper, we propose a
novel GPS-based training framework for nighttime semantic segmentation. Given
GPS-aligned pairs of daytime and nighttime images, we perform cross-domain
correspondence matching to obtain pixel-level pseudo supervision. Moreover, we
conduct flow estimation between daytime video frames and apply GPS-based
scaling to acquire another form of pixel-level pseudo supervision. Using these
pseudo-supervision signals together with a confidence map, we train a nighttime
semantic segmentation
network without any annotation from nighttime images. Experimental results
demonstrate the effectiveness of the proposed method on several nighttime
semantic segmentation datasets. Our source code is available at
https://github.com/jimmy9704/GPS-GLASS.
Comment: ICCVW 2023
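A minimal sketch of training against confidence-weighted pixel-level pseudo supervision, assuming PyTorch; the weighting scheme and names are assumptions rather than the released GPS-GLASS implementation:

```python
import torch
import torch.nn.functional as F

def pseudo_supervised_loss(logits, pseudo_labels, confidence, ignore=255):
    """logits: (N, C, H, W) nighttime predictions. pseudo_labels: (N, H, W)
    labels transferred from the daytime counterparts. confidence: (N, H, W)
    in [0, 1], e.g. derived from the correspondence/flow matching quality."""
    per_pixel = F.cross_entropy(logits, pseudo_labels,
                                ignore_index=ignore, reduction="none")
    valid = (pseudo_labels != ignore).float()
    # Down-weight unreliable matches so noisy pseudo labels contribute less.
    return (confidence * per_pixel).sum() / (confidence * valid).sum().clamp(min=1.0)

# Example with random stand-ins for predictions, pseudo labels and confidence.
loss = pseudo_supervised_loss(torch.randn(2, 19, 64, 128),
                              torch.randint(0, 19, (2, 64, 128)),
                              torch.rand(2, 64, 128))
```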