Robust Semantic Segmentation with Ladder-DenseNet Models
We present semantic segmentation experiments with a model capable of performing
predictions on four benchmark datasets: Cityscapes, ScanNet, WildDash and
KITTI. We employ a ladder-style convolutional architecture featuring a modified
DenseNet-169 model in the downsampling datapath, and only one convolution in
each stage of the upsampling datapath. Due to limited computing resources, we
perform the training only on Cityscapes Fine train+val, ScanNet train, WildDash
val and KITTI train. We evaluate the trained model on the test subsets of the
four benchmarks in accordance with the guidelines of the Robust Vision
Challenge ROB 2018. The experiments reveal several interesting findings,
which we describe and discuss.
Comment: 4 pages, 4 figures, CVPR 2018 Robust Vision Challenge Workshop
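To make the ladder-style decoder concrete, below is a minimal PyTorch sketch of
one upsampling stage with exactly one convolution, as the abstract describes.
The module name, fusion scheme, and channel widths are our own illustration,
not the Ladder-DenseNet implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class LadderUpsampleStage(nn.Module):
    """One decoder stage: upsample the deep feature, fuse the skip
    connection from the downsampling datapath, apply a single 3x3 conv.
    A sketch of the 'one convolution per upsampling stage' idea."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # 1x1 projection aligns the skip width before fusion (assumed).
        self.skip_proj = nn.Conv2d(skip_ch, in_ch, kernel_size=1)
        self.blend = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep, skip):
        deep = F.interpolate(deep, size=skip.shape[2:],
                             mode="bilinear", align_corners=False)
        return self.blend(deep + self.skip_proj(skip))
```

Chaining such stages over the DenseNet-169 stage outputs keeps the upsampling
datapath very lean relative to the downsampling path.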
Improving Semantic Segmentation via Video Propagation and Label Relaxation
Semantic segmentation requires large amounts of pixel-wise annotations to
learn accurate models. In this paper, we present a video prediction-based
methodology to scale up training sets by synthesizing new training samples in
order to improve the accuracy of semantic segmentation networks. We exploit
video prediction models' ability to predict future frames in order to also
predict future labels. A joint propagation strategy is also proposed to
alleviate mis-alignments in synthesized samples. We demonstrate that training
segmentation models on datasets augmented by the synthesized samples leads to
significant improvements in accuracy. Furthermore, we introduce a novel
boundary label relaxation technique that makes training robust to annotation
noise and propagation artifacts along object boundaries. Our proposed methods
achieve state-of-the-art mIoUs of 83.5% on Cityscapes and 82.9% on CamVid. Our
single model, without model ensembles, achieves 72.8% mIoU on the KITTI
semantic segmentation test set, which surpasses the winning entry of the ROB
challenge 2018. Our code and videos can be found at
https://nv-adlr.github.io/publication/2018-Segmentation.
Comment: CVPR 2019 Oral. Code link:
https://github.com/NVIDIA/semantic-segmentation. YouTube link:
https://www.youtube.com/watch?v=aEbXjGZDZS
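The boundary label relaxation technique admits a compact sketch: along object
boundaries the one-hot target is replaced by the set of classes present in a
pixel's neighbourhood, and the loss maximizes the total probability assigned
to any admissible class. A hedged PyTorch illustration, with tensor names and
the multi-hot encoding as our assumptions:

```python
import torch
import torch.nn.functional as F

def relaxed_boundary_loss(logits, multi_hot_targets):
    """Boundary label relaxation sketch.

    logits:            (N, C, H, W) raw network outputs.
    multi_hot_targets: (N, C, H, W) binary mask, 1 for every class found
                       in a pixel's local neighbourhood (interior pixels
                       remain one-hot, so this reduces to cross-entropy).
    """
    probs = F.softmax(logits, dim=1)
    # Total probability mass over all admissible classes per pixel.
    admissible = (probs * multi_hot_targets).sum(dim=1).clamp_min(1e-8)
    return -torch.log(admissible).mean()
```

Because any of the neighbouring classes is accepted at a boundary pixel,
annotation noise and propagation artifacts there no longer produce large,
misleading gradients.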
Multi-layer Feature Aggregation for Deep Scene Parsing Models
Scene parsing from images is a fundamental yet challenging problem in visual
content understanding. In this dense prediction task, the parsing model assigns
every pixel to a categorical label, which requires the contextual information
of adjacent image patches. The challenge of this learning task is thus to
simultaneously describe the geometric and semantic properties of objects or a
scene. In this paper, we explore the effective use of the multi-layer feature
outputs of deep parsing networks for spatial-semantic consistency by
designing a novel feature aggregation module that generates a global
representation prior that improves the discriminative power of features. The
proposed module automatically selects the intermediate visual features that
correlate spatial and semantic information. At the same time, the multiple
skip connections provide strong supervision, making the deep parsing network
easy to train. Extensive experiments on four public scene parsing datasets
show that a deep parsing network equipped with the proposed feature
aggregation module achieves very promising results.
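The abstract leaves the module design open; one plausible minimal realization
in PyTorch gates each intermediate feature with a learned scalar before
summing, so the network can "auto-select" which layers feed the global
representation prior. All names and sizes below are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregation(nn.Module):
    """Hypothetical multi-layer aggregation: project each backbone
    feature to a common width, resize to a common grid, and weight it
    by a gate predicted from its globally pooled statistics."""
    def __init__(self, in_channels, out_ch):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Conv2d(c, out_ch, kernel_size=1) for c in in_channels)
        self.gates = nn.ModuleList(
            nn.Linear(out_ch, 1) for _ in in_channels)

    def forward(self, feats):
        target = feats[0].shape[2:]
        fused = 0.0
        for f, proj, gate in zip(feats, self.projs, self.gates):
            f = F.interpolate(proj(f), size=target, mode="bilinear",
                              align_corners=False)
            w = torch.sigmoid(gate(f.mean(dim=(2, 3))))   # (N, 1) gate
            fused = fused + w[:, :, None, None] * f
        return fused
```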
Efficient Ladder-style DenseNets for Semantic Segmentation of Large Images
Recent progress of deep image classification models has provided great
potential to improve state-of-the-art performance in related computer vision
tasks. However, the transition to semantic segmentation is hampered by strict
memory limitations of contemporary GPUs. The extent of feature map caching
required by convolutional backprop poses significant challenges even for
moderately sized Pascal images, while requiring careful architectural
considerations when the source resolution is in the megapixel range. To address
these concerns, we propose a novel DenseNet-based ladder-style architecture
which features high modelling power and a very lean upsampling datapath. We
also propose to substantially reduce the extent of feature map caching by
exploiting inherent spatial efficiency of the DenseNet feature extractor. The
resulting models deliver high performance with fewer parameters than
competitive approaches, and allow training at megapixel resolution on commodity
hardware. The presented experimental results outperform the state-of-the-art in
terms of prediction accuracy and execution speed on Cityscapes, Pascal VOC
2012, CamVid and ROB 2018 datasets. Source code will be released upon
publication.
Comment: 12 pages, 6 figures, under review
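A standard way to attack the feature-map caching that the abstract identifies
as the bottleneck is gradient checkpointing, which recomputes activations
during backprop instead of storing them all. The paper's caching strategy is
tailored to DenseNet's concatenations; the generic PyTorch sketch below only
illustrates the memory-for-compute trade-off:

```python
import torch
import torchvision
from torch.utils.checkpoint import checkpoint_sequential

# Re-run the DenseNet feature extractor in chunks during backprop
# instead of caching every intermediate feature map.
backbone = torchvision.models.densenet169(weights=None).features

x = torch.randn(1, 3, 512, 1024, requires_grad=True)  # large input
features = checkpoint_sequential(backbone, segments=4, input=x)
features.mean().backward()  # activations inside each segment recomputed
```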
Simultaneous Semantic Segmentation and Outlier Detection in Presence of Domain Shift
Recent success on realistic road driving datasets has increased interest in
exploring robust performance in real-world applications. One of the major
unsolved problems is to identify image content which cannot be reliably
recognized with a given inference engine. We therefore study approaches to
recover a dense outlier map alongside the primary task with a single forward
pass, by relying on shared convolutional features. We consider semantic
segmentation as the primary task and perform extensive validation on WildDash
val (inliers), LSUN val (outliers), and pasted objects from Pascal VOC 2007
(outliers). We achieve the best validation performance by training to
discriminate inliers from pasted ImageNet-1k content, even though ImageNet-1k
contains many road-driving pixels, and, at least nominally, fails to account
for the full diversity of the visual world. The proposed two-head model
performs comparably to a C-way multi-class model trained to predict a uniform
distribution on outliers, while outperforming several other validated
approaches. We evaluate our best two models on the WildDash test dataset and
set a new state of the art on the WildDash benchmark.
Comment: Accepted to German Conference on Pattern Recognition 2019. 25 pages, 10 figures, 9 tables
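A sketch of the shared-feature, two-head layout in PyTorch: both heads read
the same convolutional features in a single forward pass, so the dense outlier
map is nearly free. The backbone, widths, and head shapes are placeholders
rather than the paper's architecture:

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadSegOutlier(nn.Module):
    """Primary semantic segmentation head plus a dense outlier head,
    both on top of shared convolutional features (sketch)."""
    def __init__(self, backbone, feat_ch, num_classes):
        super().__init__()
        self.backbone = backbone                    # shared features
        self.seg_head = nn.Conv2d(feat_ch, num_classes, kernel_size=1)
        self.ood_head = nn.Conv2d(feat_ch, 2, kernel_size=1)  # in/outlier

    def forward(self, x):
        f = self.backbone(x)
        seg = F.interpolate(self.seg_head(f), size=x.shape[2:],
                            mode="bilinear", align_corners=False)
        ood = F.interpolate(self.ood_head(f), size=x.shape[2:],
                            mode="bilinear", align_corners=False)
        return seg, ood
```

During training, the outlier head would be supervised with road-driving pixels
as inliers and pasted ImageNet-1k content as outliers, as described above.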
Seamless Scene Segmentation
In this work we introduce a novel, CNN-based architecture that can be trained
end-to-end to deliver seamless scene segmentation results. Our goal is to
predict consistent semantic segmentation and detection results by means of a
panoptic output format, going beyond the simple combination of independently
trained segmentation and detection models. The proposed architecture takes
advantage of a novel segmentation head that seamlessly integrates multi-scale
features generated by a Feature Pyramid Network with contextual information
conveyed by a light-weight DeepLab-like module. As an additional contribution, we
review the panoptic metric and propose an alternative that overcomes its
limitations when evaluating non-instance categories. Our proposed network
architecture yields state-of-the-art results on three challenging street-level
datasets, i.e., Cityscapes, the Indian Driving Dataset, and Mapillary Vistas.
Comment: extended version of the accepted CVPR 2019 paper
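As an illustration of a "light-weight DeepLab-like module" that supplies
contextual information, here is an ASPP-style sketch with parallel dilated
convolutions and a global-pooling branch. The dilation rates and widths are
assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightContextModule(nn.Module):
    """ASPP-style context sketch: parallel dilated 3x3 branches plus a
    global-pooling branch, fused by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates)
        self.global_branch = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.fuse = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        outs = [b(x) for b in self.branches]
        g = self.global_branch(F.adaptive_avg_pool2d(x, 1))
        outs.append(g.expand_as(outs[0]))   # broadcast global context
        return self.fuse(torch.cat(outs, dim=1))
```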
Deep Neural Network Perception Models and Robust Autonomous Driving Systems
This paper analyzes the robustness of deep learning models in autonomous
driving applications and discusses practical solutions to address it.
Context-Integrated and Feature-Refined Network for Lightweight Object Parsing
Semantic segmentation for lightweight object parsing is a challenging
task, because both accuracy and efficiency (e.g., execution speed, memory
footprint, and computational complexity) must be taken into account. However,
most previous works pay attention to only one side, either accuracy or speed,
and ignore the other, which greatly limits their usefulness for the actual
demands of intelligent devices. To tackle this dilemma, we propose a
novel lightweight architecture named Context-Integrated and Feature-Refined
Network (CIFReNet). The core components of CIFReNet are the Long-skip
Refinement Module (LRM) and the Multi-scale Context Integration Module (MCIM).
The LRM is designed to ease the propagation of spatial information between
low-level and high-level stages. Furthermore, a channel attention mechanism is
introduced into the long-skip learning process to boost the quality of
low-level feature refinement. Meanwhile, the MCIM consists of three cascaded
Dense Semantic Pyramid (DSP) blocks with image-level features, which encode
multi-scale context information and enlarge the field of view.
Specifically, the proposed DSP block exploits a dense feature sampling strategy
to enhance the information representations without significantly increasing the
computation cost. Comprehensive experiments are conducted on three benchmark
datasets for object parsing: Cityscapes, CamVid, and Helen. The results
indicate that the proposed method achieves a better trade-off between accuracy
and efficiency than other state-of-the-art methods.
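The abstract does not detail the LRM, but a long skip with channel attention
is commonly realized as an SE-style gate in which high-level context
re-weights the channels of the low-level feature. A hypothetical PyTorch
sketch:

```python
import torch.nn as nn

class LongSkipRefinement(nn.Module):
    """SE-style sketch: pooled high-level features produce per-channel
    weights that refine the low-level skip before it is merged back."""
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.attn = nn.Sequential(
            nn.Conv2d(high_ch, low_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, low, high):
        # (N, low_ch, 1, 1) channel weights gate the low-level feature.
        return low * self.attn(self.pool(high))
```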
Temporally Distributed Networks for Fast Video Semantic Segmentation
We present TDNet, a temporally distributed network designed for fast and
accurate video semantic segmentation. We observe that features extracted from a
certain high-level layer of a deep CNN can be approximated by composing
features extracted from several shallower sub-networks. Leveraging the inherent
temporal continuity in videos, we distribute these sub-networks over sequential
frames. Therefore, at each time step, we only need to perform a lightweight
computation to extract one group of sub-features from a single sub-network.
The full features used for segmentation are then recomposed by a novel
attention propagation module that compensates for geometric deformation between
frames. A grouped knowledge distillation loss is also introduced to further
improve the representation power at both full and sub-feature levels.
Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method
achieves state-of-the-art accuracy with significantly faster speed and lower
latency.
Comment: [CVPR 2020] Project: https://github.com/feinanshan/TDNet
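A compact sketch of the temporally distributed scheme: m shallow sub-networks
are cycled over frames, one per time step, and the last m sub-feature groups
are recomposed into the full representation. Plain concatenation stands in for
the paper's attention propagation module, and the warm-up handling is our own
simplification:

```python
import collections
import torch
import torch.nn as nn

class TemporallyDistributed(nn.Module):
    """Run one shallow sub-network per frame and recompose the full
    feature from the last m sub-feature groups (sketch)."""
    def __init__(self, subnets, head):
        super().__init__()
        self.subnets = nn.ModuleList(subnets)
        self.head = head
        self.buffer = collections.deque(maxlen=len(subnets))

    def forward(self, frame, t):
        # Only one lightweight sub-network runs at this time step.
        self.buffer.append(self.subnets[t % len(self.subnets)](frame))
        feats = list(self.buffer)
        while len(feats) < self.buffer.maxlen:
            feats.append(feats[-1])   # warm-up: reuse newest features
        return self.head(torch.cat(feats, dim=1))
```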
Improving Semantic Segmentation via Self-Training
Deep learning usually achieves the best results with complete supervision. In
the case of semantic segmentation, this means that large amounts of pixelwise
annotations are required to learn accurate models. In this paper, we show that
we can obtain state-of-the-art results using a semi-supervised approach,
specifically a self-training paradigm. We first train a teacher model on
labeled data, and then generate pseudo labels on a large set of unlabeled data.
Our robust training framework can digest human-annotated and pseudo labels
jointly and achieve top performance on the Cityscapes, CamVid and KITTI datasets
while requiring significantly less supervision. We also demonstrate the
effectiveness of self-training on a challenging cross-domain generalization
task, outperforming the conventional fine-tuning method by a large margin. Lastly,
to alleviate the computational burden caused by the large amount of pseudo
labels, we propose a fast training schedule to accelerate the training of
segmentation models by up to 2x without performance degradation.
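The self-training recipe reduces to a short loop: the trained teacher labels a
large unlabeled set, uncertain pixels are masked out, and the student then
trains on human and pseudo labels jointly. A PyTorch sketch, with a confidence
threshold that is our assumption rather than the paper's setting:

```python
import torch

def generate_pseudo_labels(teacher, unlabeled_loader, device,
                           threshold=0.9, ignore_index=255):
    """Label unlabeled images with a trained teacher; pixels whose top
    softmax probability falls below `threshold` are set to the ignore
    index so the student's loss skips them (sketch)."""
    teacher.eval()
    pseudo = []
    with torch.no_grad():
        for images in unlabeled_loader:
            probs = torch.softmax(teacher(images.to(device)), dim=1)
            conf, labels = probs.max(dim=1)
            labels[conf < threshold] = ignore_index
            pseudo.append(labels.cpu())
    return pseudo
```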