Robust Semantic Segmentation with Ladder-DenseNet Models
We present semantic segmentation experiments with a model capable of performing
predictions on four benchmark datasets: Cityscapes, ScanNet, WildDash and
KITTI. We employ a ladder-style convolutional architecture featuring a modified
DenseNet-169 model in the downsampling datapath, and only one convolution in
each stage of the upsampling datapath. Due to limited computing resources, we
perform the training only on Cityscapes Fine train+val, ScanNet train, WildDash
val and KITTI train. We evaluate the trained model on the test subsets of the
four benchmarks in accordance with the guidelines of the Robust Vision
Challenge ROB 2018. The experiments reveal several interesting findings,
which we describe and discuss.
Comment: 4 pages, 4 figures, CVPR 2018 Robust Vision Challenge Workshop
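To make the ladder-style decoder concrete, below is a minimal PyTorch sketch of
one upsampling stage with exactly one convolution, as the abstract describes.
The module name, fusion scheme, and channel widths are our own illustration,
not the Ladder-DenseNet implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class LadderUpsampleStage(nn.Module):
    """One decoder stage: upsample the deep feature, fuse the skip
    connection from the downsampling datapath, apply a single 3x3 conv.
    A sketch of the 'one convolution per upsampling stage' idea."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # 1x1 projection aligns the skip width before fusion (assumed).
        self.skip_proj = nn.Conv2d(skip_ch, in_ch, kernel_size=1)
        self.blend = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep, skip):
        deep = F.interpolate(deep, size=skip.shape[2:],
                             mode="bilinear", align_corners=False)
        return self.blend(deep + self.skip_proj(skip))
```

Chaining such stages over the DenseNet-169 stage outputs keeps the upsampling
datapath very lean relative to the downsampling path.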
Improving Semantic Segmentation via Video Propagation and Label Relaxation
Semantic segmentation requires large amounts of pixel-wise annotations to
learn accurate models. In this paper, we present a video prediction-based
methodology to scale up training sets by synthesizing new training samples in
order to improve the accuracy of semantic segmentation networks. We exploit
video prediction models' ability to predict future frames in order to also
predict future labels. A joint propagation strategy is also proposed to
alleviate mis-alignments in synthesized samples. We demonstrate that training
segmentation models on datasets augmented by the synthesized samples leads to
significant improvements in accuracy. Furthermore, we introduce a novel
boundary label relaxation technique that makes training robust to annotation
noise and propagation artifacts along object boundaries. Our proposed methods
achieve state-of-the-art mIoUs of 83.5% on Cityscapes and 82.9% on CamVid. Our
single model, without model ensembles, achieves 72.8% mIoU on the KITTI
semantic segmentation test set, which surpasses the winning entry of the ROB
challenge 2018. Our code and videos can be found at
https://nv-adlr.github.io/publication/2018-Segmentation.
Comment: CVPR 2019 Oral. Code link:
https://github.com/NVIDIA/semantic-segmentation. YouTube link:
https://www.youtube.com/watch?v=aEbXjGZDZS
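The boundary label relaxation technique admits a compact sketch: along object
boundaries the one-hot target is replaced by the set of classes present in a
pixel's neighbourhood, and the loss maximizes the total probability assigned
to any admissible class. A hedged PyTorch illustration, with tensor names and
the multi-hot encoding as our assumptions:

```python
import torch
import torch.nn.functional as F

def relaxed_boundary_loss(logits, multi_hot_targets):
    """Boundary label relaxation sketch.

    logits:            (N, C, H, W) raw network outputs.
    multi_hot_targets: (N, C, H, W) binary mask, 1 for every class found
                       in a pixel's local neighbourhood (interior pixels
                       remain one-hot, so this reduces to cross-entropy).
    """
    probs = F.softmax(logits, dim=1)
    # Total probability mass over all admissible classes per pixel.
    admissible = (probs * multi_hot_targets).sum(dim=1).clamp_min(1e-8)
    return -torch.log(admissible).mean()
```

Because any of the neighbouring classes is accepted at a boundary pixel,
annotation noise and propagation artifacts there no longer produce large,
misleading gradients.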
Multi-layer Feature Aggregation for Deep Scene Parsing Models
Scene parsing from images is a fundamental yet challenging problem in visual
content understanding. In this dense prediction task, the parsing model assigns
every pixel to a categorical label, which requires the contextual information
of adjacent image patches. The challenge of this learning task is thus to
simultaneously describe the geometric and semantic properties of objects or a
scene. In this paper, we explore the effective use of the multi-layer feature
outputs of deep parsing networks for spatial-semantic consistency by
designing a novel feature aggregation module that generates a global
representation prior that improves the discriminative power of features. The
proposed module automatically selects the intermediate visual features that
correlate spatial and semantic information. At the same time, the multiple
skip connections provide strong supervision, making the deep parsing network
easy to train. Extensive experiments on four public scene parsing datasets
show that a deep parsing network equipped with the proposed feature
aggregation module achieves very promising results.
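The abstract leaves the module design open; one plausible minimal realization
in PyTorch gates each intermediate feature with a learned scalar before
summing, so the network can "auto-select" which layers feed the global
representation prior. All names and sizes below are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregation(nn.Module):
    """Hypothetical multi-layer aggregation: project each backbone
    feature to a common width, resize to a common grid, and weight it
    by a gate predicted from its globally pooled statistics."""
    def __init__(self, in_channels, out_ch):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Conv2d(c, out_ch, kernel_size=1) for c in in_channels)
        self.gates = nn.ModuleList(
            nn.Linear(out_ch, 1) for _ in in_channels)

    def forward(self, feats):
        target = feats[0].shape[2:]
        fused = 0.0
        for f, proj, gate in zip(feats, self.projs, self.gates):
            f = F.interpolate(proj(f), size=target, mode="bilinear",
                              align_corners=False)
            w = torch.sigmoid(gate(f.mean(dim=(2, 3))))   # (N, 1) gate
            fused = fused + w[:, :, None, None] * f
        return fused
```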
Efficient Ladder-style DenseNets for Semantic Segmentation of Large Images
Recent progress of deep image classification models has provided great
potential to improve state-of-the-art performance in related computer vision
tasks. However, the transition to semantic segmentation is hampered by strict
memory limitations of contemporary GPUs. The extent of feature map caching
required by convolutional backprop poses significant challenges even for
moderately sized Pascal images, while requiring careful architectural
considerations when the source resolution is in the megapixel range. To address
these concerns, we propose a novel DenseNet-based ladder-style architecture
which features high modelling power and a very lean upsampling datapath. We
also propose to substantially reduce the extent of feature map caching by
exploiting inherent spatial efficiency of the DenseNet feature extractor. The
resulting models deliver high performance with fewer parameters than
competitive approaches, and allow training at megapixel resolution on commodity
hardware. The presented experimental results outperform the state-of-the-art in
terms of prediction accuracy and execution speed on Cityscapes, Pascal VOC
2012, CamVid and ROB 2018 datasets. Source code will be released upon
publication.
Comment: 12 pages, 6 figures, under review
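A standard way to attack the feature-map caching that the abstract identifies
as the bottleneck is gradient checkpointing, which recomputes activations
during backprop instead of storing them all. The paper's caching strategy is
tailored to DenseNet's concatenations; the generic PyTorch sketch below only
illustrates the memory-for-compute trade-off:

```python
import torch
import torchvision
from torch.utils.checkpoint import checkpoint_sequential

# Re-run the DenseNet feature extractor in chunks during backprop
# instead of caching every intermediate feature map.
backbone = torchvision.models.densenet169(weights=None).features

x = torch.randn(1, 3, 512, 1024, requires_grad=True)  # large input
features = checkpoint_sequential(backbone, segments=4, input=x)
features.mean().backward()  # activations inside each segment recomputed
```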
Simultaneous Semantic Segmentation and Outlier Detection in Presence of Domain Shift
Recent success on realistic road driving datasets has increased interest in
exploring robust performance in real-world applications. One of the major
unsolved problems is to identify image content which cannot be reliably
recognized with a given inference engine. We therefore study approaches to
recover a dense outlier map alongside the primary task with a single forward
pass, by relying on shared convolutional features. We consider semantic
segmentation as the primary task and perform extensive validation on WildDash
val (inliers), LSUN val (outliers), and pasted objects from Pascal VOC 2007
(outliers). We achieve the best validation performance by training to
discriminate inliers from pasted ImageNet-1k content, even though ImageNet-1k
contains many road-driving pixels, and, at least nominally, fails to account
for the full diversity of the visual world. The proposed two-head model
performs comparably to a C-way multi-class model trained to predict a uniform
distribution on outliers, while outperforming several other validated
approaches. We evaluate our best two models on the WildDash test dataset and
set a new state of the art on the WildDash benchmark.
Comment: Accepted to German Conference on Pattern Recognition 2019. 25 pages, 10 figures, 9 tables
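A sketch of the shared-feature, two-head layout in PyTorch: both heads read
the same convolutional features in a single forward pass, so the dense outlier
map is nearly free. The backbone, widths, and head shapes are placeholders
rather than the paper's architecture:

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadSegOutlier(nn.Module):
    """Primary semantic segmentation head plus a dense outlier head,
    both on top of shared convolutional features (sketch)."""
    def __init__(self, backbone, feat_ch, num_classes):
        super().__init__()
        self.backbone = backbone                    # shared features
        self.seg_head = nn.Conv2d(feat_ch, num_classes, kernel_size=1)
        self.ood_head = nn.Conv2d(feat_ch, 2, kernel_size=1)  # in/outlier

    def forward(self, x):
        f = self.backbone(x)
        seg = F.interpolate(self.seg_head(f), size=x.shape[2:],
                            mode="bilinear", align_corners=False)
        ood = F.interpolate(self.ood_head(f), size=x.shape[2:],
                            mode="bilinear", align_corners=False)
        return seg, ood
```

During training, the outlier head would be supervised with road-driving pixels
as inliers and pasted ImageNet-1k content as outliers, as described above.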
Seamless Scene Segmentation
In this work we introduce a novel, CNN-based architecture that can be trained
end-to-end to deliver seamless scene segmentation results. Our goal is to
predict consistent semantic segmentation and detection results by means of a
panoptic output format, going beyond the simple combination of independently
trained segmentation and detection models. The proposed architecture takes
advantage of a novel segmentation head that seamlessly integrates multi-scale
features generated by a Feature Pyramid Network with contextual information
conveyed by a light-weight DeepLab-like module. As an additional contribution, we
review the panoptic metric and propose an alternative that overcomes its
limitations when evaluating non-instance categories. Our proposed network
architecture yields state-of-the-art results on three challenging street-level
datasets, i.e., Cityscapes, the Indian Driving Dataset, and Mapillary Vistas.
Comment: extended version of the accepted CVPR 2019 paper
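As an illustration of a "light-weight DeepLab-like module" that supplies
contextual information, here is an ASPP-style sketch with parallel dilated
convolutions and a global-pooling branch. The dilation rates and widths are
assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightContextModule(nn.Module):
    """ASPP-style context sketch: parallel dilated 3x3 branches plus a
    global-pooling branch, fused by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates)
        self.global_branch = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.fuse = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        outs = [b(x) for b in self.branches]
        g = self.global_branch(F.adaptive_avg_pool2d(x, 1))
        outs.append(g.expand_as(outs[0]))   # broadcast global context
        return self.fuse(torch.cat(outs, dim=1))
```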
Deep Neural Network Perception Models and Robust Autonomous Driving Systems
This paper analyzes the robustness of deep learning models in autonomous
driving applications and discusses practical solutions to address it.
Context-Integrated and Feature-Refined Network for Lightweight Object Parsing
Semantic segmentation for lightweight object parsing is a challenging
task, because both accuracy and efficiency (e.g., execution speed, memory
footprint, and computational complexity) must be taken into account. However,
most previous works pay attention to only one side, either accuracy or speed,
and ignore the other, which greatly limits their usefulness for the actual
demands of intelligent devices. To tackle this dilemma, we propose a
novel lightweight architecture named Context-Integrated and Feature-Refined
Network (CIFReNet). The core components of CIFReNet are the Long-skip
Refinement Module (LRM) and the Multi-scale Context Integration Module (MCIM).
The LRM is designed to ease the propagation of spatial information between
low-level and high-level stages. Furthermore, a channel attention mechanism is
introduced into the long-skip learning process to boost the quality of
low-level feature refinement. Meanwhile, the MCIM consists of three cascaded
Dense Semantic Pyramid (DSP) blocks with image-level features, which encode
multi-scale context information and enlarge the field of view.
Specifically, the proposed DSP block exploits a dense feature sampling strategy
to enhance the information representations without significantly increasing the
computation cost. Comprehensive experiments are conducted on three benchmark
datasets for object parsing: Cityscapes, CamVid, and Helen. The results
indicate that the proposed method achieves a better trade-off between accuracy
and efficiency than other state-of-the-art methods.
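The abstract does not detail the LRM, but a long skip with channel attention
is commonly realized as an SE-style gate in which high-level context
re-weights the channels of the low-level feature. A hypothetical PyTorch
sketch:

```python
import torch.nn as nn

class LongSkipRefinement(nn.Module):
    """SE-style sketch: pooled high-level features produce per-channel
    weights that refine the low-level skip before it is merged back."""
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.attn = nn.Sequential(
            nn.Conv2d(high_ch, low_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, low, high):
        # (N, low_ch, 1, 1) channel weights gate the low-level feature.
        return low * self.attn(self.pool(high))
```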
Temporally Distributed Networks for Fast Video Semantic Segmentation
We present TDNet, a temporally distributed network designed for fast and
accurate video semantic segmentation. We observe that features extracted from a
certain high-level layer of a deep CNN can be approximated by composing
features extracted from several shallower sub-networks. Leveraging the inherent
temporal continuity in videos, we distribute these sub-networks over sequential
frames. Therefore, at each time step, we only need to perform a lightweight
computation to extract one group of sub-features from a single sub-network.
The full features used for segmentation are then recomposed by a novel
attention propagation module that compensates for geometric deformation between
frames. A grouped knowledge distillation loss is also introduced to further
improve the representation power at both full and sub-feature levels.
Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method
achieves state-of-the-art accuracy with significantly faster speed and lower
latency.
Comment: [CVPR 2020] Project: https://github.com/feinanshan/TDNet
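A compact sketch of the temporally distributed scheme: m shallow sub-networks
are cycled over frames, one per time step, and the last m sub-feature groups
are recomposed into the full representation. Plain concatenation stands in for
the paper's attention propagation module, and the warm-up handling is our own
simplification:

```python
import collections
import torch
import torch.nn as nn

class TemporallyDistributed(nn.Module):
    """Run one shallow sub-network per frame and recompose the full
    feature from the last m sub-feature groups (sketch)."""
    def __init__(self, subnets, head):
        super().__init__()
        self.subnets = nn.ModuleList(subnets)
        self.head = head
        self.buffer = collections.deque(maxlen=len(subnets))

    def forward(self, frame, t):
        # Only one lightweight sub-network runs at this time step.
        self.buffer.append(self.subnets[t % len(self.subnets)](frame))
        feats = list(self.buffer)
        while len(feats) < self.buffer.maxlen:
            feats.append(feats[-1])   # warm-up: reuse newest features
        return self.head(torch.cat(feats, dim=1))
```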
Improving Semantic Segmentation via Self-Training
Deep learning usually achieves the best results with complete supervision. In
the case of semantic segmentation, this means that large amounts of pixelwise
annotations are required to learn accurate models. In this paper, we show that
we can obtain state-of-the-art results using a semi-supervised approach,
specifically a self-training paradigm. We first train a teacher model on
labeled data, and then generate pseudo labels on a large set of unlabeled data.
Our robust training framework can digest human-annotated and pseudo labels
jointly and achieve top performance on the Cityscapes, CamVid and KITTI datasets
while requiring significantly less supervision. We also demonstrate the
effectiveness of self-training on a challenging cross-domain generalization
task, outperforming the conventional fine-tuning method by a large margin. Lastly,
to alleviate the computational burden caused by the large amount of pseudo
labels, we propose a fast training schedule to accelerate the training of
segmentation models by up to 2x without performance degradation.
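The self-training recipe reduces to a short loop: the trained teacher labels a
large unlabeled set, uncertain pixels are masked out, and the student then
trains on human and pseudo labels jointly. A PyTorch sketch, with a confidence
threshold that is our assumption rather than the paper's setting:

```python
import torch

def generate_pseudo_labels(teacher, unlabeled_loader, device,
                           threshold=0.9, ignore_index=255):
    """Label unlabeled images with a trained teacher; pixels whose top
    softmax probability falls below `threshold` are set to the ignore
    index so the student's loss skips them (sketch)."""
    teacher.eval()
    pseudo = []
    with torch.no_grad():
        for images in unlabeled_loader:
            probs = torch.softmax(teacher(images.to(device)), dim=1)
            conf, labels = probs.max(dim=1)
            labels[conf < threshold] = ignore_index
            pseudo.append(labels.cpu())
    return pseudo
```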