Attention guided global enhancement and local refinement network for semantic segmentation
The encoder-decoder architecture is widely used as a lightweight semantic
segmentation network. However, its performance lags behind that of
well-designed Dilated-FCN models, for two major reasons. First, commonly used
upsampling methods in the decoder, such as interpolation and deconvolution,
suffer from a local receptive field and cannot encode global contexts. Second,
low-level features may introduce noise into the network decoder through skip
connections, owing to the lack of semantic concepts in early encoder layers. To
tackle these challenges, a Global Enhancement Method is proposed to aggregate
global information from high-level feature maps and adaptively distribute it
to different decoder layers, alleviating the shortage of global context in the
upsampling process. In addition, a Local Refinement Module is developed that
uses the decoder features as semantic guidance to refine the noisy encoder
features before the two are fused. The two modules are then integrated into a
Context Fusion Block, on which a novel Attention-guided Global enhancement and
Local refinement Network (AGLN) is built. Extensive experiments on
PASCAL Context, ADE20K, and PASCAL VOC 2012 datasets have demonstrated the
effectiveness of the proposed approach. In particular, with a vanilla
ResNet-101 backbone, AGLN achieves the state-of-the-art result (56.23% mean
IoU) on the PASCAL Context dataset. The code is available at
https://github.com/zhasen1996/AGLN.
Comment: 12 pages, 6 figures
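The global-enhancement idea, aggregating global context from a high-level feature map and redistributing it to a decoder layer, can be sketched in a few lines. This is a minimal NumPy sketch with hypothetical names; a plain sigmoid gate stands in for the learned layers of the actual method:

```python
import numpy as np

def global_enhance(high_level, decoder_feat):
    """Aggregate global context from a high-level feature map (C, H, W)
    and use it to re-weight a decoder feature map channel-wise."""
    # Global average pooling collapses the spatial dims: one value per channel.
    context = high_level.mean(axis=(1, 2))        # shape (C,)
    # A gate in (0, 1); a sigmoid stands in for learned attention layers.
    gate = 1.0 / (1.0 + np.exp(-context))         # shape (C,)
    # Broadcast the global gate over every spatial position of the decoder map.
    return decoder_feat * gate[:, None, None]

feat_hi = np.random.rand(8, 4, 4)     # high-level encoder output
feat_dec = np.random.rand(8, 16, 16)  # one decoder layer's feature map
out = global_enhance(feat_hi, feat_dec)
print(out.shape)  # (8, 16, 16)
```

The same global descriptor could be gated differently per decoder layer, which is how the abstract's "adaptively distribute" reads.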
Enhanced Boundary Learning for Glass-like Object Segmentation
Glass-like objects such as windows, bottles, and mirrors exist widely in the
real world. Sensing these objects has many applications, including robot
navigation and grasping. However, this task is very challenging due to the
arbitrary scenes behind glass-like objects. This paper aims to solve the
glass-like object segmentation problem via enhanced boundary learning. In
particular, we first propose a novel refined differential module that outputs
finer boundary cues. We then introduce an edge-aware point-based graph
convolution network module to model the global shape along the boundary. We use
these two modules to design a decoder that generates accurate and clean
segmentation results, especially on the object contours. Both modules are
lightweight and effective: they can be embedded into various segmentation
models. In extensive experiments on three recent glass-like object segmentation
datasets, including Trans10k, MSD, and GDD, our approach establishes new
state-of-the-art results. We also illustrate the strong generalization
properties of our method on three generic segmentation datasets, including
Cityscapes, BDD, and COCO Stuff. Code and models are available at
\url{https://github.com/hehao13/EBLNet}.
Comment: ICCV 2021
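One simple way to picture a differential boundary cue is as the residual between a feature map and its local average, so that flat regions cancel and contours remain. This is a toy NumPy sketch; `boundary_cue` and the box filter are illustrative assumptions, not the paper's refined differential module:

```python
import numpy as np

def boundary_cue(feat, k=3):
    """Toy 'differential' boundary cue: subtract a k x k box average from the
    feature map; flat regions cancel, leaving a response along contours."""
    C, H, W = feat.shape
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    blurred = np.zeros_like(feat)
    for dy in range(k):          # accumulate the k x k neighbourhood
        for dx in range(k):
            blurred += padded[:, dy:dy + H, dx:dx + W]
    blurred /= k * k
    return np.abs(feat - blurred)

x = np.zeros((1, 8, 8))
x[:, :, 4:] = 1.0                # a vertical step edge at column 4
cue = boundary_cue(x)
print(cue[0, 4])                 # response concentrates around the edge
```

A real module would compute such cues from learned features and feed them to the edge-aware graph convolution for global shape modeling.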
P2AT: Pyramid Pooling Axial Transformer for Real-time Semantic Segmentation
Recently, Transformer-based models have achieved promising results in various
vision tasks, due to their ability to model long-range dependencies. However,
transformers are computationally expensive, which limits their applications in
real-time tasks such as autonomous driving. In addition, efficient selection
and fusion of local and global features is vital for accurate dense
prediction, especially in driving-scene understanding tasks. In this paper, we
propose a
real-time semantic segmentation architecture named Pyramid Pooling Axial
Transformer (P2AT). The proposed P2AT takes a coarse feature from the CNN
encoder to produce scale-aware contextual features, which are then combined
with the multi-level feature aggregation scheme to produce enhanced contextual
features. Specifically, we introduce a pyramid pooling axial transformer to
capture intricate spatial and channel dependencies, leading to improved
performance on semantic segmentation. Then, we design a Bidirectional Fusion
module (BiF) to combine semantic information at different levels. Meanwhile, a
Global Context Enhancer is introduced to compensate for the inadequacy of
concatenating different semantic levels. Finally, a decoder block is proposed
to help maintain a larger receptive field. We evaluate P2AT variants on three
challenging scene-understanding datasets. In particular, our P2AT variants
achieve state-of-the-art results on the CamVid dataset: 80.5%, 81.0%, and
81.1% for P2AT-S, P2AT-M, and P2AT-L, respectively. Furthermore, our
experiments on Cityscapes and PASCAL VOC 2012 have demonstrated the efficiency
of the proposed architecture, with P2AT-M achieving 78.7% on Cityscapes. The
source code will be made available.
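Pyramid pooling, the spatial-context ingredient named in the title, can be illustrated as pooling a feature map into several grid sizes, upsampling each grid back, and concatenating the results. This is a minimal NumPy sketch under assumed divisible shapes, not the P2AT implementation:

```python
import numpy as np

def pyramid_pool(feat, bins=(1, 2, 4)):
    """Average-pool a (C, H, W) feature map into several b x b grids,
    upsample each back (nearest neighbour), and concatenate along channels."""
    C, H, W = feat.shape
    outs = [feat]
    for b in bins:
        # average pool into a b x b grid (assumes H and W divisible by b)
        pooled = feat.reshape(C, b, H // b, b, W // b).mean(axis=(2, 4))
        # nearest-neighbour upsample back to (H, W)
        up = np.repeat(np.repeat(pooled, H // b, axis=1), W // b, axis=2)
        outs.append(up)
    return np.concatenate(outs, axis=0)  # (C * (1 + len(bins)), H, W)

feat = np.random.rand(4, 8, 8)
ctx = pyramid_pool(feat)
print(ctx.shape)  # (16, 8, 8)
```

In P2AT the pooled scales feed an axial-attention transformer rather than a plain concatenation, but the multi-scale pooling step is the same shape of computation.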
Real-time Semantic Segmentation with Context Aggregation Network
With the increasing demand for autonomous systems, pixelwise semantic
segmentation for visual scene understanding needs to be not only accurate but
also efficient for potential real-time applications. In this paper, we propose
Context Aggregation Network, a dual branch convolutional neural network, with
significantly lower computational costs as compared to the state-of-the-art,
while maintaining a competitive prediction accuracy. Building upon the existing
dual branch architectures for high-speed semantic segmentation, we design a
cheap high-resolution branch for effective spatial detailing and a context
branch with lightweight versions of global aggregation and local distribution
blocks, capable of capturing both the long-range and the local contextual
dependencies required for accurate semantic segmentation, at low computational
overhead.
We evaluate our method on two semantic segmentation datasets, namely the
Cityscapes dataset and the UAVid dataset. On the Cityscapes test set, our
model achieves state-of-the-art results with an mIoU of 75.9%, at 76 FPS on an
NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. On the UAVid dataset, our
network achieves an mIoU of 63.5% at high execution speed (15 FPS).
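A dual-branch layout of this kind can be caricatured as a full-resolution branch fused with a global summary from a downsampled context branch. This is a toy NumPy sketch with hypothetical names; the real network uses learned global aggregation and local distribution blocks rather than a plain mean:

```python
import numpy as np

def dual_branch(image):
    """Toy two-branch layout: a full-resolution 'spatial' branch plus a
    strided-downsampled 'context' branch whose global summary is broadcast
    back onto the high-resolution features."""
    C, H, W = image.shape
    spatial = image                         # cheap high-resolution branch
    context = image[:, ::4, ::4]            # low-resolution context branch
    global_ctx = context.mean(axis=(1, 2))  # long-range summary, shape (C,)
    # fuse: modulate the high-res branch with the global context
    return spatial + global_ctx[:, None, None]

img = np.random.rand(3, 32, 32)
fused = dual_branch(img)
print(fused.shape)  # (3, 32, 32)
```

Keeping the expensive context computation at low resolution is what makes such designs fast enough for embedded targets like the Jetson Xavier NX.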