Unpaired Image Captioning via Scene Graph Alignments
Most current image captioning models rely heavily on paired image-caption
datasets. However, collecting large-scale paired image-caption data is
labor-intensive and time-consuming. In this paper, we present a scene
graph-based approach for unpaired image captioning. Our framework comprises an
image scene graph generator, a sentence scene graph generator, a scene graph
encoder, and a sentence decoder. Specifically, we first train the scene graph
encoder and the sentence decoder on the text modality. To align the scene
graphs between images and sentences, we propose an unsupervised feature
alignment method that maps the scene graph features from the image to the
sentence modality. Experimental results show that our proposed model generates
promising captions without using any image-caption training pairs,
outperforming existing methods by a wide margin.
Comment: Accepted in ICCV 2019
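The abstract does not pin down the alignment mechanism, so the following is a minimal sketch of one plausible reading: an adversarially trained mapper that pushes pooled image scene-graph features into the sentence feature space. The FeatureMapper/Discriminator names and the GAN-style objective are illustrative assumptions, not the paper's confirmed method.

```python
# Hedged sketch: unsupervised cross-modal feature alignment via an
# adversarial mapper (assumption; the paper only states the alignment
# is unsupervised, not how it is trained).
import torch
import torch.nn as nn

class FeatureMapper(nn.Module):
    """Maps pooled image scene-graph features into the sentence feature space."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Scores whether a feature looks like a real sentence scene-graph feature."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(0.2), nn.Linear(dim, 1))
    def forward(self, x):
        return self.net(x)

mapper, disc = FeatureMapper(), Discriminator()
opt_g = torch.optim.Adam(mapper.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def align_step(img_feats, sent_feats):
    # img_feats / sent_feats: (B, dim) pooled scene-graph embeddings drawn
    # from *unpaired* image and text corpora.
    fake = mapper(img_feats)
    # Discriminator: real sentence features vs. mapped image features.
    d_loss = bce(disc(sent_feats), torch.ones(len(sent_feats), 1)) + \
             bce(disc(fake.detach()), torch.zeros(len(fake), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Mapper: fool the discriminator so mapped features live in text space.
    g_loss = bce(disc(fake), torch.ones(len(fake), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```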
Semantic Segmentation with Labeling Uncertainty and Class Imbalance
Recently, methods based on Convolutional Neural Networks (CNNs) have achieved
impressive success in semantic segmentation tasks. However, challenges such as
class imbalance and uncertainty in the pixel-labeling process have not been
completely addressed. As such, we present a new approach that calculates a
weight for each pixel considering its class and uncertainty during the labeling
process. The pixel-wise weights are used during training to increase or
decrease the importance of the pixels. Experimental results show that the
proposed approach leads to significant improvements in three challenging
segmentation tasks in comparison to baseline methods. It also proved
more robust to noise. The approach presented here may be used within a wide
range of semantic segmentation methods to improve their robustness.
Comment: 15 pages, 9 figures, 3 tables
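A minimal sketch of the pixel-weighting idea, assuming the per-pixel weight is the product of an inverse-class-frequency term and a term that down-weights uncertain labels; the paper's exact weighting formula may differ.

```python
# Hedged sketch: per-pixel weighted cross-entropy combining class imbalance
# and labeling uncertainty (the multiplicative form is an assumption).
import torch
import torch.nn.functional as F

def pixel_weighted_ce(logits, labels, class_freq, uncertainty):
    """
    logits:      (B, C, H, W) raw network outputs
    labels:      (B, H, W)    ground-truth class indices
    class_freq:  (C,)         relative frequency of each class in the dataset
    uncertainty: (B, H, W)    per-pixel labeling uncertainty in [0, 1]
                              (e.g. disagreement between annotators)
    """
    # Rarer classes get larger weights; uncertain pixels get smaller ones.
    class_w = (1.0 / class_freq.clamp(min=1e-6))[labels]    # (B, H, W)
    pixel_w = class_w * (1.0 - uncertainty)
    ce = F.cross_entropy(logits, labels, reduction="none")  # (B, H, W)
    return (pixel_w * ce).sum() / pixel_w.sum().clamp(min=1e-6)
```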
AICSD: Adaptive Inter-Class Similarity Distillation for Semantic Segmentation
In recent years, deep neural networks have achieved remarkable accuracy in
computer vision tasks. With inference time being a crucial factor, particularly
in dense prediction tasks such as semantic segmentation, knowledge distillation
has emerged as a successful technique for improving the accuracy of lightweight
student networks. However, existing methods often neglect channel-wise
information and the relations among different classes. To overcome these
limitations, this paper
proposes a novel method called Inter-Class Similarity Distillation (ICSD) for
the purpose of knowledge distillation. The proposed method transfers high-order
relations from the teacher network to the student network by independently
computing intra-class distributions for each class from network outputs. This
is followed by calculating inter-class similarity matrices for distillation
using KL divergence between distributions of each pair of classes. To further
improve the effectiveness of the proposed method, an Adaptive Loss Weighting
(ALW) training strategy is proposed. Unlike existing methods, the ALW strategy
gradually reduces the influence of the teacher network toward the end of the
training process to account for errors in the teacher's predictions. Extensive
experiments conducted on two well-known datasets for semantic segmentation,
Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed
method in terms of mIoU and pixel accuracy. The proposed method outperforms
most existing knowledge distillation methods, as demonstrated by both
quantitative and qualitative evaluations. Code is available at:
https://github.com/AmirMansurian/AICSD
Comment: 10 pages, 5 figures, 5 tables
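A minimal sketch of the ICSD loss and the ALW schedule described above, assuming each class's intra-class distribution is its spatially normalized output map, pairwise KL divergences form the inter-class similarity matrix, and the teacher weight decays linearly; the linked repository is authoritative for the exact formulation.

```python
# Hedged sketch: inter-class similarity distillation with adaptive loss
# weighting (normalization choice and linear decay are assumptions).
import torch
import torch.nn.functional as F

def class_distributions(logits):
    # (B, C, H, W) -> (B, C, H*W): each class channel becomes a spatial
    # probability distribution over pixel locations.
    b, c, h, w = logits.shape
    return F.softmax(logits.view(b, c, h * w), dim=-1)

def inter_class_similarity(logits, eps=1e-8):
    p = class_distributions(logits)           # (B, C, N)
    logp = (p + eps).log()
    # Pairwise KL(p_i || p_j) for every pair of classes -> (B, C, C).
    return (p.unsqueeze(2) * (logp.unsqueeze(2) - logp.unsqueeze(1))).sum(-1)

def icsd_loss(student_logits, teacher_logits, step, total_steps):
    s = inter_class_similarity(student_logits)
    t = inter_class_similarity(teacher_logits).detach()
    # Adaptive Loss Weighting: decay the teacher's influence so late
    # training is dominated by the ground-truth loss.
    alw = 1.0 - step / total_steps
    return alw * F.mse_loss(s, t)
```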
Spatial Information Guided Convolution for Real-Time RGBD Semantic Segmentation
3D spatial information is known to be beneficial to the semantic segmentation
task. Most existing methods take 3D spatial data as an additional input,
leading to a two-stream segmentation network that processes RGB and 3D spatial
information separately. This solution greatly increases the inference time and
severely limits its scope for real-time applications. To solve this problem, we
propose Spatial information guided Convolution (S-Conv), which allows efficient
RGB feature and 3D spatial information integration. S-Conv infers the
sampling offsets of the convolution kernel from the 3D spatial information,
helping the convolutional layer adjust its receptive field and
adapt to geometric transformations. S-Conv also incorporates geometric
information into the feature learning process by generating spatially adaptive
convolutional weights. The capability of perceiving geometry is greatly
enhanced with little increase in the number of parameters or computational
cost. We further embed S-Conv into a semantic segmentation network, called
Spatial information Guided convolutional Network (SGNet), resulting in
real-time inference and state-of-the-art performance on NYUDv2 and SUNRGBD
datasets.
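A minimal sketch of the offset-prediction half of S-Conv, built on torchvision's deformable convolution with offsets predicted from the depth input; the spatially adaptive weight generation described above is omitted, and the layer names are illustrative.

```python
# Hedged sketch: spatially guided convolution in the spirit of S-Conv
# (assumption: a small branch predicts deformable-conv offsets from depth).
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class SpatiallyGuidedConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        # Offset branch: maps 1-channel depth to 2 offsets per kernel tap.
        self.offset = nn.Conv2d(1, 2 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)   # start as a regular convolution
        nn.init.zeros_(self.offset.bias)
        self.k = k

    def forward(self, rgb_feat, depth):
        # Sampling locations shift according to scene geometry, so the
        # receptive field adapts to 3D structure without a second stream.
        offsets = self.offset(depth)
        return deform_conv2d(rgb_feat, offsets, self.weight,
                             padding=self.k // 2)

# feat: (B, 64, H, W) RGB features; depth: (B, 1, H, W) depth map.
sconv = SpatiallyGuidedConv(64, 64)
out = sconv(torch.randn(2, 64, 32, 32), torch.randn(2, 1, 32, 32))
```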
P2AT: Pyramid Pooling Axial Transformer for Real-time Semantic Segmentation
Recently, Transformer-based models have achieved promising results in various
vision tasks, due to their ability to model long-range dependencies. However,
transformers are computationally expensive, which limits their applications in
real-time tasks such as autonomous driving. In addition, the efficient
selection and fusion of local and global features are vital for accurate dense
prediction, especially in driving scene understanding tasks. In this paper, we propose a
real-time semantic segmentation architecture named Pyramid Pooling Axial
Transformer (P2AT). The proposed P2AT takes a coarse feature from the CNN
encoder to produce scale-aware contextual features, which are then combined
with the multi-level feature aggregation scheme to produce enhanced contextual
features. Specifically, we introduce a pyramid pooling axial transformer to
capture intricate spatial and channel dependencies, leading to improved
performance on semantic segmentation. Then, we design a Bidirectional Fusion
module (BiF) to combine semantic information at different levels. Meanwhile, a
Global Context Enhancer is introduced to compensate for the inadequacy of
concatenating different semantic levels. Finally, a decoder block is proposed
to help maintain a larger receptive field. We evaluate P2AT variants on three
challenging scene-understanding datasets. In particular, our P2AT variants
achieve state-of-the-art results on the CamVid dataset: 80.5%, 81.0%, and 81.1%
for P2AT-S, P2AT-M, and P2AT-L, respectively. Furthermore, our experiments on
Cityscapes and Pascal VOC 2012 demonstrate the efficiency of the proposed
architecture, with P2AT-M achieving 78.7% on Cityscapes. The source code will
be made available.
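A minimal sketch of the axial attention at the heart of P2AT, assuming standard height-then-width axial self-attention; the pyramid pooling, BiF, and Global Context Enhancer modules are omitted.

```python
# Hedged sketch: axial self-attention, the core operation behind the
# pyramid pooling axial transformer (standard formulation assumed).
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H, W). Attend along W (rows), then along H (columns),
        # so cost grows with H + W instead of H * W per query.
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)    # (B*H, W, C)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c)
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, c)    # (B*W, H, C)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, c).permute(0, 3, 2, 1)  # (B, C, H, W)

attn = AxialAttention(dim=64)
y = attn(torch.randn(2, 64, 32, 32))   # output keeps the input shape
```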