Compact Generalized Non-local Network
The non-local module is designed to capture long-range spatio-temporal
dependencies in images and videos. Although it has shown excellent performance,
it lacks the mechanism to model the interactions between positions across
channels, which are of vital importance in recognizing fine-grained objects and
actions. To address this limitation, we generalize the non-local module and
take the correlations between the positions of any two channels into account.
This extension uses a compact representation of multiple kernel functions via
Taylor expansion, which gives the generalized non-local module a fast,
low-complexity computation flow. Moreover, we implement our generalized
non-local method within channel groups to ease optimization.
Experimental results illustrate the clear-cut improvements and practical
applicability of the generalized non-local module on both fine-grained object
recognition and video classification. Code is available at:
https://github.com/KaiyuYue/cgnl-network.pytorch
Comment: Technical report; to appear at NIPS 2018; code is available at https://github.com/KaiyuYue/cgnl-network.pytorch
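As a concrete illustration, here is a minimal PyTorch sketch of the core idea (not the authors' implementation): all elements of the embedded C'*H*W feature volume attend to each other, and a first-order dot-product kernel combined with associativity avoids ever forming the CHW x CHW affinity matrix. The actual CGNL block sums several Taylor-expansion terms of richer kernels and operates within channel groups.

```python
import torch
import torch.nn as nn

class SimpleGNL(nn.Module):
    """Sketch of a generalized non-local block with a linear (dot-product)
    kernel: every element of the C'*H*W feature volume attends to every
    other element, across positions AND channels, but the huge affinity
    matrix is never formed explicitly."""

    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Conv2d(channels, reduced, 1)
        self.phi = nn.Conv2d(channels, reduced, 1)
        self.g = nn.Conv2d(channels, reduced, 1)
        self.out = nn.Conv2d(reduced, channels, 1)

    def forward(self, x):
        b, _, h, w = x.shape
        t = self.theta(x).view(b, -1, 1)   # (B, N, 1), N = C'*H*W
        p = self.phi(x).view(b, -1, 1)
        g = self.g(x).view(b, -1, 1)
        # Element-wise attention y_i = t_i * sum_j p_j g_j: associativity
        # avoids the N x N affinity matrix, so the cost stays O(N).
        att = torch.bmm(p.transpose(1, 2), g)      # (B, 1, 1)
        y = torch.bmm(t, att).view(b, -1, h, w)    # (B, C', H, W)
        return x + self.out(y)                     # residual connection
```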
Efficient Coarse-to-Fine Non-Local Module for the Detection of Small Objects
An image is not just a collection of objects, but rather a graph where each
object is related to other objects through spatial and semantic relations.
Using relational reasoning modules, such as the non-local module
\cite{wang2017non}, can therefore improve object detection. Current schemes
apply such dedicated modules either to a specific layer of the bottom-up
stream, or between already-detected objects. We show that the relational
process can be better modeled in a coarse-to-fine manner and present a novel
framework, applying a non-local module sequentially to increasing resolution
feature maps along the top-down stream. In this way, information can naturally
be passed from larger objects to smaller related ones. Applying the module to fine
feature maps further allows the information to pass between the small objects
themselves, exploiting repetitions of instances of the same class. In practice,
due to the expensive memory utilization of the non-local module, it is
infeasible to apply the module as currently used to high-resolution feature
maps. We redesigned the non-local module, improving its memory footprint and
operation count so that it can be placed anywhere along the network. We
further incorporated relative spatial information into the module, in a manner
compatible with our efficient implementation. We show the effectiveness of our
scheme by improving the results of detecting small objects on COCO by 1-2 AP
points over Faster R-CNN and Mask R-CNN, and by 1 AP over using a non-local
module on the bottom-up stream.
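A hedged sketch of the coarse-to-fine scheme, assuming a generic FPN-style top-down pathway; `nonlocal_block` is a placeholder for the paper's redesigned memory-efficient module (an `nn.Identity` factory works for shape testing).

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownWithNonLocal(nn.Module):
    """Sketch of the coarse-to-fine idea: a (memory-efficient) non-local
    module is applied after each top-down merge, from the coarsest level
    to the finest, so context flows from large objects to small ones."""

    def __init__(self, channels, num_levels, nonlocal_block):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nonlocal_block(channels) for _ in range(num_levels)])

    def forward(self, laterals):
        # laterals: list of (B, C, H_i, W_i) maps, ordered coarse -> fine.
        outs = []
        x = laterals[0]
        for i, lat in enumerate(laterals):
            if i > 0:  # upsample the coarser result and merge it in
                x = lat + F.interpolate(x, size=lat.shape[-2:],
                                        mode='nearest')
            x = self.blocks[i](x)  # relational reasoning at this scale
            outs.append(x)
        return outs

# Usage sketch: TopDownWithNonLocal(256, 4, lambda c: nn.Identity())
```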
LatentGNN: Learning Efficient Non-local Relations for Visual Recognition
Capturing long-range dependencies in feature representations is crucial for
many visual recognition tasks. Despite recent successes of deep convolutional
networks, it remains challenging to model non-local context relations between
visual features. A promising strategy is to model the feature context by a
fully-connected graph neural network (GNN), which augments traditional
convolutional features with an estimated non-local context representation.
However, most GNN-based approaches require computing a dense graph affinity
matrix and hence have difficulty in scaling up to tackle complex real-world
visual problems. In this work, we propose an efficient and yet flexible
non-local relation representation based on a novel class of graph neural
networks. Our key idea is to introduce a latent space to reduce the complexity
of the graph, which allows us to use a low-rank representation for the graph
affinity matrix and to achieve linear complexity in computation. Extensive
experimental evaluations on three major visual recognition tasks show that our
method outperforms prior works by a large margin while maintaining a low
computation cost.
Comment: ICML 2019
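The low-rank idea can be sketched as follows; this illustrates the complexity argument only, not the paper's exact parameterization. Messages are routed through k << N latent nodes, so propagation costs O(N*k*d) instead of the O(N^2*d) of a dense affinity matrix.

```python
import torch
import torch.nn as nn

class LatentNonLocal(nn.Module):
    """Sketch of a low-rank non-local relation: the N x N affinity matrix
    is implicitly factored as A = a a^T through k latent nodes."""

    def __init__(self, dim, num_latent=16):
        super().__init__()
        self.psi = nn.Linear(dim, num_latent)  # visible -> latent affinities

    def forward(self, x):
        # x: (B, N, dim) flattened visual features
        a = torch.softmax(self.psi(x), dim=1)  # (B, N, k), normalized over N
        z = torch.bmm(a.transpose(1, 2), x)    # gather into latents: (B, k, dim)
        ctx = torch.bmm(a, z)                  # scatter back: (B, N, dim)
        return x + ctx                         # residual non-local context
```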
Toward Interpretable Music Tagging with Self-Attention
Self-attention is an attention mechanism that learns a representation by
relating different positions in the sequence. The transformer, which is a
sequence model solely based on self-attention, and its variants achieved
state-of-the-art results in many natural language processing tasks. Since
musical semantics are built from relations between components at sparse
positions, adopting the self-attention mechanism to solve music information
retrieval (MIR) problems can be beneficial. Hence, we propose a
self-attention-based deep sequence model for music tagging. The proposed architecture consists
of shallow convolutional layers followed by stacked Transformer encoders.
Compared to conventional approaches using fully convolutional or recurrent
neural networks, our model is more interpretable while reporting competitive
results. We validate the performance of our model with the MagnaTagATune and
the Million Song Dataset. In addition, we demonstrate the interpretability of
the proposed architecture with a heat map visualization.
Comment: 13 pages, 12 figures; code: https://github.com/minzwon/self-attention-music-tagging
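A minimal PyTorch sketch of the described architecture, with illustrative layer sizes; the paper's exact configuration (including its positional encoding, omitted here for brevity) differs.

```python
import torch
import torch.nn as nn

class ConvTransformerTagger(nn.Module):
    """Sketch: shallow convolutions turn a mel-spectrogram into a sequence
    of local feature vectors, stacked Transformer encoders relate distant
    positions, and a linear head predicts multi-label tag probabilities."""

    def __init__(self, n_mels=96, d_model=128, n_layers=4, n_tags=50):
        super().__init__()
        self.front = nn.Sequential(                 # shallow CNN front end
            nn.Conv2d(1, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_tags)

    def forward(self, spec):                          # spec: (B, 1, mels, T)
        h = self.front(spec)                          # (B, d, mels/4, T/4)
        h = h.mean(dim=2).transpose(1, 2)             # (B, T/4, d) sequence
        h = self.encoder(h)                           # stacked self-attention
        return torch.sigmoid(self.head(h.mean(dim=1)))  # multi-label tags
```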
Associating Multi-Scale Receptive Fields for Fine-grained Recognition
Extracting and fusing part features have become key to fine-grained
image recognition. Recently, the non-local (NL) module has shown excellent
improvements in image recognition. However, it lacks a mechanism to model the
interactions between multi-scale part features, which is vital for fine-grained
recognition. In this paper, we propose a novel cross-layer non-local (CNL)
module to associate multi-scale receptive fields by two operations. First, CNL
computes correlations between features of a query layer and all response
layers. Second, all response features are weighted according to the
correlations and are added to the query features. Due to the interactions of
cross-layer features, our model builds spatial dependencies among multi-level
layers and learns more discriminative features. In addition, we can reduce the
aggregation cost if we set a low-dimensional deep layer as the query layer.
Experiments show that our model achieves or surpasses state-of-the-art results
on three benchmark datasets for fine-grained classification. Our code can be
found at github.com/FouriYe/CNL-ICIP2020.
Comment: Accepted by ICIP 2020
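The two operations can be sketched as standard attention with the query taken from one layer and keys/values from another. A single response layer is shown here for brevity, whereas the paper aggregates correlations over all response layers.

```python
import torch
import torch.nn as nn

class CrossLayerNonLocal(nn.Module):
    """Sketch of a cross-layer non-local operation: queries come from one
    (deep, low-resolution) layer and keys/values from a response layer at
    another scale, so correlations span different receptive fields."""

    def __init__(self, q_channels, r_channels, inner=64):
        super().__init__()
        self.q = nn.Conv2d(q_channels, inner, 1)
        self.k = nn.Conv2d(r_channels, inner, 1)
        self.v = nn.Conv2d(r_channels, inner, 1)
        self.out = nn.Conv2d(inner, q_channels, 1)

    def forward(self, query_feat, response_feat):
        b, _, h, w = query_feat.shape
        q = self.q(query_feat).flatten(2).transpose(1, 2)     # (B, Nq, d)
        k = self.k(response_feat).flatten(2)                  # (B, d, Nr)
        v = self.v(response_feat).flatten(2).transpose(1, 2)  # (B, Nr, d)
        att = torch.softmax(torch.bmm(q, k), dim=-1)          # (B, Nq, Nr)
        y = torch.bmm(att, v).transpose(1, 2).reshape(b, -1, h, w)
        return query_feat + self.out(y)  # weighted responses added to query
```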
Hierarchical Multi-Scale Attention for Semantic Segmentation
Multi-scale inference is commonly used to improve the results of semantic
segmentation. Multiple image scales are passed through a network and then the
results are combined with averaging or max pooling. In this work, we present an
attention-based approach to combining multi-scale predictions. We show that
predictions at certain scales are better at resolving particular failure
modes, and that the network learns to favor those scales for such cases in
order to generate better predictions. Our attention mechanism is hierarchical,
which enables it to be roughly 4x more memory efficient to train than other
recent approaches. In addition to enabling faster training, this allows us to
train with larger crop sizes which leads to greater model accuracy. We
demonstrate the result of our method on two datasets: Cityscapes and Mapillary
Vistas. For Cityscapes, which has a large number of weakly labelled images, we
also leverage auto-labelling to improve generalization. Using our approach we
achieve new state-of-the-art results on both Mapillary (61.1 IOU val) and
Cityscapes (85.1 IOU test).
Comment: 11 pages, 5 figures
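A minimal sketch of attention-based fusion of two adjacent scales, assuming a per-pixel sigmoid attention map predicted at the finer scale; chaining this pairwise over a sequence of scales gives the hierarchical scheme.

```python
import torch.nn.functional as F

def fuse_two_scales(logits_lo, logits_hi, attn_hi):
    """Sketch: instead of averaging or max pooling, a learned per-pixel
    attention map weights the predictions of two adjacent scales.
    logits_lo: (B, C, h, w) from the coarser scale
    logits_hi: (B, C, H, W) from the finer scale
    attn_hi:   (B, 1, H, W) sigmoid attention for the finer scale"""
    up = F.interpolate(logits_lo, size=logits_hi.shape[-2:],
                       mode='bilinear', align_corners=False)
    return attn_hi * logits_hi + (1.0 - attn_hi) * up

# Hierarchical use over scales ordered coarse -> fine:
# fused = logits[0]
# for logit, attn in zip(logits[1:], attns[1:]):
#     fused = fuse_two_scales(fused, logit, attn)
```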
Low-shot Object Detection via Classification Refinement
This work aims to address the problem of low-shot object detection, where
only a few training samples are available for each category. While
conventional fully supervised approaches usually suffer a huge performance
drop on rare classes where data is insufficient, our study reveals that there
exists a more serious misalignment between classification confidence and
localization accuracy on rarely labeled categories, and that class-specific
parameters prone to overfitting are the crucial cause of this issue. In
this paper, we propose a novel low-shot classification correction network
(LSCN) which can be adopted into any anchor-based detector to directly enhance
the detection accuracy on data-rare categories, without sacrificing the
performance on base categories. Specifically, we sample false positive proposals
from a base detector to train a separate classification correction network.
During inference, the well-trained correction network removes false positives
from the base detector. The proposed correction network is data-efficient yet
highly effective with four carefully designed components, which are Unified
recognition, Global receptive field, Inter-class separation, and Confidence
calibration. Experiments show our proposed method can bring significant
performance gains to rarely labeled categories and outperforms previous work on
COCO and PASCAL VOC by a large margin.
Comment: Submitted to NIPS 2020
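A heavily hedged sketch of the inference-time use described above: a separately trained correction network re-scores the base detector's outputs so false positives on rare classes are suppressed. The fusion rule (a geometric mean controlled by `alpha`), the threshold, and the crop-based input to `correction_net` are all assumptions for illustration, not the paper's design.

```python
import torch

def correct_detections(boxes, scores, labels, crops, correction_net,
                       alpha=0.5):
    """Re-score detections with a classification-correction network and
    drop likely false positives. `labels` must be int64 class indices."""
    with torch.no_grad():
        corr = correction_net(crops).softmax(dim=-1)       # (N, num_classes)
    corr_scores = corr.gather(1, labels.unsqueeze(1)).squeeze(1)
    fused = scores.pow(alpha) * corr_scores.pow(1.0 - alpha)
    keep = fused > 0.05  # assumed threshold for removing false positives
    return boxes[keep], fused[keep], labels[keep]
```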
Global Aggregation then Local Distribution in Fully Convolutional Networks
It has been widely proven that modelling long-range dependencies in fully
convolutional networks (FCNs) via global aggregation modules is critical for
complex scene understanding tasks such as semantic segmentation and object
detection. However, global aggregation is often dominated by features of large
patterns and tends to oversmooth regions that contain small patterns (e.g.,
boundaries and small objects). To resolve this problem, we propose to first use
\emph{Global Aggregation} and then \emph{Local Distribution}, which is called
GALD, where long-range dependencies are more confidently used inside large
pattern regions and vice versa. The size of each pattern at each position is
estimated in the network as a per-channel mask map. GALD is end-to-end
trainable and can be easily plugged into existing FCNs with various global
aggregation modules for a wide range of vision tasks, and consistently improves
the performance of state-of-the-art object detection and instance segmentation
approaches. In particular, GALD used in semantic segmentation achieves new
state-of-the-art performance on Cityscapes test set with mIoU 83.3\%. Code is
available at: \url{https://github.com/lxtGH/GALD-Net}
Comment: accepted at BMVC 2019
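A minimal sketch of the GALD pattern, using global average pooling as a stand-in for the paper's interchangeable global aggregation modules: a predicted per-channel, per-position mask decides how strongly the global context is redistributed at each location, protecting small patterns from over-smoothing.

```python
import torch.nn as nn

class GALDSketch(nn.Module):
    """Sketch of Global Aggregation then Local Distribution: aggregate
    long-range context globally, then distribute it locally through a
    learned mask that estimates the pattern size at each position."""

    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.Sigmoid())  # per-channel mask map

    def forward(self, x):
        g = x.mean(dim=(2, 3), keepdim=True)  # global aggregation (stand-in)
        m = self.mask(x)                      # (B, C, H, W) local mask
        return x + m * g                      # context used where mask is high
```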
Fine-Grained Attention for Weakly Supervised Object Localization
Although recent advances in deep learning have accelerated improvements in
weakly supervised object localization (WSOL), it remains challenging to
identify the entire body of an object rather than only its discriminative
parts. In this paper, we propose a novel residual fine-grained attention (RFGA)
module that autonomously excites the less activated regions of an object by
utilizing information distributed over channels and locations within feature
maps in combination with a residual operation. To be specific, we devise a
series of mechanisms of triple-view attention representation, attention
expansion, and feature calibration. Unlike other attention-based WSOL methods
that learn a coarse attention map, having the same values across elements in
feature maps, our proposed RFGA learns fine-grained values in an attention map
by assigning different attention values for each of the elements. We validated
the superiority of our proposed RFGA module by comparing it with the recent
methods in the literature over three datasets. Further, we analyzed the effect
of each mechanism in our RFGA and visualized attention maps to gain insights.
Comment: 16 pages, 11 figures
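To make the coarse-versus-fine-grained distinction concrete, here is a generic element-wise gate with a residual connection; it is only an illustration, since RFGA's actual triple-view attention, attention expansion, and feature calibration steps are more elaborate.

```python
import torch
import torch.nn as nn

class ElementwiseAttentionSketch(nn.Module):
    """Contrast with coarse attention: a coarse map assigns one value per
    channel (shared across positions), while a fine-grained map assigns a
    distinct value to every element of the feature map."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        a = torch.sigmoid(self.gate(x))  # (B, C, H, W): one value per element
        return x + a * x                 # residual excitation of weak regions
```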
Real-time Semantic Segmentation with Fast Attention
In deep CNN based models for semantic segmentation, high accuracy relies on
rich spatial context (large receptive fields) and fine spatial details (high
resolution), both of which incur high computational costs. In this paper, we
propose a novel architecture that addresses both challenges and achieves
state-of-the-art performance for semantic segmentation of high-resolution
images and videos in real-time. The proposed architecture relies on our fast
spatial attention, which is a simple yet efficient modification of the popular
self-attention mechanism and captures the same rich spatial context at a small
fraction of the computational cost, by changing the order of operations.
Moreover, to efficiently process high-resolution input, we apply an additional
spatial reduction to intermediate feature stages of the network with minimal
loss in accuracy thanks to the use of the fast attention module to fuse
features. We validate our method with a series of experiments and show that
results on multiple datasets demonstrate superior accuracy and speed compared
to existing approaches for real-time semantic segmentation. On Cityscapes, our
network achieves 74.4% mIoU at 72 FPS and 75.5% mIoU at 58 FPS on a single
Titan X GPU, which is ~50% faster than the state-of-the-art while retaining
the same accuracy.
Comment: project page: https://cs-people.bu.edu/pinghu/FANet.html
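The order-of-operations trick can be sketched directly: with L2-normalized queries and keys in place of softmax (the fast-attention substitution described above), associativity lets K^T V be computed first, reducing cost from O(n^2 * c) to O(n * c^2). The 1/n normalization below is an assumption standing in for the module's exact scaling.

```python
import torch
import torch.nn.functional as F

def fast_attention(q, k, v):
    """Sketch of fast spatial attention via reordered matrix products.
    q, k, v: (B, n, c) flattened spatial features."""
    n = q.shape[1]
    q = F.normalize(q, dim=-1)             # cosine-style affinity in place
    k = F.normalize(k, dim=-1)             # of softmax normalization
    ctx = torch.bmm(k.transpose(1, 2), v)  # (B, c, c): aggregate first
    return torch.bmm(q, ctx) / n           # (B, n, c): then distribute
```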