Deep Networks with Internal Selective Attention through Feedback Connections
Traditional convolutional neural networks (CNN) are stationary and
feedforward. They neither change their parameters during evaluation nor use
feedback from higher to lower layers. Real brains, however, do. So does our
Deep Attention Selective Network (dasNet) architecture. DasNet's feedback
structure can dynamically alter its convolutional filter sensitivities during
classification. It harnesses the power of sequential processing to improve
classification performance, by allowing the network to iteratively focus its
internal attention on some of its convolutional filters. Feedback is trained
through direct policy search in a huge million-dimensional parameter space,
using scalable natural evolution strategies (SNES). On the CIFAR-10 and
CIFAR-100 datasets, dasNet outperforms the previous state-of-the-art model.
Comment: 13 pages, 3 figures
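To make the search procedure named in this abstract more concrete, here is a minimal sketch of a separable natural evolution strategy, which keeps one mean and one step size per parameter. It is not the dasNet training code: the `sphere` objective, the function name `snes_minimize`, the population size, and the learning rates are illustrative assumptions (the latter two follow commonly used defaults), and the toy problem has 5 dimensions rather than a million.

```python
import math
import random


def snes_minimize(f, mu, sigma, iterations=300, seed=0):
    """Separable NES sketch: adapt one mean and one step size per dimension."""
    rng = random.Random(seed)
    d = len(mu)
    n = 4 + int(3 * math.log(d))  # assumed default population size
    eta_sigma = (3 + math.log(d)) / (5 * math.sqrt(d))  # step-size learning rate
    # Rank-based utilities: they depend only on the ordering of fitness values.
    raw = [max(0.0, math.log(n / 2 + 1) - math.log(k + 1)) for k in range(n)]
    utilities = [r / sum(raw) - 1.0 / n for r in raw]
    mu, sigma = list(mu), list(sigma)
    for _ in range(iterations):
        noise = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
        samples = [[mu[i] + sigma[i] * s[i] for i in range(d)] for s in noise]
        # Best (lowest f) sample first, so it receives the largest utility.
        order = sorted(range(n), key=lambda k: f(samples[k]))
        for i in range(d):
            g_mu = sum(u * noise[k][i] for u, k in zip(utilities, order))
            g_sigma = sum(u * (noise[k][i] ** 2 - 1.0)
                          for u, k in zip(utilities, order))
            mu[i] += sigma[i] * g_mu  # mean learning rate of 1
            sigma[i] *= math.exp(0.5 * eta_sigma * g_sigma)
    return mu, sigma


def sphere(x):
    """Toy objective (an assumption for illustration): squared distance to 0."""
    return sum(v * v for v in x)


start = [3.0] * 5
best, step_sizes = snes_minimize(sphere, start, [1.0] * 5)
```

In a setting like the one the abstract describes, `f` would presumably score a candidate feedback policy by the resulting classification performance; that evaluation is not shown here.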
Cross-Modal Attentional Context Learning for RGB-D Object Detection
Recognizing objects from simultaneously sensed photometric (RGB) and depth
channels is a fundamental yet practical problem in many machine vision
applications such as robot grasping and autonomous driving. In this paper, we
address this problem by developing a Cross-Modal Attentional Context (CMAC)
learning framework, which enables the full exploitation of the context
information from both RGB and depth data. Compared to existing RGB-D object
detection frameworks, our approach has several appealing properties. First, it
consists of an attention-based global context model for exploiting adaptive
contextual information and incorporating this information into a region-based
CNN (e.g., Fast RCNN) framework to achieve improved object detection
performance. Second, our CMAC framework further contains a fine-grained object
part attention module to harness multiple discriminative object parts inside
each possible object region for superior local feature representation. While
greatly improving the accuracy of RGB-D object detection, the effective
cross-modal information fusion as well as attentional context modeling in our
proposed model provide an interpretable visualization scheme. Experimental
results demonstrate that the proposed method significantly improves upon the
state of the art on all public benchmarks.
Comment: Accepted as a regular paper by IEEE Transactions on Image Processing
Relational Long Short-Term Memory for Video Action Recognition
Spatial and temporal relationships, both short-range and long-range, between
objects in videos, are key cues for recognizing actions. It is a challenging
problem to model them jointly. In this paper, we first present a new variant of
Long Short-Term Memory, namely Relational LSTM, to address the challenge of
relation reasoning across space and time between objects. In our Relational
LSTM module, we utilize a non-local operation similar in spirit to the recently
proposed non-local network to substitute the fully connected operation in the
vanilla LSTM. By doing this, our Relational LSTM is capable of capturing long
and short-range spatio-temporal relations between objects in videos in a
principled way. Then, we propose a two-branch neural architecture consisting of
the Relational LSTM module as the non-local branch and a spatio-temporal
pooling based local branch. The local branch is utilized for capturing local
spatial appearance and/or short-term motion features. The two branches are
concatenated to learn video-level features from snippet-level ones which are
then used for classification. Experimental results on UCF-101 and HMDB-51
datasets show that our model achieves state-of-the-art results among LSTM-based
methods, while obtaining performance comparable to other state-of-the-art
methods (which use schemes that are not directly comparable). Further, on the more
complex large-scale Charades dataset, we obtain a large 3.2% gain over
state-of-the-art methods, verifying the effectiveness of our method in complex
understanding.
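The non-local operation mentioned in this abstract can be illustrated with a small sketch: each position's output is a softmax-weighted sum of the features at all positions, with weights given by pairwise dot-product similarity. This is a simplified form with no learned projections and no LSTM gating, so it is not the Relational LSTM itself; the function name `non_local` and the toy inputs are assumptions.

```python
import math


def non_local(features):
    """For each position i, return sum_j softmax_j(x_i . x_j) * x_j."""
    outputs = []
    for xi in features:
        # Pairwise similarity between position i and every position j.
        scores = [sum(a * b for a, b in zip(xi, xj)) for xj in features]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        # Weighted sum over all positions, channel by channel.
        outputs.append([
            sum(w * xj[c] for w, xj in zip(weights, features))
            for c in range(len(xi))
        ])
    return outputs


# Two positions with 2-channel features (toy values).
mixed = non_local([[0.0, 0.0], [2.0, 0.0]])
```

Because the zero vector is equally similar to both positions, `mixed[0]` is their plain average, while `mixed[1]` stays close to its own feature.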
Deep Discriminative Representation Learning with Attention Map for Scene Classification
Learning powerful discriminative features for remote sensing image scene
classification is a challenging computer vision problem. In the past, most
classification approaches were based on handcrafted features. However, most
recent approaches to remote sensing scene classification are based on
Convolutional Neural Networks (CNNs). The de facto practice when learning these
CNN models is only to use original RGB patches as input with training performed
on large amounts of labeled data (ImageNet). In this paper, we show that class
activation map (CAM) encoded CNN models, codenamed DDRL-AM, trained using
original RGB patches and attention-map-based class information, provide
information complementary to the standard RGB deep models. To the best of our
knowledge, we are the first to investigate attention information encoded CNNs.
Additionally, to enhance the discriminability, we further employ a recently
developed objective function called "center loss," which has proved to be very
useful in face recognition. Finally, our framework provides attention guidance
to the model in an end-to-end fashion. Extensive experiments on two benchmark
datasets show that our approach matches or exceeds the performance of other
methods.
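As a rough illustration of the "center loss" this abstract refers to, the sketch below penalizes the squared distance between each feature vector and the center of its class, and moves each center toward the mean of its assigned features. It follows the commonly cited formulation and is not taken from the DDRL-AM code; the function names, the `alpha` value, and the toy data are assumptions.

```python
def center_loss(features, labels, centers):
    """Half the summed squared distance from each feature to its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        total += sum((a - b) ** 2 for a, b in zip(x, centers[y]))
    return 0.5 * total


def update_centers(features, labels, centers, alpha=0.5):
    """Move each class center toward the features assigned to it."""
    new_centers = [list(c) for c in centers]
    for j, c in enumerate(centers):
        members = [x for x, y in zip(features, labels) if y == j]
        for d in range(len(c)):
            # The 1 in the denominator keeps the update defined for empty classes.
            delta = sum(c[d] - x[d] for x in members) / (1 + len(members))
            new_centers[j][d] = c[d] - alpha * delta
    return new_centers


# One feature of class 0, with the class center two units away (toy values).
features, labels, centers = [[0.0, 0.0]], [0], [[2.0, 0.0]]
loss = center_loss(features, labels, centers)
moved = update_centers(features, labels, centers)
```

In practice this term is typically added to a softmax classification loss with a weighting factor, which is omitted here.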
Parallel Separable 3D Convolution for Video and Volumetric Data Understanding
For video and volumetric data understanding, 3D convolution layers are widely
used in deep learning, but at the cost of increased computation and
training time. Recent works seek to replace the 3D convolution layer with
convolution blocks, e.g. structured combinations of 2D and 1D convolution
layers. In this paper, we propose a novel convolution block, Parallel Separable
3D Convolution (PmSCn), which applies m parallel streams of n 2D and one 1D
convolution layers along different dimensions. We first mathematically justify
the need for parallel streams (Pm) to replace a single 3D convolution layer
through tensor decomposition. Then we jointly replace consecutive 3D
convolution layers, common in modern network architectures, with the multiple
2D convolution layers (Cn). Lastly, we empirically show that PmSCn is
applicable to different backbone architectures, such as ResNet, DenseNet, and
UNet, for different applications, such as video action recognition, MRI brain
segmentation, and electron microscopy segmentation. In all three applications,
we replace the 3D convolution layers in state-of-the-art models with PmSCn and
achieve around 14% improvement in test performance and 40% reduction in model
size and on average
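To give a feel for why replacing a 3D convolution with 2D and 1D convolutions can shrink a model, the following sketch counts weights for a full k x k x k layer and for a generic factorization into one k x k 2D layer followed by one length-k 1D layer per stream. This is a simplified assumption, not the exact PmSCn layout: biases are ignored, the intermediate width is taken equal to the output width, and the function names are hypothetical.

```python
def conv3d_params(k, c_in, c_out):
    """Weights in a full k x k x k 3D convolution (biases ignored)."""
    return k ** 3 * c_in * c_out


def separable_params(k, c_in, c_out, streams=1):
    """Weights in `streams` parallel (k x k 2D, then length-k 1D) stacks."""
    per_stream = k * k * c_in * c_out + k * c_out * c_out
    return streams * per_stream


full = conv3d_params(3, 64, 64)            # 27 * 64 * 64
one_stream = separable_params(3, 64, 64)   # (9 + 3) * 64 * 64
two_streams = separable_params(3, 64, 64, streams=2)
```

With these toy settings, even two parallel streams use fewer weights than the single 3D layer; the actual savings in any given network depend on its channel widths and on how many streams and layers are used.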
Multiple Attentional Pyramid Networks for Chinese Herbal Recognition
Chinese herbs play a critical role in Traditional Chinese Medicine. Due to
different recognition granularity, they can be recognized accurately only by
professionals with much experience. It is expected that they can be recognized
automatically using new techniques like machine learning. However, there is no
Chinese herbal image dataset available. Likewise, there is no machine
learning method which can deal with Chinese herbal image recognition well.
Therefore, this paper begins with building a new standard Chinese-Herbs
dataset. Subsequently, we propose Attentional Pyramid Networks (APN) for Chinese
herbal recognition, which introduce and apply both a novel competitive attention
and a spatial collaborative attention. APN can
adaptively model Chinese herbal images with different feature scales. Finally,
a new framework for Chinese herbal recognition is proposed as a new application
of APN. Experiments are conducted on our constructed dataset and validate the
effectiveness of our methods.
Comment: 14 pages, 8 figures
SCAN: Self-and-Collaborative Attention Network for Video Person Re-identification
Video person re-identification has attracted much attention in recent years. It
aims to match image sequences of pedestrians from different camera views.
Previous approaches usually improve this task from three aspects, including a)
selecting more discriminative frames, b) generating more informative temporal
representations, and c) developing more effective distance metrics. To address
the above issues, we present a novel and practical deep architecture for video
person re-identification termed Self-and-Collaborative Attention Network
(SCAN). It has several appealing properties. First, SCAN adopts a non-parametric
attention mechanism to refine the intra-sequence and inter-sequence feature
representations of videos, and outputs a self-and-collaborative feature
representation for each video, aligning the discriminative frames between
the probe and gallery sequences. Second, beyond existing models, a generalized
pairwise similarity measurement is proposed to calculate the similarity feature
representations of video pairs, enabling computing the matching scores by the
binary classifier. Third, a dense clip segmentation strategy is also introduced
to generate rich probe-gallery pairs to optimize the model. Extensive
experiments demonstrate the effectiveness of SCAN, which outperforms the
best-performing baselines on the iLIDS-VID, PRID2011, and MARS datasets.
Comment: 10 pages, 5 figures
MARS: Memory Attention-Aware Recommender System
In this paper, we study the problem of modeling users' diverse interests.
Previous methods usually learn a fixed user representation, which has a limited
ability to represent distinct interests of a user. In order to model users'
various interests, we propose a Memory Attention-aware Recommender System
(MARS). MARS utilizes a memory component and a novel attentional mechanism to
learn deep adaptive user representations. Trained in an end-to-end
fashion, MARS adaptively summarizes users' interests. In the experiments, MARS
outperforms seven state-of-the-art methods on three real-world datasets in
terms of recall and mean average precision. We also demonstrate that MARS is
highly interpretable and can explain its recommendation results, which is
important in many recommendation scenarios.
Weakly-Supervised Action Localization and Action Recognition using Global-Local Attention of 3D CNN
A 3D Convolutional Neural Network (3D CNN) captures spatial and temporal
information from 3D data such as video sequences. However, due to the convolution
and pooling mechanisms, some information loss seems unavoidable. To improve
visual explanations and classification in 3D CNNs, we propose two approaches: i)
aggregating layer-wise global-to-local (global-local) discrete gradients using a
trained 3DResNext network, and ii) implementing an attention gating network to
improve the accuracy of action recognition. The proposed approach intends
to show the usefulness of every layer termed as global-local attention in 3D
CNN via visual attribution, weakly-supervised action localization, and action
recognition. Firstly, the 3DResNext is trained and applied for action
classification using backpropagation concerning the maximum predicted class.
The gradients and activations of every layer are then up-sampled. Later,
aggregation is used to produce more nuanced attention, which points out the
most critical part of the input video for the predicted class. We use contour
thresholding of the final attention for final localization. We evaluate spatial and
temporal action localization in trimmed videos using fine-grained visual
explanation via 3DCam. Experimental results show that the proposed approach
produces informative visual explanations and discriminative attention.
Furthermore, the action recognition via attention gating on each layer produces
better classification results than the baseline model.
Generalize Symbolic Knowledge With Neural Rule Engine
As neural networks have come to dominate the state-of-the-art results in a wide
range of NLP tasks, improving the performance of neural models by integrating
symbolic knowledge has attracted considerable attention. Different from
existing works, this paper investigates the combination of these two powerful
paradigms from the knowledge-driven side. We propose Neural Rule Engine (NRE),
which can learn knowledge explicitly from logic rules and then generalize them
implicitly with neural networks. NRE is implemented with neural module networks
in which each module represents an action of a logic rule. The experiments show
that NRE could greatly improve the generalization abilities of logic rules with
a significant increase in recall. Meanwhile, the precision is still maintained
at a high level.