Recurrent Soft Attention Model for Common Object Recognition
We propose the Recurrent Soft Attention Model, which integrates visual
attention from the original image into an LSTM memory cell through a down-sample
network. The model recurrently transmits visual attention to the memory cells
for glimpse mask generation, which is a more natural way to integrate and
exploit attention in general object detection and recognition problems. We test
our model under the top-1 accuracy metric on the CIFAR-10 dataset. The
experiments show that our down-sample network and feedback mechanism play an
effective role in the overall network structure.
Comment: 5 pages, 4 figures
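To make the described architecture concrete, here is a minimal PyTorch sketch of the idea as stated in the abstract: a down-sample CNN produces a feature map, an LSTM memory cell consumes attention-weighted features, and its hidden state generates the next soft glimpse mask. All module sizes, shapes, and the classification head are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a recurrent soft-attention loop (assumed details, CIFAR-10-sized input).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentSoftAttention(nn.Module):
    def __init__(self, channels=64, hidden=128, num_classes=10, steps=4):
        super().__init__()
        self.steps = steps
        # Down-sample network: 32x32 image -> 8x8 feature map.
        self.downsample = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cell = nn.LSTMCell(channels, hidden)
        self.to_mask = nn.Linear(hidden, 8 * 8)      # hidden state -> glimpse mask logits
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, images):                        # images: (B, 3, 32, 32)
        feats = self.downsample(images)               # (B, C, 8, 8)
        B, C, H, W = feats.shape
        flat = feats.flatten(2)                       # (B, C, H*W)
        h = feats.new_zeros(B, self.cell.hidden_size)
        c = torch.zeros_like(h)
        mask = flat.new_full((B, H * W), 1.0 / (H * W))   # start from uniform attention
        for _ in range(self.steps):
            glimpse = (flat * mask.unsqueeze(1)).sum(-1)  # attention-weighted features (B, C)
            h, c = self.cell(glimpse, (h, c))             # feed the glimpse into the memory cell
            mask = F.softmax(self.to_mask(h), dim=-1)     # next soft glimpse mask
        return self.classifier(h)

logits = RecurrentSoftAttention()(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```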
Survey on the attention based RNN model and its applications in computer vision
Recurrent neural networks (RNN) can be used to solve the sequence to
sequence problem, where both the input and the output have sequential
structures. Usually there are some implicit relations between the structures.
However, it is hard for the common RNN model to fully explore the relations
between the sequences. In this survey, we introduce some attention based RNN
models which can focus on different parts of the input for each output item, in
order to explore and take advantage of the implicit relations between the input
and the output items. The different attention mechanisms are described in
detail. We then introduce some applications in computer vision which apply the
attention based RNN models. The superiority of the attention based RNN model is
shown by the experimental results. Finally, some future research directions are
given.
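The common building block these surveyed models share is a soft-attention step: for each output item, the decoder scores every input position, normalizes the scores with a softmax, and conditions on the weighted sum. The sketch below shows an additive (Bahdanau-style) version of that step; the dimensions and parameter names are illustrative assumptions.

```python
# Minimal additive-attention step shared by attention-based RNN models (assumed shapes).
import torch
import torch.nn.functional as F

def attention_step(decoder_state, encoder_states, W_q, W_k, v):
    """decoder_state: (B, H); encoder_states: (B, T, H); W_q, W_k: (H, H); v: (H,)."""
    query = decoder_state @ W_q                               # (B, H)
    keys = encoder_states @ W_k                               # (B, T, H)
    scores = torch.tanh(keys + query.unsqueeze(1)) @ v        # additive attention scores (B, T)
    weights = F.softmax(scores, dim=-1)                       # attention over input positions
    context = (weights.unsqueeze(-1) * encoder_states).sum(1) # weighted sum of inputs (B, H)
    return context, weights

B, T, H = 2, 5, 16
ctx, w = attention_step(torch.randn(B, H), torch.randn(B, T, H),
                        torch.randn(H, H), torch.randn(H, H), torch.randn(H))
print(ctx.shape, w.shape)  # torch.Size([2, 16]) torch.Size([2, 5])
```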
Sequential Context Encoding for Duplicate Removal
Duplicate removal is a critical step for producing a reasonable number of
predictions in prevalent proposal-based object detection frameworks. Albeit
simple and effective, most previous algorithms utilize a greedy process without
making sufficient use of properties of input data. In this work, we design a
new two-stage framework to effectively select the appropriate proposal
candidate for each object. The first stage suppresses most of the easy negative
object proposals, while the second stage selects true positives in the reduced
proposal set. These two stages share the same network structure, i.e., an
encoder and a decoder formed as recurrent neural networks (RNN) with global
attention and context gate. The encoder scans proposal candidates in a
sequential manner to capture the global context information, which is then fed
to the decoder to extract optimal proposals. In our extensive experiments, the
proposed method outperforms other alternatives by a large margin.
Comment: Accepted in NIPS 201
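As a rough illustration of the encoder-decoder idea described above (not the authors' implementation, which also uses a context gate), the sketch below has a GRU encoder scan per-proposal features in score order to build global context, and a decoder with global attention over the encoder outputs emit a keep/suppress logit per proposal. The feature dimension (score plus box coordinates) and hidden sizes are assumptions.

```python
# Hedged sketch: sequential context encoding over proposals with global attention.
import torch
import torch.nn as nn

class DuplicateRemovalRNN(nn.Module):
    def __init__(self, feat_dim=5, hidden=64):       # feat_dim: e.g. score + 4 box coords
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, hidden)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, proposals):                     # (B, N, feat_dim), sorted by score
        enc_out, h = self.encoder(proposals)          # global context over all proposals
        dec_out, _ = self.decoder(proposals, h)       # (B, N, hidden)
        scores = self.attn(dec_out) @ enc_out.transpose(1, 2)   # (B, N, N) global attention
        context = torch.softmax(scores, dim=-1) @ enc_out       # attend over encoder states
        keep_logit = self.out(torch.cat([dec_out, context], dim=-1))  # (B, N, 1)
        return keep_logit.squeeze(-1)

keep = DuplicateRemovalRNN()(torch.randn(2, 10, 5))
print(keep.shape)  # torch.Size([2, 10]) per-proposal keep/suppress logits
```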
Order-Free RNN with Visual Attention for Multi-Label Classification
In this paper, we propose jointly learning attention and recurrent neural
network (RNN) models for multi-label classification. While approaches based on
the use of either model exist (e.g., for the task of image captioning),
training such existing network architectures typically requires pre-defined
label sequences. For multi-label classification, it would be desirable to have
a robust inference process, so that the prediction error would not propagate
and thus affect the performance. Our proposed model uniquely integrates
attention and Long Short Term Memory (LSTM) models, which not only addresses
the above problem but also allows one to identify visual objects of interest
with varying sizes without prior knowledge of a particular label ordering.
More importantly, label co-occurrence information can be jointly exploited by
our LSTM model. Finally, by advancing the technique of beam search, prediction
of multiple labels can be efficiently achieved by our proposed network model.
Comment: Accepted at 32nd AAAI Conference on Artificial Intelligence (AAAI-18
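A minimal sketch of the general recipe implied by the abstract (an assumption, not the paper's exact model): an LSTM attends over convolutional feature regions at each step and emits one label at a time, masking labels already predicted, so no fixed label order is imposed at inference time. Greedy decoding stands in for the beam search mentioned above; all dimensions are illustrative.

```python
# Hedged sketch of order-free multi-label decoding with visual attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnLabelDecoder(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, num_labels=20):
        super().__init__()
        self.embed = nn.Embedding(num_labels + 1, hidden)    # +1 for a <start> token
        self.attn = nn.Linear(hidden, feat_dim)
        self.cell = nn.LSTMCell(feat_dim + hidden, hidden)
        self.out = nn.Linear(hidden, num_labels)

    def forward(self, regions, max_labels=5):                 # regions: (B, R, feat_dim)
        B = regions.size(0)
        h = regions.new_zeros(B, self.cell.hidden_size)
        c = torch.zeros_like(h)
        token = torch.full((B,), self.embed.num_embeddings - 1, dtype=torch.long)
        forbidden = regions.new_zeros(B, self.out.out_features)
        labels = []
        for _ in range(max_labels):
            query = self.attn(h)                                       # (B, feat_dim)
            weights = F.softmax((regions * query.unsqueeze(1)).sum(-1), dim=-1)
            context = (weights.unsqueeze(-1) * regions).sum(1)         # attended visual feature
            h, c = self.cell(torch.cat([context, self.embed(token)], -1), (h, c))
            logits = self.out(h) + forbidden                           # mask previously emitted labels
            token = logits.argmax(-1)                                  # greedy; beam search in the paper
            forbidden = forbidden.scatter(1, token.unsqueeze(1), float("-inf"))
            labels.append(token)
        return torch.stack(labels, dim=1)                              # (B, max_labels)

preds = AttnLabelDecoder()(torch.randn(2, 49, 256))
print(preds.shape)  # torch.Size([2, 5])
```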
Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition
Attention-based learning for fine-grained image recognition remains a
challenging task, where most of the existing methods treat each object part in
isolation, while neglecting the correlations among them. In addition, the
multi-stage or multi-scale mechanisms involved make the existing methods less
efficient and hard to train end-to-end. In this paper, we propose a novel
attention-based convolutional neural network (CNN) which regulates multiple
object parts among different input images. Our method first learns multiple
attention region features of each input image through the one-squeeze
multi-excitation (OSME) module, and then applies the multi-attention multi-class
constraint (MAMC) in a metric learning framework. For each anchor feature, the
MAMC functions by pulling same-attention same-class features closer, while
pushing different-attention or different-class features away. Our method can be
easily trained end-to-end and is highly efficient, requiring only one
training stage. Moreover, we introduce Dogs-in-the-Wild, a comprehensive dog
species dataset that surpasses similar existing datasets in category coverage,
data volume and annotation quality. This dataset will be released upon
acceptance to facilitate the research of fine-grained image recognition.
Extensive experiments are conducted to show the substantial improvements of our
method on four benchmark datasets.
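One way to read the MAMC constraint described above is as an N-pair-style softmax loss over (image, attention-branch) features: for each anchor, features with the same attention branch and the same class are positives, everything else is a negative. The sketch below is such an interpretation; the temperature, normalization, and averaging choices are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a same-attention/same-class contrastive objective (assumed details).
import torch
import torch.nn.functional as F

def mamc_loss(features, labels, attn_ids, temperature=0.1):
    """features: (N, D); labels, attn_ids: (N,) class and attention-branch indices."""
    features = F.normalize(features, dim=-1)
    sim = features @ features.t() / temperature              # (N, N) scaled cosine similarities
    eye = torch.eye(len(labels), dtype=torch.bool)
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    same_attn = attn_ids.unsqueeze(0) == attn_ids.unsqueeze(1)
    pos = (same_class & same_attn) & ~eye                    # positives: same attention, same class
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    per_anchor = -(log_prob * pos.float()).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor[pos.any(1)].mean()                     # average over anchors that have positives

feats = torch.randn(8, 32)                  # e.g. 4 images x 2 attention branches
labels = torch.tensor([0, 0, 1, 1, 0, 0, 1, 1])
attn = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
print(mamc_loss(feats, labels, attn))
```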
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Inspired by recent work in machine translation and object detection, we
introduce an attention based model that automatically learns to describe the
content of images. We describe how we can train this model in a deterministic
manner using standard backpropagation techniques and stochastically by
maximizing a variational lower bound. We also show through visualization how
the model is able to automatically learn to fix its gaze on salient objects
while generating the corresponding words in the output sequence. We validate
the use of attention with state-of-the-art performance on three benchmark
datasets: Flickr8k, Flickr30k and MS COCO.
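The two training regimes mentioned in the abstract correspond to the "soft" and "hard" attention variants. A quick illustrative sketch: soft attention takes the expected annotation vector (fully differentiable with standard backpropagation), while hard attention samples one location and is trained by maximizing a variational lower bound with a score-function estimator. Shapes below are assumptions.

```python
# Soft vs. hard attention over CNN annotation vectors (illustrative shapes only).
import torch

annotations = torch.randn(196, 512)          # e.g. a 14x14 CNN feature map, 512-d per location
alpha = torch.softmax(torch.randn(196), 0)   # attention weights for the current output word

soft_context = alpha @ annotations           # (512,) expectation over locations, differentiable
loc = torch.multinomial(alpha, 1)            # hard attention: sample a single location
hard_context = annotations[loc].squeeze(0)   # (512,) gradient requires a REINFORCE-style estimator
print(soft_context.shape, hard_context.shape)
```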
Explainable Neural Computation via Stack Neural Module Networks
In complex inferential tasks like question answering, machine learning models
must confront two challenges: the need to implement a compositional reasoning
process, and, in many applications, the need for this reasoning process to be
interpretable to assist users in both development and prediction. Existing
models designed to produce interpretable traces of their decision-making
process typically require these traces to be supervised at training time. In
this paper, we present a novel neural modular approach that performs
compositional reasoning by automatically inducing a desired sub-task
decomposition without relying on strong supervision. Our model allows linking
different reasoning tasks through shared modules that handle common routines
across tasks. Experiments show that the model is more interpretable to human
evaluators compared to other state-of-the-art models: users can better
understand the model's underlying reasoning procedure and predict when it will
succeed or fail based on observing its intermediate outputs.
Comment: ECCV 201
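The mechanism that makes the sub-task decomposition learnable without layout supervision can be pictured as soft module selection: at each reasoning step a controller produces a distribution over a small library of modules, every module is run, and their outputs are blended by those weights. The sketch below is a rough interpretation of that idea with placeholder linear modules, not the paper's code.

```python
# Hedged sketch of soft module selection at one reasoning step (assumed module library).
import torch
import torch.nn as nn

class SoftModuleStep(nn.Module):
    def __init__(self, dim=64, num_modules=3):
        super().__init__()
        self.modules_list = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modules))
        self.controller = nn.Linear(dim, num_modules)

    def forward(self, state, question_vec):
        weights = torch.softmax(self.controller(question_vec), dim=-1)       # soft module choice
        outputs = torch.stack([m(state) for m in self.modules_list], dim=1)  # run every module (B, M, dim)
        return (weights.unsqueeze(-1) * outputs).sum(1)                      # weighted blend of module outputs

step = SoftModuleStep()
state = torch.randn(2, 64)
for _ in range(3):                           # a few reasoning steps, reusing the shared modules
    state = step(state, torch.randn(2, 64))
print(state.shape)  # torch.Size([2, 64])
```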
Exploring Models and Data for Remote Sensing Image Caption Generation
Inspired by the recent development of artificial satellites, remote sensing images
have attracted extensive attention. Recently, noticeable progress has been made
in scene classification and target detection. However, it is still not clear how
to describe the remote sensing image content with accurate and concise
sentences. In this paper, we investigate how to describe remote sensing images
with accurate and flexible sentences. First, some annotation instructions are
presented to better describe the remote sensing images considering the special
characteristics of remote sensing images. Second, in order to exhaustively
exploit the contents of remote sensing images, a large-scale aerial image data
set is constructed for remote sensing image captioning. Finally, a comprehensive
review is presented on the proposed data set to fully advance the task of
remote sensing image captioning. Extensive experiments on the proposed data set
demonstrate that the content of the remote sensing image can be completely
described by generating language descriptions. The data set is available at
https://github.com/201528014227051/RSICD_optimal
Comment: 14 pages, 8 figures
Describing Multimedia Content using Attention-based Encoder--Decoder Networks
Whereas deep neural networks were first mostly used for classification tasks,
they are rapidly expanding in the realm of structured output problems, where
the observed target is composed of multiple random variables that have a rich
joint distribution, given the input. We focus in this paper on the case where
the input also has a rich structure and the input and output structures are
somehow related. We describe systems that learn to attend to different places
in the input, for each element of the output, for a variety of tasks: machine
translation, image caption generation, video clip description and speech
recognition. All these systems are based on a shared set of building blocks:
gated recurrent neural networks and convolutional neural networks, along with
trained attention mechanisms. We report on experimental results with these
systems, showing impressively good performance and the advantage of the
attention mechanism.
Comment: Submitted to IEEE Transactions on Multimedia Special Issue on Deep
Learning for Multimedia Computing
End-to-End Video Captioning
Building correspondences across different modalities, such as video and
language, has recently become critical in many visual recognition applications,
such as video captioning. Inspired by machine translation, recent models tackle
this task using an encoder-decoder strategy. The (video) encoder is
traditionally a Convolutional Neural Network (CNN), while the decoding (for
language generation) is done using a Recurrent Neural Network (RNN). Current
state-of-the-art methods, however, train encoder and decoder separately. CNNs
are pretrained on object and/or action recognition tasks and used to encode
video-level features. The decoder is then optimised on such static features to
generate the video's description. This disjoint setup is arguably sub-optimal
for input (video) to output (description) mapping. In this work, we propose to
optimise both encoder and decoder simultaneously in an end-to-end fashion. In a
two-stage training setting, we first initialise our architecture using
pre-trained encoders and decoders -- then, the entire network is trained
end-to-end in a fine-tuning stage to learn the most relevant features for video
caption generation. In our experiments, we use GoogLeNet and
Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-) LSTM as a
decoder. Analogously to gains observed in other computer vision problems, we
show that end-to-end training significantly improves over the traditional,
disjoint training process. We evaluate our End-to-End (EtENet) Networks on the
Microsoft Research Video Description (MSVD) and the MSR Video to Text (MSR-VTT)
benchmark datasets, showing how EtENet achieves state-of-the-art performance
across the board.
Comment: Accepted at Large Scale Holistic Video Understanding, ICCVW 201
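A hedged sketch of the two-stage recipe described above: stage one trains only the decoder on features from a frozen pretrained encoder; stage two unfreezes the encoder and fine-tunes everything end-to-end, typically with a smaller encoder learning rate. The model classes, feature dimension, and learning rates below are placeholders, not the paper's values (torchvision >= 0.13 API assumed).

```python
# Two-stage end-to-end fine-tuning setup (assumed components and hyperparameters).
import torch
import torchvision

encoder = torchvision.models.googlenet(weights=None)   # load pretrained weights in practice
decoder = torch.nn.LSTM(input_size=1000, hidden_size=512, batch_first=True)  # stand-in SA-LSTM decoder

# Stage 1: freeze the encoder, optimise the decoder on static encoder features.
for p in encoder.parameters():
    p.requires_grad = False
stage1_opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

# Stage 2: unfreeze and fine-tune end-to-end, with a lower learning rate for the encoder.
for p in encoder.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": decoder.parameters(), "lr": 1e-4},
])
```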