178 research outputs found
Top-down Attention Recurrent VLAD Encoding for Action Recognition in Videos
Most recent approaches for action recognition from video leverage deep
architectures to encode the video clip into a fixed-length representation
vector that is then used for classification. For this to be successful, the
network must be capable of suppressing irrelevant scene background and
extracting the representation from the most discriminative parts of the video.
Our
contribution builds on the observation that spatio-temporal patterns
characterizing actions in videos are highly correlated with objects and their
location in the video. We propose Top-down Attention Action VLAD (TA-VLAD), a
deep recurrent architecture with built-in spatial attention that performs
temporally aggregated VLAD encoding for action recognition from videos. We
adopt a top-down approach to attention, using class-specific activation maps
obtained from a deep CNN pre-trained for image classification to weight
appearance features before encoding them into a fixed-length video descriptor
using Gated Recurrent Units. Our method achieves state-of-the-art recognition
accuracy on the HMDB51 and UCF101 benchmarks.
Comment: Accepted to the 17th International Conference of the Italian Association for Artificial Intelligence
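The attention-weighted encoding step described in the abstract can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: a vector of attention weights stands in for the class activation map, the soft-assignment kernel is a generic Gaussian, and the GRU-based temporal aggregation across frames is omitted. All names are assumptions.

```python
import numpy as np

def attention_weighted_vlad(features, attention, centers):
    """Soft-assignment VLAD over attention-weighted local features.

    features : (N, D) local appearance features from one frame
    attention: (N,)   non-negative top-down attention weights, standing
                      in for the class-specific activation map
    centers  : (K, D) visual-word centroids
    """
    w = attention / (attention.sum() + 1e-8)             # normalize attention
    # soft-assignment of each feature to each centroid (Gaussian kernel)
    d2 = ((features[:, None, :] - centers[None]) ** 2).sum(-1)    # (N, K)
    assign = np.exp(-(d2 - d2.min(axis=1, keepdims=True)))
    assign /= assign.sum(axis=1, keepdims=True)
    # attention-weighted aggregation of residuals into a (K, D) descriptor
    resid = features[:, None, :] - centers[None]                  # (N, K, D)
    vlad = (w[:, None, None] * assign[..., None] * resid).sum(0)  # (K, D)
    vlad /= np.linalg.norm(vlad) + 1e-8                           # L2-normalize
    return vlad.ravel()

rng = np.random.default_rng(0)
feats = rng.standard_normal((49, 8))   # e.g. a 7x7 feature map, D = 8
attn = rng.random(49)                  # toy attention weights
cents = rng.standard_normal((4, 8))    # K = 4 visual words
desc = attention_weighted_vlad(feats, attn, cents)
print(desc.shape)                      # (32,)
```

In the paper, one such descriptor per frame would then be aggregated over time by the recurrent units; here each frame simply yields a K·D-dimensional L2-normalized vector.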
Deep Image Retrieval: A Survey
In recent years, a vast amount of visual content has been generated and shared
in fields such as social media, medicine, and robotics. This abundance of
content creation and sharing has introduced new challenges. In particular,
searching databases for similar content, i.e. content-based image retrieval
(CBIR), is a long-established research area in which more efficient and
accurate methods are needed for real-time retrieval. Artificial
intelligence has made progress in CBIR and has significantly facilitated the
process of intelligent search. In this survey we organize and review recent
CBIR works that are developed based on deep learning algorithms and techniques,
including insights and techniques from recent papers. We identify and present
the commonly-used benchmarks and evaluation methods used in the field. We
collect common challenges and propose promising future directions. More
specifically, we focus on image retrieval with deep learning and organize the
state-of-the-art methods according to the type of deep network structure, deep
features, feature enhancement methods, and network fine-tuning strategies. Our
survey considers a wide variety of recent methods, aiming to promote a global
view of the field of instance-based CBIR.
Comment: 20 pages, 11 figures
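At its core, deep-learning-based CBIR ranks database images by the similarity of their deep descriptors to the query descriptor. A minimal sketch of that ranking step, with random vectors standing in for CNN features (the function name and setup are illustrative, not taken from the survey):

```python
import numpy as np

def rank_by_cosine(query, database):
    """Return database indices ranked by cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every item
    order = np.argsort(-sims)          # best match first
    return order, sims[order]

rng = np.random.default_rng(1)
db = rng.standard_normal((100, 16))              # 100 database descriptors
query = db[42] + 0.05 * rng.standard_normal(16)  # noisy view of item 42
order, sims = rank_by_cosine(query, db)
print(order[0])                                  # 42
```

The survey's themes, feature enhancement and network fine-tuning, amount to learning descriptors for which this simple nearest-neighbor ranking places true matches first.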
Attention-based Pyramid Aggregation Network for Visual Place Recognition
Visual place recognition is challenging in urban environments and is usually
cast as a large-scale image retrieval task. Two intrinsic challenges stand out:
confusing objects such as cars and trees frequently appear in complex urban
scenes, and buildings with repetitive structures can cause over-counting and
the burstiness problem, both of which degrade the image representations. To
address these problems, we present an Attention-based
Pyramid Aggregation Network (APANet), which is trained in an end-to-end manner
for place recognition. One main component of APANet, the spatial pyramid
pooling, can effectively encode the multi-size buildings containing
geo-information. The other one, the attention block, is adopted as a region
evaluator for suppressing the confusing regional features while highlighting
the discriminative ones. At test time, we further propose a simple yet
effective PCA power whitening strategy, which significantly improves the widely
used PCA whitening by reasonably limiting the impact of over-counting.
Experimental evaluations demonstrate that the proposed APANet outperforms the
state-of-the-art methods on two place recognition benchmarks, and generalizes
well on standard image retrieval datasets.
Comment: Accepted to ACM Multimedia 201
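The general idea behind PCA power whitening can be sketched as below. This is a generic illustration of tempering the whitening exponent to limit over-counting, not APANet's exact formulation; the function name and the choice of exponent are assumptions.

```python
import numpy as np

def pca_power_whiten(X, alpha=0.5):
    """PCA whitening with a power exponent on the eigenvalues.

    alpha = 0.5 recovers standard PCA whitening; smaller exponents only
    partially equalize the variance along each principal direction, which
    softens how aggressively bursty (over-counted) dimensions are
    down-weighted.  (The exponent used by APANet is an assumption here.)
    """
    Xc = X - X.mean(axis=0)               # center the descriptors
    cov = Xc.T @ Xc / len(Xc)             # covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)  # eigendecomposition
    proj = Xc @ eigvec                    # rotate into the PCA basis
    return proj / (eigval + 1e-8) ** alpha

rng = np.random.default_rng(2)
# toy descriptors with very unequal variance per dimension
X = rng.standard_normal((500, 6)) * np.array([5, 3, 1, 1, 0.5, 0.1])
W = pca_power_whiten(X, alpha=0.5)
print(np.allclose(W.var(axis=0), 1.0, atol=0.1))  # True: variances equalized
```

With alpha = 0.5 the output variances are all close to one; choosing alpha below 0.5 would leave some of the original variance structure intact, which is the lever such a strategy uses to balance burstiness suppression against discriminative signal.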
REMAP: Multi-layer entropy-guided pooling of dense CNN features for image retrieval
This paper addresses the problem of very large-scale image retrieval,
focusing on improving its accuracy and robustness. We target enhanced
robustness of search to factors such as variations in illumination, object
appearance and scale, partial occlusions, and cluttered backgrounds -
particularly important when search is performed across very large datasets with
significant variability. We propose a novel CNN-based global descriptor, called
REMAP, which learns and aggregates a hierarchy of deep features from multiple
CNN layers, and is trained end-to-end with a triplet loss. REMAP explicitly
learns discriminative features which are mutually-supportive and complementary
at various semantic levels of visual abstraction. These dense local features
are max-pooled spatially at each layer, within multi-scale overlapping regions,
before aggregation into a single image-level descriptor. To identify the
semantically useful regions and layers for retrieval, we propose to measure the
information gain of each region and layer using KL-divergence. Our system
effectively learns during training how useful various regions and layers are
and weights them accordingly. We show that such relative entropy-guided
aggregation outperforms classical CNN-based aggregation controlled by SGD. The
entire framework is trained in an end-to-end fashion, outperforming the latest
state-of-the-art results. On image retrieval datasets Holidays, Oxford and
MPEG, the REMAP descriptor achieves mAP of 95.5%, 91.5%, and 80.1%
respectively, outperforming any results published to date. REMAP also formed
the core of the winning submission to the Google Landmark Retrieval Challenge
on Kaggle.
Comment: Submitted to IEEE Trans. Image Processing on 24 May 2018, published 22 May 201
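The entropy-guided weighting idea, scoring how informative a region (or layer) is by the divergence between its distance distributions for matching and non-matching pairs, might be sketched as follows. The histogram-based KL estimator and all names here are assumptions, and the way REMAP folds such weights back into end-to-end training is simplified away.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """Discrete KL-divergence D(p || q) between two histograms."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def region_weights(match_dists, nonmatch_dists, bins=20):
    """Weight each region by the KL-divergence between the distance
    distributions of matching and non-matching descriptor pairs: regions
    whose distances separate the two classes well carry more information
    and receive larger weights."""
    weights = []
    for m, n in zip(match_dists, nonmatch_dists):
        lo, hi = min(m.min(), n.min()), max(m.max(), n.max())
        pm, _ = np.histogram(m, bins=bins, range=(lo, hi))
        pn, _ = np.histogram(n, bins=bins, range=(lo, hi))
        weights.append(kl_divergence(pm.astype(float), pn.astype(float)))
    w = np.array(weights)
    return w / w.sum()

rng = np.random.default_rng(3)
# region 0: well-separated distances; region 1: heavily overlapping ones
match = [rng.normal(0.2, 0.05, 1000), rng.normal(0.50, 0.2, 1000)]
nonmatch = [rng.normal(0.8, 0.05, 1000), rng.normal(0.55, 0.2, 1000)]
w = region_weights(match, nonmatch)
print(w[0] > w[1])  # True: the discriminative region gets more weight
```

The discriminative region ends up with a much larger weight, which is the relative-entropy signal the abstract contrasts with leaving the aggregation weights entirely to SGD.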