An accurate retrieval through R-MAC+ descriptors for landmark recognition
The landmark recognition problem is far from being solved, but excellent
results have been obtained using features extracted from intermediate layers
of Convolutional Neural Networks (CNNs). In this work, we propose some
improvements to the creation of R-MAC descriptors in order to make the
newly-proposed R-MAC+ descriptors more representative than the previous ones.
However, the main contribution of this paper is a novel retrieval technique
that exploits the fine representativeness of the MAC descriptors of the
database images. Using these descriptors, called "db regions", during the
retrieval stage greatly improves performance. The proposed method is
tested on different public datasets: Oxford5k, Paris6k and Holidays. It
outperforms the state-of-the-art results on Holidays and reaches excellent
results on Oxford5k and Paris6k, surpassed only by approaches based on
fine-tuning strategies.
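R-MAC builds an image descriptor by aggregating regional maximum-activation (MAC) vectors from a CNN feature map. As a rough illustration of the base pipeline this paper builds on (not the proposed R-MAC+ variant or its "db regions" retrieval), a minimal NumPy sketch with an approximate region grid:

```python
import numpy as np

def mac(features):
    """MAC descriptor: channel-wise max over all spatial locations,
    followed by L2 normalisation. `features` has shape (C, H, W)."""
    v = features.max(axis=(1, 2))
    return v / (np.linalg.norm(v) + 1e-12)

def rmac(features, scales=(1, 2, 3)):
    """Simplified R-MAC: compute MAC over square regions sampled on a
    uniform grid at several scales, sum the regional vectors, and
    L2-normalise the result. The region layout here is a rough
    approximation; the paper's R-MAC+ refinements are not reproduced."""
    C, H, W = features.shape
    agg = np.zeros(C)
    for s in scales:
        # region side roughly 2*min(H, W)/(s+1), as in the original R-MAC
        side = max(1, int(round(2 * min(H, W) / (s + 1))))
        for y in np.linspace(0, H - side, s).astype(int):
            for x in np.linspace(0, W - side, s).astype(int):
                agg += mac(features[:, y:y + side, x:x + side])
    return agg / (np.linalg.norm(agg) + 1e-12)
```

In practice `features` would be the activations of an intermediate convolutional layer, and the resulting unit-norm vectors are compared by dot product at retrieval time.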
FastSal: a Computationally Efficient Network for Visual Saliency Prediction
This paper focuses on the problem of visual saliency prediction, predicting
regions of an image that tend to attract human visual attention, under a
constrained computational budget. We modify and test various recent efficient
convolutional neural network architectures like EfficientNet and MobileNetV2
and compare them with existing state-of-the-art saliency models such as SalGAN
and DeepGaze II, both in terms of standard accuracy metrics like Area Under
Curve (AUC) and Normalized Scanpath Saliency (NSS),
and in terms of the computational complexity and model size. We find that
MobileNetV2 makes an excellent backbone for a visual saliency model and can be
effective even without a complex decoder. We also show that knowledge transfer
from a more computationally expensive model like DeepGaze II can be achieved
via pseudo-labelling an unlabelled dataset, and that this approach gives
results on par with many state-of-the-art algorithms at a fraction of the
computational cost and model size. Source code is available at
https://github.com/feiyanhu/FastSal
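The knowledge-transfer step described above, where an expensive teacher pseudo-labels unlabelled images for a lightweight student, can be sketched end-to-end. Both models below are hypothetical toy stand-ins (in the paper the teacher is DeepGaze II and the student is MobileNetV2-based); only the pseudo-labelling loop itself is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_predict(image):
    """Toy 'expensive teacher': a per-pixel saliency score.
    Stand-in for a heavy model such as DeepGaze II."""
    return 1.0 / (1.0 + np.exp(-image.mean(axis=-1)))

class StudentModel:
    """Tiny per-pixel logistic model standing in for the efficient student."""
    def __init__(self, n_channels):
        self.w = np.zeros(n_channels)

    def predict(self, image):
        return 1.0 / (1.0 + np.exp(-(image @ self.w)))

    def train_step(self, image, target, lr=0.1):
        pred = self.predict(image)
        # gradient of per-pixel binary cross-entropy w.r.t. w
        grad = ((pred - target)[..., None] * image).mean(axis=(0, 1))
        self.w -= lr * grad

# 1) the teacher pseudo-labels an unlabelled set;
# 2) the student is trained on those pseudo-labels only.
unlabelled = [rng.random((8, 8, 3)) for _ in range(16)]
pseudo_labels = [teacher_predict(im) for im in unlabelled]

student = StudentModel(n_channels=3)
for _ in range(200):
    for im, lab in zip(unlabelled, pseudo_labels):
        student.train_step(im, lab)
```

No human annotations appear anywhere in the loop: the student's supervision comes entirely from the teacher's predictions, which is what makes the approach cheap to scale to large unlabelled datasets.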
Utilising Visual Attention Cues for Vehicle Detection and Tracking
Advanced Driver-Assistance Systems (ADAS) have been attracting attention from
many researchers. Vision-based sensors are the closest way to emulate human
driver visual behavior while driving. In this paper, we explore possible ways
to use visual attention (saliency) for object detection and tracking. We
investigate: 1) How a visual attention map, such as a subjectness
attention or saliency map, and an objectness attention map can facilitate
region proposal generation in a 2-stage object detector; 2) How a visual
attention map can be used for tracking multiple objects. We propose a neural
network that can simultaneously detect objects and generate objectness and
subjectness maps to save computational power. We further exploit the visual
attention map during tracking using a sequential Monte Carlo probability
hypothesis density (PHD) filter. The experiments are conducted on KITTI and
DETRAC datasets. The use of visual attention and hierarchical features has
shown a considerable improvement of ≈8% in object detection, which
effectively increased tracking performance by ≈4% on the KITTI dataset. Comment: Accepted in ICPR202
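One way an attention map can enter a sequential Monte Carlo filter is as an importance weight on particle locations. The following is a toy sketch of that single idea only, not the paper's full PHD filter, which also models target birth, clutter and detection probability:

```python
import numpy as np

rng = np.random.default_rng(1)

def attention_weight_update(particles, weights, attention_map):
    """Re-weight SMC particles by the attention value at each particle's
    (x, y) image location, then renormalise. This is only a
    measurement-style update; birth, clutter and detection-probability
    terms of a real PHD filter are omitted."""
    h, w = attention_map.shape
    xs = np.clip(particles[:, 0].astype(int), 0, w - 1)
    ys = np.clip(particles[:, 1].astype(int), 0, h - 1)
    new_w = weights * attention_map[ys, xs]
    return new_w / (new_w.sum() + 1e-12)

# toy attention map: a salient region covering x in [8, 20), y in [10, 25)
att = np.zeros((32, 32))
att[10:25, 8:20] = 1.0

particles = rng.uniform(0, 32, size=(500, 2))   # (x, y) positions
weights = np.full(500, 1.0 / 500)
weights = attention_weight_update(particles, weights, att)
# after the update, only particles inside the salient region carry mass
```

The attention map thus acts like a spatial likelihood: particles drifting onto non-salient background lose weight and are eliminated at the next resampling step.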
Attention-based Pyramid Aggregation Network for Visual Place Recognition
Visual place recognition is challenging in the urban environment and is
usually viewed as a large-scale image retrieval task. The intrinsic challenges
of place recognition are that confusing objects such as cars and trees
frequently occur in complex urban scenes, and that buildings with repetitive
structures may cause over-counting and the burstiness problem, degrading the
image representations. To address these problems, we present an Attention-based
Pyramid Aggregation Network (APANet), which is trained in an end-to-end manner
for place recognition. One main component of APANet, the spatial pyramid
pooling, can effectively encode the multi-size buildings containing
geo-information. The other one, the attention block, is adopted as a region
evaluator for suppressing the confusing regional features while highlighting
the discriminative ones. At test time, we further propose a simple yet
effective PCA power whitening strategy, which significantly improves the widely
used PCA whitening by reasonably limiting the impact of over-counting.
Experimental evaluations demonstrate that the proposed APANet outperforms the
state-of-the-art methods on two place recognition benchmarks, and generalizes
well on standard image retrieval datasets. Comment: Accepted to ACM Multimedia 201
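Standard PCA whitening rescales each principal component by its eigenvalue to the power -1/2; a "power whitening" variant softens that exponent so low-energy directions are not amplified as strongly, which is one way to limit over-counting effects. A hedged sketch, where the exponent 0.25 and the final renormalisation are illustrative assumptions rather than APANet's exact formulation:

```python
import numpy as np

def pca_power_whiten(X, alpha=0.25, eps=1e-10):
    """PCA whitening with a softened eigenvalue exponent. Standard PCA
    whitening rescales component j by eigval_j**-0.5; a smaller exponent
    `alpha` limits how strongly low-variance directions get amplified.
    `X` holds one descriptor per row, shape (N, D)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)              # ascending eigenvalues
    scale = (np.maximum(eigvals, 0.0) + eps) ** -alpha  # softened whitening
    Y = Xc @ (eigvecs * scale)                          # project, then rescale
    # L2-normalise the whitened descriptors, as is common in retrieval
    return Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12)
```

Setting `alpha=0.5` recovers ordinary PCA whitening, so the exponent gives a single knob between full whitening and plain PCA rotation.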
Benchmarking unsupervised near-duplicate image detection
Unsupervised near-duplicate detection has many practical applications, ranging from social media analysis and web-scale retrieval to digital image forensics. It entails running a threshold-limited query on a set of descriptors extracted from the images, with the goal of identifying all possible near-duplicates while limiting the false positives due to visually similar images. Since the rate of false alarms grows with the dataset size, a very high specificity is required, up to 1 - 10^-9 for realistic use cases; this important requirement, however, is often overlooked in the literature. In recent years, descriptors based on deep convolutional neural networks have matched or surpassed traditional feature extraction methods in content-based image retrieval tasks. To the best of our knowledge, ours is the first attempt to establish the performance range of deep learning-based descriptors for unsupervised near-duplicate detection on a range of datasets encompassing a broad spectrum of near-duplicate definitions. We leverage both established and new benchmarks, such as the Mir-Flickr Near-Duplicate (MFND) dataset, in which a known ground truth is provided for all possible pairs over a general, large-scale image collection. To compare the specificity of different descriptors, we reduce the problem of unsupervised detection to that of binary classification of near-duplicate vs. not-near-duplicate images. The latter can be conveniently characterized using the Receiver Operating Characteristic (ROC) curve. Our findings in general favor fine-tuning deep convolutional networks, as opposed to using off-the-shelf features, but differences at high-specificity settings depend on the dataset and are often small. The best performance was observed on the MFND benchmark, achieving 96% sensitivity at a false positive rate of 1.43×10^-6
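The reduction described in the abstract, near-duplicate detection as thresholded binary classification of descriptor-pair similarities, can be illustrated on synthetic data. Everything below is a toy construction (random vectors standing in for CNN descriptors), not the benchmark's actual data or operating points:

```python
import numpy as np

rng = np.random.default_rng(2)

def pair_scores(desc_a, desc_b):
    """Cosine similarity for row-wise pairs of L2-normalised descriptors."""
    return np.sum(desc_a * desc_b, axis=1)

def sensitivity_at_fpr(pos_scores, neg_scores, target_fpr):
    """Set the threshold from the negative (not-near-duplicate) score
    distribution so that at most `target_fpr` of negatives pass, then
    report the fraction of positives above it: one operating point of
    the ROC curve used to compare descriptors."""
    thr = np.quantile(neg_scores, 1.0 - target_fpr)
    return float(np.mean(pos_scores >= thr))

def unit(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# synthetic descriptors: near-duplicates are noisy copies, negatives random
base = unit(rng.normal(size=(1000, 64)))
dups = unit(base + 0.1 * rng.normal(size=(1000, 64)))
others = unit(rng.normal(size=(1000, 64)))

pos = pair_scores(base, dups)      # near-duplicate pairs: high similarity
neg = pair_scores(base, others)    # unrelated pairs: similarity near zero
```

At the extreme specificities the abstract argues for (false positive rates around 10^-9), the threshold sits deep in the tail of the negative distribution, which is why very large negative samples and careful benchmarking are needed to estimate it reliably.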