A Unified Framework for Local Visual Descriptors Evaluation
Local descriptors are the ground layer of feature-based recognition systems for still images and video. We propose a new framework to explain local descriptors. This framework is based on decomposing descriptors into three levels: primitive extraction, primitive coding and code aggregation. With this framework, we are able to explain most of the popular descriptors in the literature, such as HOG, HOF and SURF. We propose two new projection methods based on approximation with oscillating function bases (sine and Legendre polynomials). Using our framework, we are able to extend usual descriptors by changing the code aggregation or adding new primitive coding methods. The experiments are carried out on image (VOC 2007) and video datasets (KTH, Hollywood2 and UCF11), and achieve performance equal to or better than the literature.
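As an informal illustration of the projection idea, the sketch below codes a 1D primitive signal with Legendre and sine bases; the sample signal, basis sizes and function names are assumptions for demonstration, not details taken from the paper:

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_projection(signal, degree=4):
    """Least-squares projection of a 1D primitive signal (e.g. a gradient
    profile sampled along one patch dimension) onto Legendre polynomials.
    The coefficient vector serves as the primitive code."""
    x = np.linspace(-1.0, 1.0, len(signal))        # Legendre domain
    basis = legendre.legvander(x, degree)          # shape (N, degree + 1)
    coeffs, *_ = np.linalg.lstsq(basis, signal, rcond=None)
    return coeffs

def sine_projection(signal, n_harmonics=4):
    """Same idea with a truncated sine basis."""
    x = np.linspace(0.0, np.pi, len(signal))
    basis = np.stack([np.sin(k * x) for k in range(1, n_harmonics + 1)], axis=1)
    coeffs, *_ = np.linalg.lstsq(basis, signal, rcond=None)
    return coeffs

# Toy usage: code a noisy oscillating primitive with 5 Legendre coefficients.
patch_signal = np.cos(np.linspace(0, 3, 64)) + 0.05 * np.random.randn(64)
print(legendre_projection(patch_signal, degree=4))
```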
Effective Image Retrieval via Multilinear Multi-index Fusion
Multi-index fusion has demonstrated impressive performance in retrieval tasks
by integrating different visual representations in a unified framework.
However, previous works mainly consider propagating similarities via neighbor
structure, ignoring the high order information among different visual
representations. In this paper, we propose a new multi-index fusion scheme for
image retrieval. By formulating this procedure as a multilinear-based
optimization problem, the complementary information hidden in different indexes
can be explored more thoroughly. Specifically, we first build our multiple indexes
from various visual representations. Then a so-called index-specific functional
matrix, which aims to propagate similarities, is introduced for updating the
original index. The functional matrices are then optimized in a unified tensor
space to achieve a refinement, such that relevant images are pushed closer
together. The optimization problem can be efficiently solved by the augmented
Lagrangian method with a theoretical convergence guarantee. Unlike the
traditional multi-index fusion scheme, our approach embeds the multi-index
subspace structure into the new indexes with a sparse constraint, so it has
little additional memory consumption in the online query stage. Experimental
evaluation on three benchmark datasets reveals that the proposed approach
achieves state-of-the-art performance, i.e., an N-score of 3.94 on UKBench,
mAP of 94.1% on Holiday and 62.39% on Market-1501.
Comment: 12 pages
Exploiting Deep Features for Remote Sensing Image Retrieval: A Systematic Investigation
Remote sensing (RS) image retrieval is of great significance for geological
information mining. Over the past two decades, a large amount of research on
this task has been carried out, which mainly focuses on the following three
core issues: feature extraction, similarity metric and relevance feedback. Due
to the complexity and multiformity of ground objects in high-resolution remote
sensing (HRRS) images, there is still room for improvement in the current
retrieval approaches. In this paper, we analyze the three core issues of RS
image retrieval and provide a comprehensive review on existing methods.
Furthermore, with the goal of advancing the state of the art in HRRS image
retrieval, we focus on the feature extraction issue and delve into how to use
powerful deep representations to address this task. We conduct a systematic
investigation of the factors that may affect the performance
of deep features. By optimizing each factor, we acquire remarkable retrieval
results on publicly available HRRS datasets. Finally, we explain the
experimental phenomena in detail and draw conclusions from our analysis. Our
work can serve as a guide for research on content-based RS image retrieval.
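As an informal illustration of the retrieval step once deep features are available, the following sketch ranks a database by cosine similarity; the array shapes and random stand-in features are assumptions for demonstration only:

```python
import numpy as np

def cosine_retrieval(query_feat, database_feats, top_k=10):
    """Rank database images by cosine similarity to a query feature.
    query_feat: (D,) deep feature; database_feats: (N, D) matrix."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-12)
    db = database_feats / (np.linalg.norm(database_feats, axis=1, keepdims=True) + 1e-12)
    scores = db @ q                        # cosine similarities, shape (N,)
    ranking = np.argsort(-scores)[:top_k]
    return ranking, scores[ranking]

# Toy usage with random stand-ins for pooled CNN features of HRRS images.
rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 2048))
query = database[42] + 0.1 * rng.normal(size=2048)
print(cosine_retrieval(query, database, top_k=5)[0])
```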
LandmarkBoost: Efficient Visual Context Classifiers for Robust Localization
The growing popularity of autonomous systems creates a need for reliable and
efficient metric pose retrieval algorithms. Currently used approaches tend to
rely on nearest neighbor search of binary descriptors to perform the 2D-3D
matching and guarantee realtime capabilities on mobile platforms. These methods
struggle, however, with the growing size of the map, changes in viewpoint or
appearance, and visual aliasing present in the environment. The rigidly defined
descriptor patterns only capture a limited neighborhood of the keypoint and
completely ignore the overall visual context.
We propose LandmarkBoost, an approach that, in contrast to the conventional
2D-3D matching methods, casts the search problem as a landmark classification
task. We use a boosted classifier to classify landmark observations and
directly obtain correspondences as classifier scores. We also introduce a
formulation of visual context that is flexible, efficient to compute, and can
capture relationships in the entire image plane. The original binary
descriptors are augmented with contextual information and informative features
are selected by the boosting framework. Through detailed experiments, we
evaluate the retrieval quality and performance of LandmarkBoost, demonstrating
that it outperforms common state-of-the-art descriptor matching methods.
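The classification-as-retrieval idea can be illustrated with a generic boosted classifier; the toy feature layout, landmark labels and scikit-learn model below are assumptions for illustration, not the paper's actual boosting formulation:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n_landmarks, n_obs, dim = 20, 2000, 40   # toy map with 20 landmarks

# Each observation = binary descriptor bits augmented with context features.
X = rng.normal(size=(n_obs, dim))
y = rng.integers(0, n_landmarks, size=n_obs)   # landmark id per observation

clf = GradientBoostingClassifier(n_estimators=50, max_depth=3)
clf.fit(X, y)

# At query time, classifier scores directly yield candidate correspondences.
query_obs = rng.normal(size=(5, dim))
scores = clf.predict_proba(query_obs)          # shape (5, n_landmarks)
matches = scores.argmax(axis=1)                # best landmark per keypoint
print(matches, scores.max(axis=1))
```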
A Review of Codebook Models in Patch-Based Visual Object Recognition
The codebook model-based approach, while ignoring any structural aspect in vision, nonetheless provides state-of-the-art performance on current datasets. The key role of a visual codebook is to provide a way to map the low-level features into a fixed-length vector in histogram space to which standard classifiers can be directly applied. The discriminative power of such a visual codebook determines the quality of the codebook model, whereas the size of the codebook controls the complexity of the model. Thus, the construction of a codebook is an important step, which is usually done by cluster analysis. However, clustering is a process that retains regions of high density in a distribution, and it follows that the resulting codebook need not have discriminant properties. This is also recognised as a computational bottleneck of such systems. In our recent work, we proposed a resource-allocating codebook that constructs a discriminant codebook in a one-pass design procedure and slightly outperforms more traditional approaches at drastically reduced computing times. In this review, we survey several approaches proposed over the last decade, covering their use of feature detectors, descriptors, codebook construction schemes, choice of classifiers for recognising objects, and the datasets used to evaluate the proposed methods.
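A minimal sketch of the clustering-based codebook pipeline described above, using k-means; the codebook size, descriptor dimensionality and random stand-in descriptors are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, codebook_size=256):
    """Standard clustering-based codebook: k-means over local descriptors."""
    return KMeans(n_clusters=codebook_size, n_init=4, random_state=0).fit(all_descriptors)

def encode_image(codebook, image_descriptors):
    """Map an image's local descriptors to a fixed-length histogram."""
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)          # L1-normalised histogram

# Toy usage with random stand-ins for, e.g., SIFT descriptors.
rng = np.random.default_rng(0)
train_desc = rng.normal(size=(5000, 128))
codebook = build_codebook(train_desc, codebook_size=64)
print(encode_image(codebook, rng.normal(size=(300, 128))).shape)  # (64,)
```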
From Image to Text Classification: A Novel Approach based on Clustering Word Embeddings
In this paper, we propose a novel approach for text classification based on
clustering word embeddings, inspired by the bag of visual words model, which is
widely used in computer vision. After each word in a collection of documents is
represented as a word vector using a pre-trained word embedding model, a k-means
algorithm is applied to the word vectors in order to obtain a fixed-size set of
clusters. The centroid of each cluster is interpreted as a super word embedding
that embodies all the semantically related word vectors in a certain region of
the embedding space. Every embedded word in the collection of documents is then
assigned to the nearest cluster centroid. In the end, each document is
represented as a bag of super word embeddings by computing the frequency of
each super word embedding in the respective document. We also diverge from the
idea of building a single vocabulary for the entire collection of documents,
and propose to build class-specific vocabularies for better performance. Using
this kind of representation, we report results on two text mining tasks, namely
text categorization by topic and polarity classification. On both tasks, our
model yields better performance than the standard bag of words.
Comment: Accepted at KES 201
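A compact sketch of the bag-of-super-word-embeddings representation (single shared vocabulary, not the class-specific variant); the embedding dimensionality, cluster count and toy data are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_super_word_embeddings(documents, word_vectors, n_clusters=50):
    """documents: list of token lists; word_vectors: dict word -> vector."""
    vocab = sorted({w for doc in documents for w in doc if w in word_vectors})
    vecs = np.stack([word_vectors[w] for w in vocab])
    kmeans = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(vecs)
    word_to_cluster = dict(zip(vocab, kmeans.labels_))  # nearest centroid

    features = np.zeros((len(documents), n_clusters))
    for i, doc in enumerate(documents):
        for w in doc:
            if w in word_to_cluster:
                features[i, word_to_cluster[w]] += 1     # super-word frequency
    return features

# Toy usage with random stand-ins for pre-trained embeddings.
rng = np.random.default_rng(0)
vocab = [f"w{i}" for i in range(200)]
embeddings = {w: rng.normal(size=50) for w in vocab}
docs = [list(rng.choice(vocab, size=30)) for _ in range(4)]
print(bag_of_super_word_embeddings(docs, embeddings, n_clusters=10).shape)
```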
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper gives futuristic challenges discussed in the cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers in several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
Deep Structured Models For Group Activity Recognition
This paper presents a deep neural-network-based hierarchical graphical model
for individual and group activity recognition in surveillance scenes. Deep
networks are used to recognize the actions of individual people in a scene.
Next, a neural-network-based hierarchical graphical model refines the predicted
labels for each class by considering dependencies between the classes. This
refinement step mimics a message-passing step similar to inference in a
probabilistic graphical model. We show that this approach can be effective in
group activity recognition, with the deep graphical model improving recognition
rates over baseline methods.
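A single refinement step of this kind can be sketched in a few lines; the class-dependency matrix, shapes and update rule below are illustrative assumptions rather than the paper's trained hierarchical model:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def refine_scores(person_scores, class_dependency, n_steps=2):
    """Message-passing-style refinement: each person's class scores are
    updated with evidence propagated through a class-dependency matrix.
    person_scores: (n_people, n_classes) initial per-person action scores."""
    s = person_scores.copy()
    for _ in range(n_steps):
        messages = s @ class_dependency           # propagate class evidence
        s = softmax(person_scores + messages)     # combine with unary scores
    return s

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 8))                  # 5 people, 8 action classes
W = 0.1 * rng.normal(size=(8, 8))                 # toy dependency matrix
print(refine_scores(scores, W).sum(axis=1))       # each row sums to 1
```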
Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification
Videos are inherently multimodal. This paper studies the problem of how to
fully exploit the abundant multimodal clues for improved video categorization.
We introduce a hybrid deep learning framework that integrates useful clues from
multiple modalities, including static spatial appearance information, motion
patterns within a short time window, audio information as well as long-range
temporal dynamics. More specifically, we utilize three Convolutional Neural
Networks (CNNs) operating on appearance, motion and audio signals to extract
their corresponding features. We then employ a feature fusion network to derive
a unified representation with an aim to capture the relationships among
features. Furthermore, to exploit the long-range temporal dynamics in videos,
we apply two Long Short-Term Memory (LSTM) networks with extracted appearance and
motion features as inputs. Finally, we also propose to refine the prediction
scores by leveraging contextual relationships among video semantics. The hybrid
deep learning framework is able to exploit a comprehensive set of multimodal
features for video classification. Through an extensive set of experiments, we
demonstrate that (1) LSTM networks which model sequences in an explicitly
recurrent manner are highly complementary with CNN models; (2) the feature
fusion network which produces a fused representation through modeling feature
relationships outperforms alternative fusion strategies; (3) the semantic
context of video classes can help further refine the predictions for improved
performance. Experimental results on two challenging benchmarks, the UCF-101
and the Columbia Consumer Videos (CCV), provide strong quantitative evidence
that our framework achieves promising results on both datasets, outperforming
competing methods with clear margins.
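A minimal sketch of a feature fusion network of this kind in PyTorch; the feature dimensions, layer sizes and class count are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class FeatureFusionNet(nn.Module):
    """Toy fusion network: concatenates appearance, motion and audio features
    and learns a joint representation plus class scores."""
    def __init__(self, dims=(2048, 2048, 128), fused_dim=512, n_classes=101):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims), fused_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
        )
        self.classifier = nn.Linear(fused_dim, n_classes)

    def forward(self, appearance, motion, audio):
        fused = self.fuse(torch.cat([appearance, motion, audio], dim=1))
        return self.classifier(fused)

# Toy usage with random stand-ins for per-video features from the three CNNs.
net = FeatureFusionNet()
scores = net(torch.randn(4, 2048), torch.randn(4, 2048), torch.randn(4, 128))
print(scores.shape)  # torch.Size([4, 101])
```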
3D Registration of Aerial and Ground Robots for Disaster Response: An Evaluation of Features, Descriptors, and Transformation Estimation
Global registration of heterogeneous ground and aerial mapping data is a
challenging task. This is especially difficult in disaster response scenarios
when we have no prior information on the environment and cannot assume the
regular order of man-made environments or meaningful semantic cues. In this
work we extensively evaluate different approaches to globally register UGV
generated 3D point-cloud data from LiDAR sensors with UAV generated point-cloud
maps from vision sensors. The approaches are realizations of different
selections for: a) local features: key-points or segments; b) descriptors:
FPFH, SHOT, or ESF; and c) transformation estimations: RANSAC or FGR.
Additionally, we compare the results against standard approaches like applying
ICP after a good prior transformation has been given. The evaluation criteria
include the distance which a UGV needs to travel to successfully localize, the
registration error, and the computational cost. In this context, we report our
findings on effectively performing the task on two new Search and Rescue
datasets. Our results have the potential to help the community make informed
decisions when registering point-cloud maps from ground robots to those from
aerial robots.
Comment: Awarded Best Paper at the 15th IEEE International Symposium on
Safety, Security, and Rescue Robotics 2017 (SSRR 2017).
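A bare-bones sketch of the RANSAC plus least-squares (Kabsch) transformation-estimation stage, assuming putative 3D-3D correspondences between the UGV and UAV maps have already been produced by descriptor matching; thresholds and iteration counts are illustrative:

```python
import numpy as np

def rigid_transform(P, Q):
    """Least-squares rigid transform (Kabsch) mapping points P onto Q."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cq - R @ cp
    return R, t

def ransac_registration(P, Q, n_iters=500, inlier_thresh=0.5, rng=None):
    """RANSAC over putative correspondences P[i] <-> Q[i]."""
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(P), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(P), size=3, replace=False)
        R, t = rigid_transform(P[idx], Q[idx])
        residuals = np.linalg.norm((P @ R.T + t) - Q, axis=1)
        inliers = residuals < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return rigid_transform(P[best_inliers], Q[best_inliers]), best_inliers

# Toy usage: noisy correspondences with 20% corrupted matches.
rng = np.random.default_rng(1)
P = rng.normal(size=(100, 3))
t_true = np.array([1.0, 2.0, 0.5])
Q = P + t_true + 0.01 * rng.normal(size=(100, 3))
Q[:20] += 5.0 * rng.normal(size=(20, 3))          # outlier correspondences
(R_est, t_est), inliers = ransac_registration(P, Q)
print(t_est.round(2), inliers.sum())
```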