Appearance Descriptors for Person Re-identification: a Comprehensive Review
In video-surveillance, person re-identification is the task of recognising
whether an individual has already been observed over a network of cameras.
Typically, this is achieved by exploiting the clothing appearance, as classical
biometric traits like the face are impractical in real-world video surveillance
scenarios. Clothing appearance is represented by means of low-level
\textit{local} and/or \textit{global} features of the image, usually extracted
according to some part-based body model to treat different body parts (e.g.
torso and legs) independently. This paper provides a comprehensive review of
current approaches to build appearance descriptors for person
re-identification. The most relevant techniques are described in detail, and
categorised according to the body models and features used. The aim of this
work is to provide a structured body of knowledge and a starting point for
researchers willing to conduct novel investigations on this challenging topic.
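As a concrete illustration of the kind of descriptor this review surveys, the sketch below builds a toy part-based appearance descriptor: the image is split into torso and legs and a per-channel colour histogram is computed for each part. The two-part split, bin count and distance measure are illustrative assumptions, not any specific surveyed method.

```python
import numpy as np

def part_histogram_descriptor(image, bins=8):
    """Toy part-based appearance descriptor: split a pedestrian image
    into torso and legs, compute a per-channel colour histogram for
    each part, and concatenate the L1-normalised histograms."""
    h = image.shape[0]
    # Crude fixed body model: upper half ~ torso, lower half ~ legs.
    parts = [image[: h // 2], image[h // 2 :]]
    features = []
    for part in parts:
        for c in range(3):  # one histogram per colour channel
            hist, _ = np.histogram(part[..., c], bins=bins, range=(0, 256))
            features.append(hist / max(hist.sum(), 1))
    return np.concatenate(features)

# Compare two detections by descriptor distance (smaller = more similar).
a = np.random.randint(0, 256, (128, 48, 3), dtype=np.uint8)
b = np.random.randint(0, 256, (128, 48, 3), dtype=np.uint8)
print(np.linalg.norm(part_histogram_descriptor(a) - part_histogram_descriptor(b)))
```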
Drought Stress Classification using 3D Plant Models
Quantification of physiological changes in plants can capture different
drought mechanisms and assist in selection of tolerant varieties in a high
throughput manner. In this context, an accurate 3D model of plant canopy
provides a reliable representation for drought stress characterization in
contrast to using 2D images. In this paper, we propose a novel end-to-end
pipeline including 3D reconstruction, segmentation and feature extraction,
leveraging deep neural networks at various stages, for drought stress study. To
overcome the high degree of self-similarity and self-occlusion in the plant
canopy, prior knowledge of leaf shape, based on features from a deep siamese
network, is used to construct an accurate 3D model using structure from motion
on wheat plants. The drought stress is characterized with deep-network-based
feature aggregation. We compare the proposed methodology against several
descriptors and show that the network outperforms conventional methods.
Comment: Appears in Workshop on Computer Vision Problems in Plant Phenotyping
(CVPPP), International Conference on Computer Vision (ICCV) 201
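The abstract leans on siamese-network features to propose correspondences between views for structure from motion. Below is a minimal sketch of that matching step, assuming generic L2-normalised patch embeddings and a cosine-similarity threshold; both are illustrative stand-ins, not the paper's actual pipeline.

```python
import numpy as np

def cosine_match(desc_a, desc_b, threshold=0.9):
    """Match patch embeddings from two views by cosine similarity.
    desc_a: (N, D) and desc_b: (M, D) embeddings, e.g. produced by a
    siamese network over leaf patches."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                        # (N, M) cosine similarities
    best = sim.argmax(axis=1)            # best candidate in view B
    keep = sim[np.arange(len(a)), best] > threshold
    return np.flatnonzero(keep), best[keep]  # index pairs (A -> B)

ia, ib = cosine_match(np.random.randn(50, 128), np.random.randn(60, 128))
print(len(ia), "tentative correspondences")
```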
DF-SLAM: A Deep-Learning Enhanced Visual SLAM System based on Deep Local Features
As a foundation of driverless vehicles and intelligent robots, Simultaneous
Localization and Mapping (SLAM) has attracted much attention in recent years.
However, non-geometric modules of traditional SLAM algorithms are limited by
data association tasks and have become a bottleneck preventing the development
of SLAM. To deal with such problems, many researchers have turned to deep
learning for help. But most of these studies are limited to virtual datasets or
specific environments, and some even sacrifice efficiency for accuracy, making
them impractical.
We propose the DF-SLAM system, which uses deep local feature descriptors
obtained by a neural network as a substitute for traditional handcrafted features.
Experimental results demonstrate its improvements in efficiency and stability.
DF-SLAM outperforms popular traditional SLAM systems in various scenes,
including challenging scenes with intense illumination changes. Its versatility
and mobility fit well into the need for exploring new environments. Since we
adopt a shallow network to extract local descriptors and keep the rest of the
pipeline the same as in the original SLAM system, DF-SLAM can still run in
real time on a GPU.
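Substituting learned descriptors for handcrafted ones leaves the data-association machinery itself unchanged. The sketch below shows the standard step such a system relies on: brute-force nearest-neighbour matching with Lowe's ratio test, applied to generic learned descriptors. The matcher and the 0.8 ratio are illustrative assumptions, not DF-SLAM's implementation.

```python
import numpy as np

def ratio_test_match(query, train, ratio=0.8):
    """Brute-force nearest-neighbour matching with Lowe's ratio test.
    query: (N, D) and train: (M, D) float descriptor arrays."""
    # Pairwise squared Euclidean distances, (N, M).
    d2 = ((query[:, None, :] - train[None, :, :]) ** 2).sum(-1)
    order = np.argsort(d2, axis=1)
    nn1, nn2 = order[:, 0], order[:, 1]
    rows = np.arange(len(query))
    # Accept a match only if the best is clearly better than the runner-up.
    good = d2[rows, nn1] < (ratio ** 2) * d2[rows, nn2]
    return rows[good], nn1[good]

qi, ti = ratio_test_match(np.random.randn(100, 64), np.random.randn(120, 64))
print(len(qi), "matches kept")
```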
A Performance Evaluation of Local Features for Image Based 3D Reconstruction
This paper performs a comprehensive and comparative evaluation of
state-of-the-art local features for the task of image-based 3D reconstruction.
The evaluated local features cover both recently developed features learned
with powerful machine learning techniques and elaborately designed handcrafted
features.
To obtain a comprehensive evaluation, we include both float-type features
and binary ones. Two kinds of datasets have been used in
this evaluation. One is a dataset of many different scene types with
groundtruth 3D points, containing images of different scenes captured at fixed
positions, for quantitative performance evaluation of different local features
in the controlled image capturing situations. The other dataset contains
Internet scale image sets of several landmarks with a lot of unrelated images,
which is used for qualitative performance evaluation of different local
features in the free image collection situations. Our experimental results show
that binary features are competent to reconstruct scenes from controlled image
sequences in only a fraction of the processing time required by float-type
features. However, for large-scale image sets with many distracting images,
float-type features show a clear advantage over binary ones.
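The speed gap reported above comes largely from the matching cost: binary descriptors are compared with Hamming distance (bitwise XOR plus popcount), while float descriptors require Euclidean distance. A minimal numpy sketch of the two distance computations, with descriptor sizes chosen only for illustration:

```python
import numpy as np

def hamming_distances(a, b):
    """Hamming distances between packed binary descriptors.
    a: (N, B) uint8, b: (M, B) uint8, B bytes per descriptor."""
    xor = np.bitwise_xor(a[:, None, :], b[None, :, :])   # (N, M, B)
    return np.unpackbits(xor, axis=-1).sum(-1)           # popcount

def l2_distances(a, b):
    """Euclidean distances between float descriptors (e.g. 128-D)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

# 256-bit binary descriptors vs 128-D float descriptors.
bin_a = np.random.randint(0, 256, (200, 32), dtype=np.uint8)
bin_b = np.random.randint(0, 256, (300, 32), dtype=np.uint8)
flt_a, flt_b = np.random.randn(200, 128), np.random.randn(300, 128)
print(hamming_distances(bin_a, bin_b).shape, l2_distances(flt_a, flt_b).shape)
```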
From BoW to CNN: Two Decades of Texture Representation for Texture Classification
Texture is a fundamental characteristic of many types of images, and texture
representation is one of the essential and challenging problems in computer
vision and pattern recognition which has attracted extensive research
attention. Since 2000, texture representations based on Bag of Words (BoW) and
on Convolutional Neural Networks (CNNs) have been extensively studied with
impressive performance. Given this period of remarkable evolution, this paper
aims to present a comprehensive survey of advances in texture representation
over the last two decades. More than 200 major publications are cited in this
survey covering different aspects of the research, which includes (i) problem
description; (ii) recent advances in the broad categories of BoW-based,
CNN-based and attribute-based methods; and (iii) evaluation issues,
specifically benchmark datasets and state of the art results. In retrospect of
what has been achieved so far, the survey discusses open challenges and
directions for future research.
Comment: Accepted by IJCV
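For readers unfamiliar with the BoW side of the survey, here is a minimal sketch of Bag-of-Words encoding: each local descriptor is assigned to its nearest codeword and the assignments are pooled into a normalised histogram. The random codebook stands in for one learned with k-means; all sizes are illustrative.

```python
import numpy as np

def bow_encode(descriptors, codebook):
    """Bag-of-Words texture encoding: assign each local descriptor to
    its nearest codeword and return the L1-normalised histogram.
    descriptors: (N, D) local features; codebook: (K, D) cluster centres."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                  # hard assignment
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Codebooks are usually learned with k-means on a training set of local
# descriptors; random centres here just exercise the encoder.
codebook = np.random.randn(64, 32)
image_descriptors = np.random.randn(500, 32)
print(bow_encode(image_descriptors, codebook).shape)  # (64,)
```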
Orientation Driven Bag of Appearances for Person Re-identification
Person re-identification (re-id) consists of associating individuals across a
camera network, which is valuable for intelligent video surveillance and has
drawn wide attention. Although person re-identification research is making
progress, it still faces some challenges such as varying poses, illumination
and viewpoints. For feature representation in re-identification, existing works
usually use low-level descriptors which do not take full advantage of body
structure information, resulting in low representation ability.
To solve this problem, this paper proposes the mid-level
body-structure-based feature representation (BSFR), which introduces a body
structure pyramid for codebook learning and feature pooling in the vertical
direction of the human body. Besides, varying viewpoints in the horizontal
direction of the human body usually cause a data-missing problem, i.e., the
appearances of the same person observed from different orientations can vary
significantly. To address this problem, the orientation-driven bag of
appearances (ODBoA) is proposed to utilize person orientation information
extracted by an orientation estimation technique. To properly evaluate the proposed
approach, we introduce a new re-identification dataset (Market-1203) based on
the Market-1501 dataset and propose a new re-identification dataset (PKU-Reid).
Both datasets contain multiple images captured in different body orientations
for each person. Experimental results on three public datasets and two proposed
datasets demonstrate the superiority of the proposed approach, indicating the
effectiveness of body structure and orientation information for improving
re-identification performance.
Comment: 13 pages, 15 figures, 3 tables, submitted to IEEE Transactions on
Circuits and Systems for Video Technology
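The body-structure pyramid idea, pooling features along the vertical axis of the body, can be sketched as stripe-wise pooling over a dense feature map, so that head, torso and leg statistics stay separate before concatenation. This is a generic illustration under assumed pyramid levels, not BSFR as implemented in the paper.

```python
import numpy as np

def vertical_stripe_pooling(feature_map, levels=(1, 2, 4)):
    """Pyramid pooling along the vertical body axis: at each level the
    feature map is cut into horizontal stripes and max-pooled per stripe,
    then all pooled vectors are concatenated."""
    h = feature_map.shape[0]
    pooled = []
    for n_stripes in levels:
        for s in range(n_stripes):
            stripe = feature_map[s * h // n_stripes : (s + 1) * h // n_stripes]
            pooled.append(stripe.reshape(-1, feature_map.shape[-1]).max(axis=0))
    return np.concatenate(pooled)

fmap = np.random.randn(128, 48, 16)          # H x W x C dense features
print(vertical_stripe_pooling(fmap).shape)   # (1 + 2 + 4) * 16 = (112,)
```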
Learning a 3D descriptor for cross-source point cloud registration from synthetic data
With the development of 3D sensors, registration of 3D data (e.g. point
clouds) coming from different kinds of sensors is indispensable and in great
demand. However, point cloud registration between different sensors is
challenging because of variations in density, missing data, differences in
viewpoint, noise and outliers, and geometric transformations. In this paper, we
propose a method to learn a 3D descriptor for finding correspondences between
these challenging point clouds. To train the deep learning framework, we use
synthetic 3D point clouds as input. Starting from a synthetic dataset, we use a
region-based sampling method to select reasonable, large and diverse training
samples. We then use data augmentation to make our network robust to rotation
transformations. We focus our work on the more general case in which point
clouds come from different sensors, termed cross-source point clouds. The
experiments show that our descriptor generalizes not only to new scenes but
also to different sensors. The results demonstrate that the proposed method
successfully aligns two cross-source 3D point clouds and outperforms
state-of-the-art methods.
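The rotation augmentation mentioned above can be as simple as applying a random rotation to each training cloud. A minimal sketch, assuming rotations about the vertical (z) axis only; the paper may well augment over full 3D rotations.

```python
import numpy as np

def random_rotation(points, max_angle=np.pi):
    """Rotation augmentation for point clouds: apply a random rotation
    about the z axis so the learned descriptor sees the same geometry
    under many orientations. points: (N, 3) array."""
    theta = np.random.uniform(-max_angle, max_angle)
    c, s = np.cos(theta), np.sin(theta)
    rz = np.array([[c, -s, 0.0],
                   [s,  c, 0.0],
                   [0.0, 0.0, 1.0]])
    return points @ rz.T

cloud = np.random.randn(1024, 3)
print(random_rotation(cloud).shape)  # (1024, 3), same points rotated
```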
Scale-Adaptive Neural Dense Features: Learning via Hierarchical Context Aggregation
How do computers and intelligent agents view the world around them? Feature
extraction and representation constitute one of the basic building blocks towards
answering this question. Traditionally, this has been done with carefully
engineered hand-crafted techniques such as HOG, SIFT or ORB. However, there is
no ``one size fits all'' approach that satisfies all requirements. In recent
years, the rising popularity of deep learning has resulted in a myriad of
end-to-end solutions to many computer vision problems. These approaches, while
successful, tend to lack scalability and cannot easily exploit information
learned by other systems. Instead, we propose SAND features, a dedicated deep
learning solution to feature extraction capable of providing hierarchical
context information. This is achieved by employing sparse relative labels
indicating relationships of similarity/dissimilarity between image locations.
The nature of these labels results in an almost infinite set of dissimilar
examples to choose from. We demonstrate how the selection of negative examples
during training can be used to modify the feature space and vary its
properties. To demonstrate the generality of this approach, we apply the
proposed features to a multitude of tasks, each requiring different properties.
This includes disparity estimation, semantic segmentation, self-localisation
and SLAM. In all cases, we show how incorporating SAND features results in
better or comparable results to the baseline, whilst requiring little to no
additional training. Code can be found at:
https://github.com/jspenmar/SAND_features
Comment: CVPR201
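Sparse relative labels of similarity/dissimilarity naturally feed a triplet-style objective over pixel embeddings. The sketch below is one generic such formulation (a hinge triplet loss with an assumed margin), not necessarily the exact loss used to train SAND features.

```python
import numpy as np

def pixel_triplet_loss(feat, anchors, positives, negatives, margin=0.2):
    """Hinge-style triplet loss over a dense feature map using sparse
    relative labels: anchor pixels should be closer to their 'similar'
    pixels than to 'dissimilar' ones by at least `margin`.
    feat: (H, W, D); anchors/positives/negatives: (K, 2) row-col indices."""
    f = lambda ij: feat[ij[:, 0], ij[:, 1]]              # gather (K, D)
    d_pos = np.linalg.norm(f(anchors) - f(positives), axis=1)
    d_neg = np.linalg.norm(f(anchors) - f(negatives), axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

feat = np.random.randn(64, 64, 10)
idx = lambda k: np.random.randint(0, 64, (k, 2))
print(pixel_triplet_loss(feat, idx(32), idx(32), idx(32)))
```

Because any two unrelated locations can serve as a dissimilar pair, the pool of negatives is effectively unlimited, which is exactly the property the abstract exploits when shaping the feature space through negative selection.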
Data-Driven Shape Analysis and Processing
Data-driven methods play an increasingly important role in discovering
geometric, structural, and semantic relationships between 3D shapes in
collections, and applying this analysis to support intelligent modeling,
editing, and visualization of geometric data. In contrast to traditional
approaches, a key feature of data-driven approaches is that they aggregate
information from a collection of shapes to improve the analysis and processing
of individual shapes. In addition, they are able to learn models that reason
about properties and relationships of shapes without relying on hard-coded
rules or explicitly programmed instructions. We provide an overview of the main
concepts and components of these techniques, and discuss their application to
shape classification, segmentation, matching, reconstruction, modeling and
exploration, as well as scene analysis and synthesis, through reviewing the
literature and relating the existing works with both qualitative and numerical
comparisons. We conclude our report with ideas that can inspire future research
in data-driven shape analysis and processing.
Comment: 10 pages, 19 figures
Recovering 6D Object Pose and Predicting Next-Best-View in the Crowd
Object detection and 6D pose estimation in the crowd (scenes with multiple
object instances, severe foreground occlusions and background distractors), has
become an important problem in many rapidly evolving technological areas such
as robotics and augmented reality. Single shot-based 6D pose estimators with
manually designed features are still unable to tackle the above challenges,
motivating the research towards unsupervised feature learning and
next-best-view estimation. In this work, we present a complete framework for
both single shot-based 6D object pose estimation and next-best-view prediction
based on Hough Forests, the state of the art object pose estimator that
performs classification and regression jointly. Rather than using manually
designed features, we (a) propose unsupervised features learnt from
depth-invariant patches using a Sparse Autoencoder and (b) offer an extensive
evaluation of various state-of-the-art features. Furthermore, taking advantage
of the clustering performed in the leaf nodes of Hough Forests, we learn to
estimate the reduction of uncertainty in other views, formulating the problem
of selecting the next-best-view. To further improve pose estimation, we propose
an improved joint registration and hypotheses verification module as a final
refinement step to reject false detections. We provide two additional
challenging datasets inspired by realistic scenarios to extensively evaluate
the state of the art and our framework. One is related to domestic environments
and the other depicts a bin-picking scenario mostly found in industrial
settings. We show that our framework significantly outperforms state of the art
both on public datasets and on our own.
Comment: CVPR 2016 accepted paper, project page:
http://www.iis.ee.ic.ac.uk/rkouskou/6D_NBV.htm
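One simple way to make next-best-view selection concrete: score each candidate view by the entropy of its predicted distribution over pose hypotheses and pick the one expected to be least uncertain. This is a simplified stand-in for the Hough-Forest-based uncertainty-reduction formulation in the paper.

```python
import numpy as np

def next_best_view(hypothesis_probs_per_view):
    """Pick the view whose predicted pose-hypothesis distribution has the
    lowest entropy, i.e. the view expected to reduce uncertainty most.
    hypothesis_probs_per_view: (V, H) rows of probabilities over H poses."""
    p = np.clip(hypothesis_probs_per_view, 1e-12, 1.0)
    p = p / p.sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p)).sum(axis=1)   # Shannon entropy per view
    return int(entropy.argmin())

# Three candidate views over five pose hypotheses; view 1 is most peaked.
views = np.array([[0.2, 0.2, 0.2, 0.2, 0.2],
                  [0.9, 0.05, 0.03, 0.01, 0.01],
                  [0.4, 0.3, 0.2, 0.05, 0.05]])
print(next_best_view(views))  # -> 1
```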