DeepPoint3D: Learning Discriminative Local Descriptors using Deep Metric Learning on 3D Point Clouds
Learning local descriptors is an important problem in computer vision. While
there are many techniques for learning local patch descriptors for 2D images,
recently efforts have been made for learning local descriptors for 3D points.
The recent progress towards solving this problem in 3D leverages the strong
feature representation capability of image based convolutional neural networks
by utilizing RGB-D or multi-view representations. However, in this paper, we
propose to learn 3D local descriptors by directly processing unstructured 3D
point clouds without needing any intermediate representation. The method
consists of a deep network that learns a permutation-invariant representation of
3D points. To learn the local descriptors, we use a multi-margin contrastive
loss which discriminates between similar and dissimilar points on a surface
while also leveraging the extent of dissimilarity among the negative samples at
the time of training. With comprehensive evaluation against strong baselines,
we show that the proposed method outperforms state-of-the-art methods for
matching points in 3D point clouds. Further, we demonstrate the effectiveness
of the proposed method on various applications achieving state-of-the-art
results.
Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimising Global Loss Functions
Recent innovations in training deep convolutional neural network (ConvNet)
models have motivated the design of new methods to automatically learn local
image descriptors. The latest deep ConvNets proposed for this task consist of a
siamese network that is trained by penalising misclassification of pairs of
local image patches. Current results from machine learning show that replacing
this siamese network with a triplet network can improve the classification accuracy in
several problems, but this has yet to be demonstrated for local image
descriptor learning. Moreover, current siamese and triplet networks have been
trained with stochastic gradient descent that computes the gradient from
individual pairs or triplets of local image patches, which can make them prone
to overfitting. In this paper, we first propose the use of triplet networks for
the problem of local image descriptor learning. Furthermore, we also propose
the use of a global loss that minimises the overall classification error in the
training set, which can improve the generalisation capability of the model.
Using the UBC benchmark dataset for comparing local image descriptors, we show
that the triplet network produces a more accurate embedding than the siamese
network in terms of the UBC dataset errors. Moreover, we also demonstrate that
a combination of the triplet and global losses produces the best embedding in
the field, using this triplet network. Finally, we also show that the use of
the central-surround siamese network trained with the global loss produces the
best result of the field on the UBC dataset. Pre-trained models are available
online at https://github.com/vijaykbg/deep-patchmatch
Comment: IEEE Conference on Computer Vision and Pattern Recognition 2016 (CVPR 2016)
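The abstract above contrasts siamese (pairwise) and triplet training. As a rough illustration only, the sketch below shows the commonly used hinge-style triplet loss on descriptor vectors. It is not the global loss proposed in the paper, and the function names and the default `margin` are assumptions chosen for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge triplet loss (an assumption, not the paper's global loss).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)


# A well-separated triplet incurs no loss; a violated one is penalised.
easy = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

A siamese setup would instead score one pair at a time; the triplet form compares a matching and a non-matching patch against the same anchor in a single term.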
From handcrafted to deep local features
This paper presents an overview of the evolution of local features from
handcrafted to deep-learning-based methods, followed by a discussion of several
benchmarks and papers evaluating such local features. Our investigations are
motivated by 3D reconstruction problems, where the precise location of the
features is important. As we describe these methods, we highlight and explain
the challenges of feature extraction and potential ways to overcome them. We
first present handcrafted methods, followed by methods based on classical
machine learning and finally we discuss methods based on deep-learning. This
largely chronologically-ordered presentation will help the reader to fully
understand the topic of image and region description in order to make best use
of it in modern computer vision applications. In particular, understanding
handcrafted methods and their motivation can help to understand modern
approaches and how machine learning is used to improve the results. We also
provide references to most of the relevant literature and code.
Comment: Preprint
A Decade Survey of Content Based Image Retrieval using Deep Learning
Content-based image retrieval aims to find images similar to a query image
in a large-scale dataset. Generally, the similarity between the representative
features of the query image and those of the dataset images is used to rank the
images for retrieval. In the early days, various hand-designed feature
descriptors were investigated, based on visual cues such as color, texture,
and shape, that represent the images. Over the past decade, however, deep
learning has emerged as a dominant alternative to hand-designed feature
engineering. It learns the features automatically from the data. This paper
presents a comprehensive survey of deep-learning-based developments in
content-based image retrieval over the past decade. Existing state-of-the-art
methods are also categorized from different perspectives for a better
understanding of the progress. The taxonomy used in this survey covers
different types of supervision, networks, descriptors, and retrieval. A
performance analysis of the state-of-the-art methods is also carried out, and
insights are presented to help researchers observe the progress and make the
best choices. The survey presented in this paper will help further research
progress in image retrieval using deep learning.
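The ranking step described in the abstract above (compare the query's features with each dataset image's features, then sort) can be sketched as follows. Cosine similarity is an assumption made for this example; the survey covers many similarity measures, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two feature vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, database):
    """Return database indices ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query, feat), idx)
              for idx, feat in enumerate(database)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Toy features: image 1 points the same way as the query, image 0 is orthogonal.
ranking = rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
```

In a learned system the vectors would come from a network rather than being written by hand; the ranking logic stays the same.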
Learning to Align Images using Weak Geometric Supervision
Image alignment tasks require accurate pixel correspondences, which are
usually recovered by matching local feature descriptors. Such descriptors are
often derived using supervised learning on existing datasets with ground truth
correspondences. However, the cost of creating such datasets is usually
prohibitive. In this paper, we propose a new approach to align two images
related by an unknown 2D homography where the local descriptor is learned from
scratch from the images and the homography is estimated simultaneously. Our key
insight is that a siamese convolutional neural network can be trained jointly
while iteratively updating the homography parameters by optimizing a single
loss function. Our method is currently weakly supervised because the input
images need to be roughly aligned.
We have used this method to align images of different modalities such as RGB
and near-infra-red (NIR) without using any prior labeled data. Images
automatically aligned by our method were then used to train descriptors that
generalize to new images. We also evaluated our method on RGB images. On the
HPatches benchmark, our method achieves comparable accuracy to deep local
descriptors that were trained offline in a supervised setting.
Comment: Accepted in 3DV 201
Deep Learning for Image Search and Retrieval in Large Remote Sensing Archives
This chapter presents recent advances in content based image search and
retrieval (CBIR) systems in remote sensing (RS) for fast and accurate
information discovery from massive data archives. Initially, we analyze the
limitations of the traditional CBIR systems that rely on the hand-crafted RS
image descriptors. Then, we focus our attention on the advances in RS CBIR
systems for which deep learning (DL) models are at the forefront. In
particular, we present the theoretical properties of the most recent DL based
CBIR systems for the characterization of the complex semantic content of RS
images. After discussing their strengths and limitations, we present deep
hashing-based CBIR systems that offer time-efficient search
within huge data archives. Finally, the most promising research directions in
RS CBIR are discussed.
Comment: To appear as a book chapter in "Deep Learning for the Earth
Sciences", John Wiley & Sons, 202
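To make the time-efficiency argument for hashing-based retrieval concrete, here is a minimal sketch, assuming image features are binarized by sign and compared by Hamming distance. The chapter does not specify this scheme; `binarize`, `hamming_distance`, and `hamming_rank` are hypothetical names used only for illustration.

```python
def binarize(features):
    """Pack a feature vector into an integer code: one bit per dimension,
    set to 1 where the value is positive (an assumed thresholding rule)."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(code_a, code_b):
    """Number of differing bits between two integer hash codes."""
    return bin(code_a ^ code_b).count("1")


def hamming_rank(query_code, archive_codes):
    """Archive indices sorted by increasing Hamming distance to the query."""
    return sorted(range(len(archive_codes)),
                  key=lambda i: hamming_distance(query_code, archive_codes[i]))


query = binarize([0.5, -1.0, 2.0, -0.1])          # bits 1010
order = hamming_rank(0b1111, [0b0000, 0b1110, 0b1100])
```

Comparing codes takes one XOR and a bit count per archive item, which is why binary codes suit very large archives.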
Unsupervised learning from videos using temporal coherency deep networks
In this work we address the challenging problem of unsupervised learning from
videos. Existing methods utilize the spatio-temporal continuity in contiguous
video frames as regularization for the learning process. Typically, this
temporal coherence of close frames is used as a free form of annotation,
encouraging the learned representations to exhibit small differences between
these frames. But this type of approach fails to capture the dissimilarity
between videos with different content, hence learning less discriminative
features. We here propose two Siamese architectures for Convolutional Neural
Networks, and their corresponding novel loss functions, to learn from unlabeled
videos, which jointly exploit the local temporal coherence between contiguous
frames, and a global discriminative margin used to separate representations of
different videos. An extensive experimental evaluation is presented, where we
validate the proposed models on various tasks. First, we show how the learned
features can be used to discover actions and scenes in video collections.
Second, we show the benefits of such unsupervised learning from unlabeled
videos alone, which can be directly used as a prior for the supervised
recognition tasks of actions and objects in images, where our results further
show that our features can even surpass a traditional and heavily supervised
pre-training plus fine-tuning strategy.
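The idea in the abstract above, pulling features of contiguous frames together while separating features of different videos by a margin, resembles a contrastive objective. The sketch below is a generic version under that assumption; it is not the paper's loss functions, and the names and default `margin` are hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two frame feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def temporal_coherence_loss(feat_a, feat_b, same_video, margin=1.0):
    """Generic contrastive-style term (an assumption, not the paper's loss).

    Contiguous frames of the same video are penalised by their distance;
    frames from different videos are penalised only while closer than `margin`.
    """
    d = l2_distance(feat_a, feat_b)
    if same_video:
        return d
    return max(0.0, margin - d)


# Nearby features: a cost if the frames are neighbours, and also a cost if
# they come from different videos (they are still inside the margin).
near_same = temporal_coherence_loss([0.0, 0.0], [0.3, 0.4], same_video=True)
near_diff = temporal_coherence_loss([0.0, 0.0], [0.3, 0.4], same_video=False)
far_diff = temporal_coherence_loss([0.0, 0.0], [3.0, 4.0], same_video=False)
```

Without the second branch, mapping every frame to the same point would minimise the loss, which is the lack of discriminative power the abstract points out.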
Survey on Deep Learning Techniques for Person Re-Identification Task
Intelligent video surveillance is currently an active research field in
computer vision and machine learning. It provides useful tools for
surveillance operators and forensic video investigators. Person
re-identification (PReID) is one of these tools. It consists of recognizing
whether an individual has already been observed by a camera in a network.
This tool can also be employed in applications such as
off-line retrieval of all the video sequences showing an individual of interest
whose image is given as a query, and online pedestrian tracking over multiple
camera views. To this end, many techniques have been proposed to increase the
performance of PReID. Among these, many researchers have used deep neural
networks (DNNs) because of their better performance and fast execution at test
time. Our objective is to give future researchers an overview of the work done
on PReID to date. We therefore summarize the state-of-the-art DNN models used
for this task. A brief description of each model, along with its evaluation on
a set of benchmark datasets, is given. Finally, a detailed comparison of these
models is provided, followed by some limitations that can serve as
guidelines for future research.
Fine-tuning CNN Image Retrieval with No Human Annotation
Image descriptors based on activations of Convolutional Neural Networks
(CNNs) have become dominant in image retrieval due to their discriminative
power, compactness of representation, and search efficiency. Training of CNNs,
either from scratch or fine-tuning, requires a large amount of annotated data,
where a high quality of annotation is often crucial. In this work, we propose
to fine-tune CNNs for image retrieval on a large collection of unordered images
in a fully automated manner. Reconstructed 3D models obtained by the
state-of-the-art retrieval and structure-from-motion methods guide the
selection of the training data. We show that both hard-positive and
hard-negative examples, selected by exploiting the geometry and the camera
positions available from the 3D models, enhance the performance of
particular-object retrieval. CNN descriptor whitening discriminatively learned
from the same training data outperforms commonly used PCA whitening. We propose
a novel trainable Generalized-Mean (GeM) pooling layer that generalizes max and
average pooling and show that it boosts retrieval performance. Applying the
proposed method to the VGG network achieves state-of-the-art performance on the
standard benchmarks: Oxford Buildings, Paris, and Holidays datasets.
Comment: TPAMI 2018. arXiv admin note: substantial text overlap with
arXiv:1604.0242
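The claim that Generalized-Mean (GeM) pooling generalizes max and average pooling can be checked numerically. The sketch below pools the activations of a single feature map with a fixed exponent `p`; in the paper `p` is trainable, so treating it as a constant here is a simplification, and the `eps` clamp is an assumption added to keep the power well defined.

```python
def gem_pool(activations, p=3.0, eps=1e-6):
    """Generalized-mean pooling over one feature map's activations.

    Computes (mean(x ** p)) ** (1 / p). With p = 1 this is average pooling;
    as p grows it approaches max pooling. `p` is fixed here, not learned.
    """
    clamped = [max(a, eps) for a in activations]
    mean_pow = sum(a ** p for a in clamped) / len(clamped)
    return mean_pow ** (1.0 / p)


feature_map = [1.0, 2.0, 3.0]
as_average = gem_pool(feature_map, p=1.0)     # equals the arithmetic mean
near_max = gem_pool(feature_map, p=100.0)     # close to max(feature_map)
```

Intermediate values of `p` weight large activations more heavily than averaging does without discarding the rest, as max pooling would.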
Deep Metric Learning with BIER: Boosting Independent Embeddings Robustly
Learning similarity functions between image pairs with deep neural networks
yields highly correlated activations of embeddings. In this work, we show how
to improve the robustness of such embeddings by exploiting the independence
within ensembles. To this end, we divide the last embedding layer of a deep
network into an embedding ensemble and formulate training this ensemble as an
online gradient boosting problem. Each learner receives a reweighted training
sample from the previous learners. Further, we propose two loss functions which
increase the diversity in our ensemble. These loss functions can be applied
either for weight initialization or during training. Together, our
contributions leverage large embedding sizes more effectively by significantly
reducing correlation of the embedding and consequently increase retrieval
accuracy of the embedding. Our method works with any differentiable loss
function and does not introduce any additional parameters during test time. We
evaluate our metric learning method on image retrieval tasks and show that it
improves over state-of-the-art methods on the CUB 200-2011, Cars-196, Stanford
Online Products, In-Shop Clothes Retrieval and VehicleID datasets.
Comment: Extension to our paper BIER: Boosting Independent Embeddings Robustly
(ICCV 2017 oral) - submitted to PAM