Self-Taught Hashing for Fast Similarity Search
The ability to perform fast similarity search at large scale is of great
importance to many Information Retrieval (IR) applications. A promising way to
accelerate similarity search is semantic hashing, which designs compact binary codes for a
large number of documents so that semantically similar documents are mapped to
similar codes (within a short Hamming distance). Although some recently
proposed techniques are able to generate high-quality codes for documents known
in advance, obtaining the codes for previously unseen documents remains a
very challenging problem. In this paper, we emphasise this issue and propose a
novel Self-Taught Hashing (STH) approach to semantic hashing: we first find the
optimal l-bit binary codes for all documents in the given corpus via
unsupervised learning, and then train l classifiers via supervised learning
to predict the l-bit code for any query document unseen before. Our
experiments on three real-world text datasets show that the proposed approach
using binarised Laplacian Eigenmap (LapEig) and linear Support Vector Machine
(SVM) outperforms state-of-the-art techniques significantly.
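A minimal sketch of the two-stage recipe described above, using scikit-learn's SpectralEmbedding as a stand-in for the binarised Laplacian Eigenmap step and one linear SVM per bit; the median-threshold binarisation and all parameter choices here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.svm import LinearSVC

def self_taught_hashing(X_train, X_query, n_bits=8):
    # Stage 1 (unsupervised): embed the corpus with a Laplacian-Eigenmap-style
    # spectral embedding, then binarise each dimension at its median.
    embedding = SpectralEmbedding(n_components=n_bits).fit_transform(X_train)
    codes = (embedding > np.median(embedding, axis=0)).astype(int)

    # Stage 2 (supervised): train one classifier per bit so that unseen
    # query documents can be hashed without recomputing the embedding.
    classifiers = [LinearSVC().fit(X_train, codes[:, b]) for b in range(n_bits)]
    query_codes = np.column_stack([c.predict(X_query) for c in classifiers])
    return codes, query_codes
```

In practice X_train and X_query would be TF-IDF (or similar) document vectors; the second stage is exactly what addresses the out-of-sample problem the abstract highlights.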
SHOE: Supervised Hashing with Output Embeddings
We present a supervised binary encoding scheme for image retrieval that
learns projections by taking into account similarity between classes obtained
from output embeddings. Our motivation is that binary hash codes learned in
this way improve both the visual quality of retrieval results and existing
supervised hashing schemes. We employ a sequential greedy optimization that
learns relationship-aware projections by minimizing the difference between
inner products of binary codes and output embedding vectors. We develop a joint
optimization framework to learn projections which improve the accuracy of
supervised hashing over the current state of the art with respect to standard
and sibling evaluation metrics. We further boost performance by applying the
supervised dimensionality reduction technique to kernelized input CNN features.
Experiments are performed on three datasets: CUB-2011, SUN-Attribute and
ImageNet ILSVRC 2010. As a by-product of our method, we show that using a
simple k-nn pooling classifier with our discriminative codes improves over
complex classification models on fine-grained datasets like CUB, and offers an
impressive compression ratio of 1024 on CNN features.
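One plausible reading of the objective sketched above, as a hedged toy implementation: binary codes B (entries in {-1, +1}) are adjusted so that their inner products track the inner products of the output embeddings E. Both the unit scaling and the single-bit-flip greedy pass are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def shoe_objective(B, E):
    # Squared difference between inner products of binary codes (B, n x k,
    # entries in {-1, +1}) and inner products of output embeddings (E, n x d).
    return np.sum((B @ B.T - E @ E.T) ** 2)

def greedy_bit_pass(B, E):
    # Sequential greedy optimization: flip any single bit that lowers
    # the objective, otherwise revert the flip.
    for i in range(B.shape[0]):
        for b in range(B.shape[1]):
            before = shoe_objective(B, E)
            B[i, b] *= -1
            if shoe_objective(B, E) >= before:
                B[i, b] *= -1  # no improvement: undo the flip
    return B
```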
Machine Learning Techniques and Applications For Ground-based Image Analysis
Ground-based whole sky cameras have opened up new opportunities for
monitoring the earth's atmosphere. These cameras are an important complement to
satellite images by providing geoscientists with cheaper, faster, and more
localized data. The images captured by whole sky imagers can have high spatial
and temporal resolution, which is an important prerequisite for applications
such as solar energy modeling, cloud attenuation analysis, local weather
prediction, etc.
Extracting valuable information from the huge amount of image data by
detecting and analyzing the various entities in these images is challenging.
However, powerful machine learning techniques have become available to aid
with this analysis. This article provides a detailed walk-through of recent
developments in these techniques and their applications in ground-based
imaging. We aim to bridge the gap between computer vision and remote sensing
with the help of illustrative examples. We demonstrate the advantages of using
machine learning techniques in ground-based image analysis via three primary
applications -- segmentation, classification, and denoising.
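As a toy illustration of the first of those applications, a colour-clustering segmentation of a whole-sky image; real pipelines in this literature use richer colour channels and learned models, so this is only a sketch of the task, with scikit-learn's KMeans as an assumed stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_sky_image(image):
    # `image` is an H x W x 3 RGB array from a whole-sky camera.
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)
    # Cluster pixels into two colour groups (nominally cloud vs. clear sky;
    # which label is which is arbitrary and must be resolved afterwards).
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(pixels)
    return labels.reshape(h, w)
```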
Scalable Similarity Learning using Large Margin Neighborhood Embedding
Classifying large-scale image data into object categories is an important
problem that has received increasing research attention. Given the huge amount
of data, non-parametric approaches such as nearest neighbor classifiers have
shown promising results, especially when they are underpinned by a learned
distance or similarity measurement. Although metric learning has been well
studied in the past decades, most existing algorithms are impractical to handle
large-scale data sets. In this paper, we present an image similarity learning
method that can scale well in both the number of images and the dimensionality
of image descriptors. To this end, similarity comparison is restricted to each
sample's local neighbors and a discriminative similarity measure is induced
from large margin neighborhood embedding. We also exploit an ensemble of
projections so that high-dimensional features can be processed in a set of
lower-dimensional subspaces in parallel without much loss in performance.
The similarity function is learned online using a stochastic gradient descent
algorithm in which the triplet sampling strategy is customized for quick
convergence of classification performance. The effectiveness of our proposed
model is validated on several data sets with scales varying from tens of
thousands to one million images. Recognition accuracies competitive with the
state of the art are achieved with much higher efficiency and
scalability.
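A hedged sketch of one online triplet update of the kind the abstract describes: a linear map W defines the similarity s(x, y) = (Wx)·(Wy), and a stochastic step is taken only when a sampled triplet violates the large margin. The parameterisation, margin, and learning rate are generic assumptions, not the paper's.

```python
import numpy as np

def triplet_sgd_step(W, anchor, positive, negative, margin=1.0, lr=0.01):
    # Similarities under the current projection W (k x d).
    sim_pos = (W @ anchor) @ (W @ positive)
    sim_neg = (W @ anchor) @ (W @ negative)
    if sim_pos - sim_neg < margin:  # hinge active: margin is violated
        # Gradient of the hinge loss margin - (sim_pos - sim_neg) w.r.t. W,
        # using d/dW (a^T W^T W p) = W (a p^T + p a^T).
        grad = (W @ (np.outer(anchor, negative) + np.outer(negative, anchor))
                - W @ (np.outer(anchor, positive) + np.outer(positive, anchor)))
        W = W - lr * grad
    return W
```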
Supervised mid-level features for word image representation
This paper addresses the problem of learning word image representations:
given the cropped image of a word, we are interested in finding a descriptive,
robust, and compact fixed-length representation. Machine learning techniques
can then be supplied with these representations to produce models useful for
word retrieval or recognition tasks. Although many works have focused on the
machine learning aspect once a global representation has been produced, little
work has been devoted to the construction of those base image representations:
most works use standard coding and aggregation techniques directly on top of
standard computer vision features such as SIFT or HOG.
We propose to learn local mid-level features suitable for building word image
representations. These features are learnt by leveraging character bounding box
annotations on a small set of training images. However, contrary to other
approaches that use character bounding box information, our approach does not
rely on detecting the individual characters explicitly at testing time. Our
local mid-level features can then be aggregated to produce a global word image
signature. When pairing these features with the recent word attributes
framework of Almazán et al., we obtain results comparable with or better than
the state of the art on matching and recognition tasks, using global descriptors
of only 96 dimensions.
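For concreteness, the aggregation step the abstract alludes to could be as simple as pooling the local mid-level features into one normalised vector; average pooling here is a hypothetical stand-in for the paper's actual aggregation, and the 96-dimensional figure is only echoed from the abstract.

```python
import numpy as np

def aggregate_word_signature(local_features):
    # local_features: n x d array of mid-level descriptors extracted from
    # one cropped word image. Average-pool and L2-normalise to obtain a
    # fixed-length global signature usable for matching and retrieval.
    pooled = local_features.mean(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-12)
```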
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up cross-modal retrieval, a number of binary representation learning
methods have been proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristics of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
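To make the binary-representation branch of that taxonomy concrete, a minimal Hamming-space retrieval routine: once both modalities are hashed into a shared code space, cross-modal search reduces to ranking by Hamming distance. The 0/1 code convention is an assumption for this sketch.

```python
import numpy as np

def hamming_retrieve(query_code, gallery_codes, k=5):
    # query_code: length-b 0/1 code from one modality (e.g. a text query);
    # gallery_codes: n x b 0/1 codes from the other modality (e.g. images).
    distances = np.count_nonzero(gallery_codes != query_code, axis=1)
    return np.argsort(distances)[:k]  # indices of the k nearest items
```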
Machine listening intelligence
This manifesto paper will introduce machine listening intelligence, an
integrated research framework for acoustic and musical signals modelling, based
on signal processing, deep learning and computational musicology.
Comment: Proceedings of the First International Conference on Deep Learning
and Music, Anchorage, US, May 2017 (arXiv:1706.08675v1 [cs.NE])
Towards Learning a Universal Non-Semantic Representation of Speech
The ultimate goal of transfer learning is to reduce labeled data requirements
by exploiting a pre-existing embedding model trained on different datasets or
tasks. The visual and language communities have established benchmarks to
compare embeddings, but the speech community has yet to do so. This paper
proposes a benchmark for comparing speech representations on non-semantic
tasks, and proposes a representation based on an unsupervised triplet-loss
objective. The proposed representation outperforms other representations on the
benchmark, and even exceeds state-of-the-art performance on a number of
transfer learning tasks. The embedding is trained on a publicly available
dataset, and it is tested on a variety of low-resource downstream tasks,
including personalization tasks and the medical domain. The benchmark, models,
and evaluation code are publicly released.
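A common way benchmarks of this kind score a frozen embedding is a shallow probe on each downstream task; the sketch below uses logistic regression with cross-validation as a generic stand-in, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_probe_score(embeddings, labels):
    # embeddings: n x d frozen speech representations; labels: n task labels.
    # A small classifier on top measures how much task-relevant information
    # the representation carries, without any fine-tuning of the embedding.
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, embeddings, labels, cv=5).mean()
```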
Unsupervised Learning on Neural Network Outputs: with Application in Zero-shot Learning
The outputs of a trained neural network contain much richer information than
just a one-hot class prediction. For example, a network might assign an image
of a dog a one-in-a-million probability of being a cat, yet that probability is
still much larger than the probability of being a car. To reveal the hidden structure
in them, we apply two unsupervised learning algorithms, PCA and ICA, to the
outputs of a deep Convolutional Neural Network trained on ImageNet with 1000
classes. The PCA/ICA embedding of the object classes reveals their visual
similarity and the PCA/ICA components can be interpreted as common visual
features shared by similar object classes. As an application, we propose a
new zero-shot learning method in which the visual features learned by PCA/ICA
are employed. Our zero-shot learning method achieves state-of-the-art
results on ImageNet with over 20,000 classes.
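A minimal sketch of the analysis described above, assuming scikit-learn's PCA and FastICA; how the output matrix is collected and the component count are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

def embed_class_outputs(outputs, n_components=50):
    # outputs: n_samples x n_classes matrix of softmax (or logit) vectors
    # from a trained network. PCA/ICA components over these outputs act as
    # shared visual features relating the object classes.
    pca_embedding = PCA(n_components=n_components).fit_transform(outputs)
    ica_embedding = FastICA(n_components=n_components,
                            max_iter=1000).fit_transform(outputs)
    return pca_embedding, ica_embedding
```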
COBRA: Contrastive Bi-Modal Representation Algorithm
There are a wide range of applications that involve multi-modal data, such as
cross-modal retrieval, visual question-answering, and image captioning. Such
applications are primarily dependent on aligned distributions of the different
constituent modalities. Existing approaches generate latent embeddings for each
modality in a joint fashion by representing them in a common manifold. However,
these joint embedding spaces fail to sufficiently reduce the modality gap,
which affects the performance in downstream tasks. We hypothesize that these
embeddings retain the intra-class relationships but are unable to preserve the
inter-class dynamics. In this paper, we present a novel framework COBRA that
aims to train two modalities (image and text) in a joint fashion inspired by
the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE)
paradigms, which preserve both inter- and intra-class relationships. We
empirically show that this framework reduces the modality gap significantly and
generates a robust and task agnostic joint-embedding space. We outperform
existing work on four diverse downstream tasks spanning across seven benchmark
cross-modal datasets.
Comment: 13 pages, 6 figures, and 10 tables
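Since the abstract leans on the CPC/NCE family, here is a generic InfoNCE-style loss over a batch of paired image/text embeddings, with matched pairs as positives and every other in-batch pairing as a negative; this is a sketch of that family of objectives, not COBRA's exact loss.

```python
import numpy as np

def info_nce_loss(image_emb, text_emb, temperature=0.1):
    # image_emb, text_emb: n x d arrays of L2-normalised embeddings,
    # row i of each being one matched image-text pair.
    logits = image_emb @ text_emb.T / temperature  # n x n similarities
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))  # maximise agreement of true pairs
```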