Spherical CNNs on Unstructured Grids
We present an efficient convolution kernel for Convolutional Neural Networks
(CNNs) on unstructured grids using parameterized differential operators while
focusing on spherical signals such as panorama images or planetary signals. To
this end, we replace conventional convolution kernels with linear combinations
of differential operators that are weighted by learnable parameters.
Differential operators can be efficiently estimated on unstructured grids using
one-ring neighbors, and learnable parameters can be optimized through standard
back-propagation. As a result, we obtain highly efficient neural networks
that match or outperform state-of-the-art architectures while using
significantly fewer network parameters. We
evaluate our algorithm in an extensive series of experiments on a variety of
computer vision and climate science tasks, including shape classification,
climate pattern segmentation, and omnidirectional image semantic segmentation.
Overall, we present (1) a novel CNN approach on unstructured grids that uses
parameterized differential operators for spherical signals, and (2) evidence
that our unique kernel parameterization allows our model to achieve the same
or higher accuracy with significantly fewer network parameters.
Comment: Accepted as a conference paper at ICLR 2019. Code available at
https://github.com/maxjiang93/ugscn
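The core idea above, replacing a convolution kernel with a learnable linear combination of differential operators, can be illustrated with a minimal numpy sketch. This is not the paper's mesh implementation: here the derivatives are estimated by finite differences on a regular 2D grid (on an unstructured mesh they would be estimated from one-ring neighbors, per-channel and per-layer), and the function name and weight layout are illustrative.

```python
import numpy as np

def diff_op_conv(signal, w):
    """Convolution as a learnable combination of differential operators.

    The kernel is replaced by
        w[0]*identity + w[1]*d/dx + w[2]*d/dy + w[3]*Laplacian,
    where w holds the learnable parameters. Derivatives are estimated
    here with finite differences on a regular grid as a stand-in for
    the one-ring-neighbor estimates used on unstructured meshes.
    """
    identity = signal
    dx = np.gradient(signal, axis=1)           # east-west derivative
    dy = np.gradient(signal, axis=0)           # north-south derivative
    lap = np.gradient(dx, axis=1) + np.gradient(dy, axis=0)  # Laplacian
    return w[0] * identity + w[1] * dx + w[2] * dy + w[3] * lap
```

With `w = [1, 0, 0, 0]` the operator reduces to the identity, and on a linear ramp the pure-derivative setting `w = [0, 1, 0, 0]` returns the constant slope, which makes the parameterization easy to sanity-check before training the weights by back-propagation.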
A Survey on Learning to Hash
Nearest neighbor search is the problem of finding the data points in a
database whose distances to a query point are smallest. Learning to hash is
one of the major solutions to this problem and has been widely studied
recently. In this paper, we present a comprehensive survey of
learning-to-hash algorithms, categorize them by how they preserve
similarities into pairwise similarity preserving, multiwise similarity
preserving, implicit similarity preserving, and quantization, and discuss
their relations. We treat quantization separately from pairwise similarity
preserving because its objective function is very different, even though, as
we show, quantization can be derived from preserving pairwise similarities.
In addition, we present evaluation protocols and a general performance
analysis, and point out that quantization algorithms perform best in terms of
search accuracy, search time, and space cost. Finally, we introduce a few
emerging topics.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI).
Triplet-Based Deep Hashing Network for Cross-Modal Retrieval
Given the benefits of its low storage requirements and high retrieval
efficiency, hashing has recently received increasing attention. In
particular, cross-modal hashing has been widely and successfully used in
multimedia similarity search applications. However, almost all existing
cross-modal hashing methods fail to learn powerful hash codes because they
ignore the relative similarity between heterogeneous data, which carries
richer semantic information, leading to unsatisfactory retrieval performance.
In this paper, we propose a triplet-based deep hashing (TDH) network for
cross-modal retrieval. First, we utilize triplet labels, which describe
the relative relationships among three instances, as supervision in order to
capture more general semantic correlations between cross-modal instances. We
then establish a loss function from the inter-modal view and the intra-modal
view to boost the discriminative abilities of the hash codes. Finally, graph
regularization is introduced into our proposed TDH method to preserve the
original semantic similarity between hash codes in Hamming space. Experimental
results show that our proposed method outperforms several state-of-the-art
approaches on two popular cross-modal datasets.
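The triplet supervision described above can be sketched as a standard triplet margin loss on relaxed (continuous) hash codes; in the cross-modal setting the anchor can come from one modality and the positive/negative from another. This is a generic sketch, not the paper's full inter- plus intra-modal objective with graph regularization, and the margin value is illustrative.

```python
import numpy as np

def triplet_hash_loss(anchor, positive, negative, margin=2.0):
    """Triplet margin loss on relaxed hash codes.

    Pulls the anchor (e.g. an image code) toward the semantically
    matched positive (e.g. a text code) and pushes it from the
    mismatched negative until their distances differ by `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```

When the negative already lies farther than the positive by more than the margin, the loss is zero and the triplet stops contributing gradients, which is what lets training focus on the informative (hard) triplets.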
Describing like humans: on diversity in image captioning
Recently, the state-of-the-art models for image captioning have overtaken
human performance based on the most popular metrics, such as BLEU, METEOR,
ROUGE, and CIDEr. Does this mean we have solved the task of image captioning?
The above metrics only measure the similarity of the generated caption to the
human annotations, which reflects its accuracy. However, an image contains many
concepts and multiple levels of detail, and thus there is a variety of captions
that express different concepts and details that might be interesting for
different humans. Therefore, evaluating accuracy alone is not sufficient for
measuring the performance of captioning models; the diversity of the
generated captions should also be considered. In this paper, we propose a new
metric for measuring the diversity of image captions, which is derived from
latent semantic analysis and kernelized to use CIDEr similarity. We conduct
extensive experiments to re-evaluate recent captioning models in the context of
both diversity and accuracy. We find that there is still a large gap between
model and human performance in terms of both accuracy and diversity, and that
models optimized for accuracy (CIDEr) have low diversity. We also show
that balancing the cross-entropy loss and CIDEr reward in reinforcement
learning during training can effectively control the tradeoff between diversity
and accuracy of the generated captions.
Comment: Accepted by CVPR 2019. In this version, we correct the y-axis label
in a figure.
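One plausible instantiation of an LSA-style diversity score for a set of captions can be sketched as follows. This is an assumption-laden stand-in, not the paper's metric: cosine similarity over bags of words substitutes for the CIDEr similarity used to kernelize the method, and the spectral-spread score is illustrative.

```python
import numpy as np
from collections import Counter

def caption_diversity(captions):
    """LSA-style diversity score for captions of a single image.

    Builds a pairwise similarity kernel over the captions (cosine over
    bag-of-words here, as a stand-in for CIDEr similarity) and measures
    how spread its singular-value spectrum is: near-identical captions
    concentrate the spectrum in one value (score near 0), while distinct
    captions spread it out (score approaching 1).
    """
    vocab = sorted(set(w for c in captions for w in c.split()))
    M = np.array([[Counter(c.split())[w] for w in vocab]
                  for c in captions], dtype=float)
    M /= np.linalg.norm(M, axis=1, keepdims=True)   # unit-length rows
    K = M @ M.T                                     # pairwise similarities
    s = np.linalg.svd(K, compute_uv=False)
    return 1.0 - s[0] / s.sum()                     # spectral spread
```

Three identical captions score 0, while three captions with disjoint vocabularies score 2/3 (the maximum for three captions), matching the intuition that accuracy-optimized models which always emit the same caption would be penalized.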
An Empirical Study of Spatial Attention Mechanisms in Deep Networks
Attention mechanisms have become a popular component in deep neural networks,
yet there has been little examination of how different influencing factors and
methods for computing attention from these factors affect performance. Toward a
better general understanding of attention mechanisms, we present an empirical
study that ablates various spatial attention elements within a generalized
attention formulation, encompassing the dominant Transformer attention as well
as the prevalent deformable convolution and dynamic convolution modules.
Conducted on a variety of applications, the study yields significant findings
about spatial attention in deep networks, some of which run counter to
conventional understanding. For example, we find that the query and key content
comparison in Transformer attention is negligible for self-attention, but vital
for encoder-decoder attention. A proper combination of deformable convolution
with key-content-only saliency achieves the best accuracy-efficiency tradeoff
in self-attention. Our results suggest that there is still much room for
improvement in the design of attention mechanisms.
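Two of the attention factors discussed above can be sketched in a few lines: the query-and-key content comparison (the dot-product term of Transformer attention) and a key-content-only saliency term (a per-key bias independent of the query). This is a simplified sketch of the generalized formulation; the relative-position terms are omitted and the function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized
    return e / e.sum(axis=axis, keepdims=True)

def attention_factors(q, k, key_bias):
    """Attention weights from two decomposed logit terms.

    e1: scaled dot product of query and key content (the Transformer
        query-key comparison found negligible for self-attention).
    e3: key-content-only saliency, a per-key bias independent of the
        query (as in the best-tradeoff combination reported above).
    """
    e1 = q @ k.T / np.sqrt(q.shape[-1])   # query-key content comparison
    e3 = key_bias[None, :]                # key-only saliency, broadcast
    return softmax(e1 + e3, axis=-1)
```

Ablating a factor amounts to zeroing its term before the softmax, which is essentially how such an empirical study isolates each factor's contribution.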
Learning to Hash for Indexing Big Data - A Survey
The explosive growth in big data has attracted much attention in designing
efficient indexing and search methods recently. In many critical applications
such as large-scale search and pattern matching, finding the nearest neighbors
to a query is a fundamental research problem. However, the straightforward
solution using exhaustive comparison is infeasible due to the prohibitive
computational complexity and memory requirement. In response, Approximate
Nearest Neighbor (ANN) search based on hashing techniques has become popular
due to its promising performance in both efficiency and accuracy. Prior
randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore
data-independent hash functions with random projections or permutations.
Although they have elegant theoretical guarantees on search quality in
certain metric spaces, randomized hashing methods have proven insufficient in
many real-world applications. As a remedy, new approaches that incorporate
data-driven learning in the development of advanced hash functions have
emerged. Such learning-to-hash methods exploit information such as data
distributions or class labels when optimizing the hash codes or functions.
Importantly, the learned hash codes are able to preserve, in the hash code
space, the proximity of neighboring data in the original feature space. The
goal of this paper is to provide readers with a systematic understanding of
the insights, pros, and cons of these emerging techniques. We provide a comprehensive
survey of the learning to hash framework and representative techniques of
various types, including unsupervised, semi-supervised, and supervised. In
addition, we also summarize recent hashing approaches utilizing the deep
learning models. Finally, we discuss future directions and trends of
research in this area.
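The data-independent baseline mentioned above, Locality-Sensitive Hashing with random projections (the SimHash family for cosine similarity), fits in a few lines and makes a useful contrast with the learned methods: each bit is the sign of a random hyperplane projection, so nearby vectors tend to agree on many bits. Function names are illustrative.

```python
import numpy as np

def lsh_codes(X, n_bits, seed=0):
    """Random-projection LSH (SimHash-style, for cosine similarity).

    Each bit is the sign of the projection onto a random hyperplane;
    no training data is used, which is exactly the data-independence
    that learning-to-hash methods improve upon.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    return (X @ planes >= 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.sum(a != b))
```

Search then compares compact codes by Hamming distance instead of comparing raw vectors exhaustively, which is the efficiency win that motivates the whole hashing line of work.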
On Regularized Losses for Weakly-supervised CNN Segmentation
Minimization of regularized losses is a principled approach to weak
supervision that is well established in deep learning in general. However, it
is largely overlooked in semantic segmentation, which is currently dominated
by methods mimicking full supervision via "fake" fully-labeled training masks
(proposals) generated from the available partial input. To obtain such full
masks, typical methods explicitly use standard regularization techniques for
"shallow" segmentation, e.g. graph cuts or dense CRFs. In contrast, we integrate such
standard regularizers directly into the loss functions over partial input. This
approach simplifies weakly-supervised training by avoiding extra MRF/CRF
inference steps or layers explicitly generating full masks, while improving
both the quality and efficiency of training. This paper proposes and
experimentally compares different losses integrating MRF/CRF regularization
terms. We juxtapose our regularized losses with earlier proposal-generation
methods using explicit regularization steps or layers. Our approach achieves
state-of-the-art accuracy in semantic segmentation with near full-supervision
quality.
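The idea of a regularized loss over partial input can be sketched as cross-entropy on the labeled pixels plus a relaxed pairwise (Potts-style) term on soft predictions at neighboring pixels. This is a simplified sketch, not the paper's CRF-based losses: the adjacency structure, the quadratic relaxation, and the function name are all illustrative.

```python
import numpy as np

def regularized_loss(probs, labels, adjacency, lam=1.0):
    """Weakly supervised segmentation loss.

    Cross-entropy is computed only on labeled pixels (label -1 marks
    unlabeled ones), and a relaxed Potts-style regularizer penalizes
    differing soft predictions at neighboring pixels, replacing an
    explicit MRF/CRF inference step that would generate full masks.
    probs: (n_pixels, n_classes) softmax outputs.
    adjacency: list of (i, j) neighbor index pairs.
    """
    labeled = labels >= 0
    ce = -np.log(probs[labeled, labels[labeled]] + 1e-12).mean()
    reg = sum(np.sum((probs[i] - probs[j]) ** 2) for i, j in adjacency)
    return ce + lam * reg / max(len(adjacency), 1)
```

Because both terms are differentiable in the network outputs, the regularizer is trained by ordinary back-propagation, avoiding the extra proposal-generation steps the abstract contrasts against.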
VV-Net: Voxel VAE Net with Group Convolutions for Point Cloud Segmentation
We present a novel algorithm for point cloud segmentation. Our approach
transforms unstructured point clouds into regular voxel grids, and further uses
a kernel-based interpolated variational autoencoder (VAE) architecture to
encode the local geometry within each voxel. Traditionally, the voxel
representation comprises only Boolean occupancy information, which fails to
capture the sparsely distributed points within voxels in a compact manner. In
order to handle sparse distributions of points, we further employ radial basis
functions (RBF) to compute a local, continuous representation within each
voxel. Our approach results in a good volumetric representation that
effectively tackles noisy point cloud datasets and is more robust for learning.
Moreover, we further introduce group equivariant CNNs to 3D by defining the
convolution operator on a symmetry group acting on Z^3 and its
isomorphic sets. This improves the expressive capacity without increasing
parameters, leading to more robust segmentation results. We highlight the
performance on standard benchmarks and show that our approach outperforms
state-of-the-art segmentation algorithms on the ShapeNet and S3DIS datasets.Comment: Accepted by International Conference on Computer Vision (ICCV) 201
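The RBF encoding described above can be sketched for a single voxel: instead of one Boolean occupancy bit, the points inside the voxel are summarized by Gaussian RBF responses at a fixed set of sample locations. This is a minimal sketch, not the paper's interpolated-VAE pipeline; the max-aggregation, the bandwidth, and the function name are assumptions.

```python
import numpy as np

def rbf_voxel_features(points, centers, gamma=4.0):
    """Continuous local representation of the points inside one voxel.

    Each fixed sample center responds with a Gaussian RBF of its
    distance to the nearest point, giving a compact, continuous
    description of sparse point distributions (vs. a single 0/1
    occupancy bit). Returns one feature per center.
    """
    if len(points) == 0:
        return np.zeros(len(centers))        # empty voxel
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2).max(axis=1)   # nearest-point response
```

Points sitting exactly on the sample centers produce the maximum response of 1, while an empty voxel yields all zeros, so the occupancy signal is recovered as a special case of the continuous representation.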
Zero-Shot Kernel Learning
In this paper, we address the open problem of zero-shot learning. Its
principle is based on learning a mapping that associates feature vectors
extracted from, e.g., images with attribute vectors that describe objects
and/or scenes of interest. In turn, this allows classifying unseen object
classes and/or scenes by matching feature vectors, via the mapping, to a
newly defined attribute vector describing a new class. Due to the importance
of this learning
task, there exist many methods that learn semantic, probabilistic, linear or
piece-wise linear mappings. In contrast, we apply well-established kernel
methods to learn a non-linear mapping between the feature and attribute spaces.
We propose a simple learning objective, inspired by Linear Discriminant
Analysis, Kernel-Target Alignment, and Kernel Polarization, that promotes
incoherence. We evaluate the performance of our algorithm on Polynomial as
well as shift-invariant Gaussian and Cauchy kernels. Despite the simplicity of our
approach, we obtain state-of-the-art results on several zero-shot learning
datasets and benchmarks, including the recent AWA2 dataset.
Comment: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018.
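The feature-to-attribute mapping described above can be sketched with generic kernel ridge regression: learn a non-linear map from image features to attribute vectors on seen classes, then assign a test sample to the unseen class with the nearest predicted attribute vector. This is a textbook stand-in for the paper's objective (which additionally promotes incoherence); the kernel bandwidth, regularizer, and function names are assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Shift-invariant Gaussian (RBF) kernel matrix."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def zero_shot_kernel(X_train, A_train, X_test, class_attrs, lam=1e-3):
    """Zero-shot classification via a kernelized feature-to-attribute map.

    Kernel ridge regression maps feature vectors to attribute vectors
    using seen-class data; each test sample is assigned the class whose
    attribute vector is nearest to its predicted attributes.
    """
    K = rbf_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), A_train)
    A_pred = rbf_kernel(X_test, X_train) @ alpha
    d = ((A_pred[:, None, :] - class_attrs[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)
```

Swapping `rbf_kernel` for a Polynomial or Cauchy kernel changes only the similarity function, which is what makes the kernel family a natural axis to evaluate, as the abstract does.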
A Unified Model for Near and Remote Sensing
We propose a novel convolutional neural network architecture for estimating
geospatial functions such as population density, land cover, or land use. In
our approach, we combine overhead and ground-level images in an end-to-end
trainable neural network, which uses kernel regression and density estimation
to convert features extracted from the ground-level images into a dense feature
map. The output of this network is a dense estimate of the geospatial function
in the form of a pixel-level labeling of the overhead image. To evaluate our
approach, we created a large dataset of overhead and ground-level images from a
major urban area with three sets of labels: land use, building function, and
building age. We find that our approach is more accurate for all tasks, in
some cases dramatically so.
Comment: International Conference on Computer Vision (ICCV) 2017.
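The kernel-regression step described above, spreading features extracted at sparse ground-level image locations into a dense map over the overhead image, can be sketched with Nadaraya-Watson kernel regression. This is a generic sketch of that step only, not the end-to-end network; the Gaussian kernel, bandwidth, and function name are assumptions.

```python
import numpy as np

def densify(grid_xy, sample_xy, sample_feats, bandwidth=1.0):
    """Nadaraya-Watson kernel regression to a dense feature map.

    grid_xy:      (n_pixels, 2) overhead-image pixel coordinates.
    sample_xy:    (n_samples, 2) ground-level image locations.
    sample_feats: (n_samples, d) features extracted at those locations.
    Each pixel receives a distance-weighted average of the sample
    features, yielding a dense feature map from sparse observations.
    """
    d2 = ((grid_xy[:, None, :] - sample_xy[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))   # Gaussian kernel weights
    w /= w.sum(axis=1, keepdims=True)        # normalize per pixel
    return w @ sample_feats
```

A pixel coinciding with one sample and far from all others simply inherits that sample's features, and because every step is differentiable the interpolation can sit inside an end-to-end trainable network as the abstract describes.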