5,645 research outputs found
Learning Deep Context-Network Architectures for Image Annotation
Context plays an important role in visual pattern recognition as it provides
complementary clues for different learning tasks including image classification
and annotation. In the particular scenario of kernel learning, the general
recipe of context-based kernel design consists in learning positive
semi-definite similarity functions that return high values not only when data
share similar content but also similar context. However, in spite of having a
positive impact on performance, the use of context in these kernel design
methods has not been fully explored; indeed, context has been handcrafted
instead of being learned. In this paper, we introduce a novel context-aware
kernel design framework based on deep learning. Our method discriminatively
learns spatial geometric context as the weights of a deep network (DN). The
architecture of this network is fully determined by the solution of an
objective function that mixes content, context and regularization, while the
parameters of this network determine the most relevant (discriminant) parts of
the learned context. We apply this context and kernel learning framework to
image classification using the challenging ImageCLEF Photo Annotation
benchmark; the latter shows that our deep context learning provides highly
effective kernels for image classification as corroborated through extensive
experiments.
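The recipe above, a kernel that returns high values only when data share both similar content and similar context, can be sketched minimally. Everything below (the RBF content kernel, the fixed mixing weight `alpha`, the toy context features) is a hypothetical illustration; in the paper the context term is learned as the weights of a deep network rather than handcrafted:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Content kernel: RBF similarity between two feature vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def context_aware_kernel(x, y, ctx_x, ctx_y, alpha=0.5, gamma=1.0):
    """Toy context-aware kernel: a convex mix of content similarity
    (between the samples themselves) and context similarity (between
    their surrounding samples). In the paper the context part is
    learned; here alpha and the context features are fixed inputs."""
    content = rbf(x, y, gamma)
    context = np.mean([rbf(cx, cy, gamma) for cx, cy in zip(ctx_x, ctx_y)])
    return (1 - alpha) * content + alpha * context

x, y = np.zeros(4), np.zeros(4)       # identical content
ctx_x = [np.zeros(4)]                 # identical context ...
ctx_y = [np.ones(4)]                  # ... versus different context
same_ctx = context_aware_kernel(x, y, ctx_x, ctx_x)
diff_ctx = context_aware_kernel(x, y, ctx_x, ctx_y)
print(same_ctx > diff_ctx)  # similar content AND context scores higher
```

The toy makes the design principle visible: identical content alone is not enough to reach the maximal similarity; the contexts must agree too.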
Learning Explicit Deep Representations from Deep Kernel Networks
Deep kernel learning aims at designing nonlinear combinations of multiple
standard elementary kernels by training deep networks. This scheme has proven
to be effective, but intractable when handling large-scale datasets, especially
when the depth of the trained networks increases; indeed, the complexity of
evaluating these networks scales quadratically w.r.t. the size of training data
and linearly w.r.t. the depth of the trained networks. In this paper, we
address the issue of efficient computation in Deep Kernel Networks (DKNs) by
designing effective maps in the underlying Reproducing Kernel Hilbert Spaces.
Given a pretrained DKN, our method builds its associated Deep Map Network (DMN)
whose inner product approximates the original network while being far more
efficient. The design principle of our method is greedy and achieved
layer-wise, by finding maps that approximate DKNs at different (input,
intermediate and output) layers. This design also considers an extra
fine-tuning step based on unsupervised learning, that further enhances the
generalization ability of the trained DMNs. When plugged into SVMs, these DMNs
turn out to be as accurate as the underlying DKNs while being at least an order
of magnitude faster on large-scale datasets, as shown through extensive
experiments on the challenging ImageCLEF and COREL5k benchmarks.
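The core idea, replacing implicit kernel evaluations with an explicit map whose inner product approximates the kernel, can be illustrated for a single layer. The sketch below uses a plain eigendecomposition of a small RBF Gram matrix (a Nyström-style construction); the paper builds such maps greedily and layer-wise for a whole deep kernel network, which this toy does not attempt:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

def rbf_gram(A, B, gamma=0.1):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Exact kernel: evaluating it scales quadratically with the data size.
K = rbf_gram(X, X)

# One "map layer": eigendecompose the Gram matrix and keep the top r
# components, giving an explicit map phi with phi(x) . phi(y) ~ K(x, y).
vals, vecs = np.linalg.eigh(K)
vals, vecs = vals[::-1], vecs[:, ::-1]             # sort descending
r = 20
phi = vecs[:, :r] * np.sqrt(np.clip(vals[:r], 0.0, None))

K_approx = phi @ phi.T
err = np.abs(K - K_approx).max()
print(err)  # worst-case entry error of the rank-r approximation
```

Once `phi` is built, downstream classifiers (e.g. linear SVMs) operate on the explicit features directly, avoiding the quadratic kernel evaluations at test time.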
Deep Context-Aware Kernel Networks
Context plays a crucial role in visual recognition as it provides
complementary clues for different learning tasks including image classification
and annotation. As the performances of these tasks are currently reaching a
plateau, any extra knowledge, including context, should be leveraged in order
to seek significant leaps in these performances. In the particular scenario of
kernel machines, context-aware kernel design aims at learning positive
semi-definite similarity functions which return high values not only when data
share similar contents, but also similar structures (a.k.a contexts). However,
the use of context in kernel design has not been fully explored; indeed,
context in these solutions is handcrafted instead of being learned. In this
paper, we introduce a novel deep network architecture that learns context in
kernel design. This architecture is fully determined by the solution of an
objective function mixing a content term that captures the intrinsic similarity
between data, a context criterion which models their structure and a
regularization term that helps designing smooth kernel network representations.
The solution of this objective function defines a particular deep network
architecture whose parameters correspond to different variants of learned
contexts including layerwise, stationary and classwise; larger values of these
parameters correspond to the most influential contextual relationships between
data. Extensive experiments conducted on the challenging ImageCLEF Photo
Annotation and Corel5k benchmarks show that our deep context networks are
highly effective for image classification and the learned contexts further
enhance the performance of image annotation.
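A minimal sketch of a kernel recursion in this spirit: each iterate mixes a content term with a context term that propagates similarity through the data's structure. The structure matrix `P`, the mixing weight `beta` and the normalization below are toy assumptions; in the paper these quantities are learned and their unfolding defines the deep network architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
S = rng.random((n, n)); S = (S + S.T) / 2              # content similarity
P = rng.random((n, n)); P /= P.sum(1, keepdims=True)   # context structure

def context_kernel(S, P, beta=0.1, iters=10):
    """Toy context-aware kernel recursion: similarity of two samples
    mixes their content similarity S with the similarity of their
    contexts, propagated through P."""
    K = S.copy()
    for _ in range(iters):
        K = S + beta * (P @ K @ P.T)   # content term + propagated context
        K /= np.linalg.norm(K)         # keep iterates bounded (Frobenius)
    return K

K = context_kernel(S, P)
print(np.allclose(K, K.T))  # the iterates stay symmetric
```

Symmetry of the iterates is what keeps the result a valid similarity function; positive semi-definiteness additionally requires conditions on `S`, `P` and `beta` that the toy does not enforce.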
End-to-end training of deep kernel map networks for image classification
Deep kernel map networks have shown excellent performances in various
classification problems including image annotation. Their general recipe
consists in aggregating several layers of singular value decompositions (SVDs)
-- that map data from input spaces into high dimensional spaces -- while
preserving the similarity of the underlying kernels. However, the potential of
these deep map networks has not been fully explored as the original setting of
these networks focuses mainly on the approximation quality of their kernels and
ignores their discrimination power. In this paper, we introduce a novel
"end-to-end" design for deep kernel map learning that balances the
approximation quality of kernels and their discrimination power. Our method
proceeds in two steps; first, layerwise SVD is applied in order to build
initial deep kernel map approximations and then an "end-to-end" supervised
learning is employed to further enhance their discrimination power while
maintaining their efficiency. Extensive experiments, conducted on the
challenging ImageCLEF annotation benchmark, show the high efficiency and the
outperformance of this two-step process with respect to different related
methods.
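The two-step process can be sketched on a toy problem: an SVD-based unsupervised initialization of the map, followed by supervised gradient updates of both the map and a classifier on top of it. The linear map, logistic loss and learning rate below are simplifying assumptions, not the paper's actual deep kernel map:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class data.
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.r_[np.zeros(30), np.ones(30)]

# Step 1: unsupervised, layer-wise init via SVD (approximation quality).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt.T * s                       # linear "map" seeded from SVD factors

# Step 2: end-to-end supervised fine-tuning (discrimination power):
# jointly adjust the map W and a logistic classifier w on top of it.
w, lr = np.zeros(2), 0.1
for _ in range(200):
    Z = X @ W                          # current explicit map
    p = 1.0 / (1.0 + np.exp(-(Z @ w)))
    g = p - y                          # logistic-loss residual
    w -= lr * (Z.T @ g) / len(y)       # classifier gradient step
    W -= lr * np.outer(X.T @ g, w) / len(y)  # map gradient step

acc = ((p > 0.5) == y).mean()          # training accuracy of last iterate
print(acc)
```

The point of the second step is that the map is no longer frozen at its approximation-optimal initialization; it moves to whatever nearby map better separates the classes.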
Asymmetrically Weighted CCA And Hierarchical Kernel Sentence Embedding For Image & Text Retrieval
Joint modeling of language and vision has been drawing increasing interest. A
multimodal data representation allowing for bidirectional retrieval of images
by sentences and vice versa is a key aspect. In this paper we present three
contributions in canonical correlation analysis (CCA) based multimodal
retrieval. Firstly, we show that an asymmetric weighting of the canonical
weights, while achieving a cross view mapping from the search to the query
space, improves the retrieval performance. Secondly, we devise a
computationally efficient model selection, crucial to generalization and
stability, in the framework of the Björck-Golub algorithm for regularized CCA
via spectral filtering. Finally, we introduce a Hierarchical Kernel Sentence
Embedding (HKSE) that approximates Kernel CCA for a special similarity kernel
between distribution of words embedded in a vector space. State of the art
results are obtained on MSCOCO and Flickr benchmarks when these three
techniques are used in conjunction.
Comment: Under Review CVPR 201
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
Deep learning models with convolutional and recurrent networks are now
ubiquitous and analyze massive amounts of audio, image, video, text and graph
data, with applications in automatic translation, speech-to-text, scene
understanding, ranking user preferences, ad placement, etc. Competing
frameworks for building these networks such as TensorFlow, Chainer, CNTK,
Torch/PyTorch, Caffe1/2, MXNet and Theano, explore different tradeoffs between
usability and expressiveness, research or production orientation and supported
hardware. They operate on a DAG of computational operators, wrapping
high-performance libraries such as CUDNN (for NVIDIA GPUs) or NNPACK (for
various CPUs), and automate memory allocation, synchronization, distribution.
Custom operators are needed where the computation does not fit existing
high-performance library calls, usually at a high engineering cost. This is
frequently required when new operators are invented by researchers: such
operators suffer a severe performance penalty, which limits the pace of
innovation. Furthermore, even if there is an existing runtime call these
frameworks can use, it often doesn't offer optimal performance for a user's
particular network architecture and dataset, missing optimizations between
operators as well as optimizations that can be done knowing the size and shape
of data. Our contributions include (1) a language close to the mathematics of
deep learning called Tensor Comprehensions, (2) a polyhedral Just-In-Time
compiler to convert a mathematical description of a deep learning DAG into a
CUDA kernel with delegated memory management and synchronization, also
providing optimizations such as operator fusion and specialization for specific
sizes, and (3) a compilation cache populated by an autotuner.
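A flavor of the language, using the matmul comprehension from the paper as the running example: loop bounds are inferred from the tensor shapes, and `+=!` denotes a zero-initialized sum over any index that appears only on the right-hand side. The NumPy einsum below is only a reference for the semantics, not the compiled CUDA output:

```python
import numpy as np

# A Tensor Comprehension for matrix multiplication; index ranges are
# inferred from the declared shapes, and "+=!" reduces over the
# otherwise-unbound index k starting from zero:
TC_MATMUL = """
def matmul(float(M,K) A, float(K,N) B) -> (C) {
    C(m, n) +=! A(m, k) * B(k, n)
}
"""

# Reference semantics of that comprehension, written as an einsum:
# every index on the left is a loop, every unbound index is reduced.
def matmul_reference(A, B):
    return np.einsum("mk,kn->mn", A, B)

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
C = matmul_reference(A, B)
print(np.allclose(C, A @ B))  # True
```

The appeal is that the one-line comprehension, not a hand-written CUDA kernel, is what the polyhedral JIT compiler and autotuner consume.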
Totally Deep Support Vector Machines
Support vector machines (SVMs) have been successful in solving many computer
vision tasks including image and video category recognition especially for
small and mid-scale training problems. The principle of these non-parametric
models is to learn hyperplanes that separate data belonging to different
classes while maximizing their margins. However, SVMs constrain the learned
hyperplanes to lie in the span of support vectors, fixed/taken from training
data, and this reduces their representational power and may lead to limited
generalization performances. In this paper, we relax this constraint and allow
the support vectors to be learned (instead of being fixed/taken from training
data) in order to better fit a given classification task. Our approach,
referred to as deep total variation support vector machines, is parametric and
relies on a novel deep architecture that learns not only the SVM and the kernel
parameters but also the support vectors, resulting in highly effective
classifiers. We also show (under a particular setting of the activation
functions in this deep architecture) that a large class of kernels and their
combinations can be learned. Experiments conducted on the challenging task of
skeleton-based action recognition show the outperformance of our deep total
variation SVMs w.r.t. different baselines as well as the related work.
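The relaxation described above can be sketched as a forward pass: the decision function keeps the SVM form, but the support vectors become free parameters. The RBF kernel, sizes and random initialization below are toy assumptions, and the backpropagation that actually learns them is omitted:

```python
import numpy as np

def rbf(X, S, gamma=1.0):
    """RBF responses of inputs X against a bank of support vectors S."""
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class ParametricSVM:
    """Sketch of the idea: an SVM-like decision function
    f(x) = sum_i alpha_i k(x, s_i) + b where the support vectors s_i
    are free parameters (learned by backpropagation in the paper)
    rather than being fixed/taken from the training data."""
    def __init__(self, n_support=5, dim=2, seed=0):
        rng = np.random.default_rng(seed)
        self.S = rng.normal(size=(n_support, dim))   # learnable
        self.alpha = rng.normal(size=n_support)      # learnable
        self.b = 0.0                                 # learnable

    def decision(self, X):
        return rbf(X, self.S) @ self.alpha + self.b

svm = ParametricSVM()
X = np.zeros((3, 2))
print(svm.decision(X).shape)  # (3,)
```

Because `S`, `alpha` and `b` are all parameters of one differentiable function, the whole model can be trained end-to-end like any other deep architecture.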
Action Recognition with Kernel-based Graph Convolutional Networks
Learning graph convolutional networks (GCNs) is an emerging field which aims
at generalizing deep learning to arbitrary non-regular domains. Most of the
existing GCNs follow a neighborhood aggregation scheme, where the
representation of a node is recursively obtained by aggregating its neighboring
node representations using averaging or sorting operations. However, these
operations are either ill-posed or too weak to be discriminant, or they increase the
number of training parameters and thereby the computational complexity and the
risk of overfitting. In this paper, we introduce a novel GCN framework that
achieves spatial graph convolution in a reproducing kernel Hilbert space
(RKHS). The latter makes it possible to design, via implicit kernel
representations, convolutional graph filters in a high dimensional and more
discriminating space without increasing the number of training parameters. The
particularity of our GCN model also resides in its ability to achieve
convolutions without explicitly realigning nodes in the receptive fields of the
learned graph filters with those of the input graphs, thereby making
convolutions permutation agnostic and well defined. Experiments conducted on
the challenging task of skeleton-based action recognition show the superiority
of the proposed method against different baselines as well as the related work.Comment: arXiv admin note: text overlap with arXiv:1912.0586
Relative Saliency and Ranking: Models, Metrics, Data, and Benchmarks
Salient object detection is a problem that has been considered in detail and
many solutions have been proposed. In this paper, we argue
that work to date has addressed a problem that is relatively ill-posed.
Specifically, there is not universal agreement about what constitutes a salient
object when multiple observers are queried. This implies that some objects are
more likely to be judged salient than others, and implies a relative rank
exists on salient objects. Initially, we present a novel deep learning solution
based on a hierarchical representation of relative saliency and stage-wise
refinement. Further to this, we present data, analysis and baseline benchmark
results towards addressing the problem of salient object ranking. Methods for
deriving suitable ranked salient object instances are presented, along with
metrics suitable to measuring algorithm performance. In addition, we show how a
derived dataset can be successively refined to provide cleaned results that
correlate well with pristine ground truth in its characteristics and value for
training and testing models. Finally, we provide a comparison among prevailing
algorithms that address salient object ranking or detection to establish
initial baselines providing a basis for comparison with future efforts
addressing this problem. \textcolor{black}{The source code and data are
publicly available via our project page:}
\textrm{\href{https://ryersonvisionlab.github.io/cocosalrank.html}{ryersonvisionlab.github.io/cocosalrank}}Comment: Accepted to Transaction on Pattern Analysis and Machine Intelligence.
arXiv admin note: substantial text overlap with arXiv:1803.0508
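A metric in the spirit of the ranking evaluation discussed above can be sketched as a Spearman-style correlation between predicted and ground-truth object rank orders; the exact metric used by the paper may differ, so treat this as an illustrative stand-in:

```python
import numpy as np

def rank(v):
    """Ranks (0 = lowest) of the entries of v; ties broken by order."""
    r = np.empty(len(v), dtype=float)
    r[np.argsort(v)] = np.arange(len(v))
    return r

def salient_object_rank_score(pred_sal, gt_rank):
    """Spearman-style correlation between the rank order induced by
    predicted per-object saliency and the ground-truth rank order:
    1.0 for perfect agreement, -1.0 for a fully reversed ordering."""
    rp, rg = rank(np.asarray(pred_sal)), rank(np.asarray(gt_rank))
    rp = rp - rp.mean(); rg = rg - rg.mean()
    return float(rp @ rg / (np.linalg.norm(rp) * np.linalg.norm(rg)))

# Perfectly ordered predictions score 1, reversed predictions score -1.
s_good = salient_object_rank_score([0.9, 0.5, 0.1], [2, 1, 0])
s_bad = salient_object_rank_score([0.1, 0.5, 0.9], [2, 1, 0])
print(s_good, s_bad)  # 1.0 -1.0
```

Such a correlation-based score is what makes relative rank, rather than binary salient/non-salient detection, the quantity being measured.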
Deep hierarchical pooling design for cross-granularity action recognition
In this paper, we introduce a novel hierarchical aggregation design that
captures different levels of temporal granularity in action recognition. Our
design principle is coarse-to-fine and achieved using a tree-structured
network; as we traverse this network top-down, pooling operations become less
invariant but more temporally resolute and better localized. Learning the
combination of operations in this network -- which best fits a given
ground-truth -- is obtained by solving a constrained minimization problem whose
solution corresponds to the distribution of weights that capture the
contribution of each level (and thereby temporal granularity) in the global
hierarchical pooling process. Besides being principled and well grounded, the
proposed hierarchical pooling is also video-length agnostic and resilient to
misalignments in actions. Extensive experiments conducted on the challenging
UCF-101 database corroborate these statements.
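The coarse-to-fine tree can be sketched as levels of temporal average pooling whose outputs are combined with per-level weights. The fixed weights and binary splits below are assumptions; in the paper the weights come from a constrained minimization problem against the ground truth:

```python
import numpy as np

def hierarchical_pool(frames, weights, depth=3):
    """Coarse-to-fine temporal pooling over a binary tree: level l
    splits the sequence into 2**l segments and average-pools each;
    deeper levels are less invariant but better localized in time.
    The per-level weights (learned in the paper, fixed here) combine
    the levels into one fixed-size descriptor."""
    parts = []
    for level, w in zip(range(depth), weights):
        segs = np.array_split(frames, 2 ** level)   # 1, 2, 4, ... segments
        parts.append(w * np.concatenate([s.mean(0) for s in segs]))
    return np.concatenate(parts)

frames = np.random.default_rng(0).normal(size=(16, 3))  # 16 frames, 3-dim
weights = np.array([0.5, 0.3, 0.2])                     # per-level weights
desc = hierarchical_pool(frames, weights)
print(desc.shape)  # (21,) = (1 + 2 + 4) segments x 3 dims
```

The descriptor length depends only on the tree depth and feature dimension, not on the number of frames, which is what makes the pooling video-length agnostic.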