16,736 research outputs found
Similarity Learning for High-Dimensional Sparse Data
A good measure of similarity between data points is crucial to many tasks in
machine learning. Similarity and metric learning methods learn such measures
automatically from data, but they do not scale well respect to the
dimensionality of the data. In this paper, we propose a method that can learn
efficiently similarity measure from high-dimensional sparse data. The core idea
is to parameterize the similarity measure as a convex combination of rank-one
matrices with specific sparsity structures. The parameters are then optimized
with an approximate Frank-Wolfe procedure to maximally satisfy relative
similarity constraints on the training data. Our algorithm greedily
incorporates one pair of features at a time into the similarity measure,
providing an efficient way to control the number of active features and thus
reduce overfitting. It enjoys very appealing convergence guarantees and its
time and memory complexity depends on the sparsity of the data instead of the
dimension of the feature space. Our experiments on real-world high-dimensional
datasets demonstrate its potential for classification, dimensionality reduction
and data exploration.Comment: 14 pages. Proceedings of the 18th International Conference on
Artificial Intelligence and Statistics (AISTATS 2015). Matlab code:
https://github.com/bellet/HDS
Distribution matching for transduction
Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach.
Deep Network Flow for Multi-Object Tracking
Data association problems are an important component of many computer vision
applications, with multi-object tracking being one of the most prominent
examples. A typical approach to data association involves finding a graph
matching or network flow that minimizes a sum of pairwise association costs,
which are often either hand-crafted or learned as linear functions of fixed
features. In this work, we demonstrate that it is possible to learn features
for network-flow-based data association via backpropagation, by expressing the
optimum of a smoothed network flow problem as a differentiable function of the
pairwise association costs. We apply this approach to multi-object tracking
with a network flow formulation. Our experiments demonstrate that we are able
to successfully learn all cost functions for the association problem in an
end-to-end fashion, which outperform hand-crafted costs in all settings. The
integration and combination of various sources of inputs becomes easy and the
cost functions can be learned entirely from data, alleviating tedious
hand-designing of costs.Comment: Accepted to CVPR 201
An ontology enhanced parallel SVM for scalable spam filter training
This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2013 Elsevier B.V.Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart
- …