Fast Parallel SVM using Data Augmentation
As one of the most popular classifiers, linear SVMs still face challenges with
very large-scale problems, even though linear or sub-linear algorithms have
been developed recently for single machines. Parallel computing
methods have been developed for learning large-scale SVMs. However, existing
methods rely on solving local sub-optimization problems. In this paper, we
develop a novel parallel algorithm for learning large-scale linear SVMs. Our
approach is based on a data augmentation equivalent formulation, which casts
the problem of learning SVM as a Bayesian inference problem, for which we can
develop very efficient parallel sampling methods. We provide empirical results
for this parallel sampling SVM, along with extensions to SVR, non-linear
kernels, and a parallel implementation of the Crammer-Singer multi-class model.
The approach is very promising in its own right and, further, is a useful
technique for parallelizing a broader family of general maximum-margin models.
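
The augmentation admits a compact Gibbs sampler even on a single machine. Below is a minimal sketch in the style of Polson and Scott's latent-variable SVM representation, assuming labels in {-1, +1} and a Gaussian prior on the weights; the function name and defaults are ours, and the parallelized sampling scheme that is the paper's actual contribution is not shown.

```python
import numpy as np

def gibbs_svm(X, y, n_iter=200, reg=1.0, seed=0):
    # Toy single-machine Gibbs sampler for the latent-variable (data
    # augmentation) view of the linear SVM; labels y must be in {-1, +1}.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        # Latent scales: 1/lambda_i ~ InverseGaussian(1/|1 - y_i x_i.w|, 1)
        margins = 1.0 - y * (X @ w)
        mu = 1.0 / np.maximum(np.abs(margins), 1e-8)
        lam = 1.0 / rng.wald(mu, 1.0)
        # Weights: w | lambda is Gaussian (Gaussian prior with precision `reg`)
        prec = reg * np.eye(d) + X.T @ (X / lam[:, None])
        cov = np.linalg.inv(prec)
        mean = cov @ (X.T @ (y * (1.0 + lam) / lam))
        w = rng.multivariate_normal(mean, cov)
    return w
```

Each sweep alternates an inverse-Gaussian draw for the per-example scales with a Gaussian draw for the weights, a structure naturally amenable to parallel sampling.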
Learning Random Fourier Features by Hybrid Constrained Optimization
The kernel embedding algorithm is an important component for adapting kernel
methods to large datasets. Since the embedding accounts for a major share of
the computation cost in the testing phase, we propose a novel teacher-learner framework for
learning computation-efficient kernel embeddings from specific data. In the
framework, the high-precision embeddings (teacher) transfer the data
information to the computation-efficient kernel embeddings (learner). We
jointly select informative embedding functions and pursue an orthogonal
transformation between two embeddings. We propose a novel approach of
constrained variational expectation maximization (CVEM), where the alternating
direction method of multipliers (ADMM) is applied over a nonconvex domain in the
maximization step. We also propose two specific formulations based on the
prevalent Random Fourier Feature (RFF), the masked and blocked version of
Computation-Efficient RFF (CERF), by imposing a random binary mask or a block
structure on the transformation matrix. Through empirical studies of several
applications on different real-world datasets, we demonstrate that CERF
significantly improves the performance of kernel methods over the RFF under
given arithmetic-operation budgets, and is suitable for structured matrix
multiplication in Fastfood-type algorithms.
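
For context, the sketch below shows plain random Fourier features for an RBF kernel together with a random binary mask on the projection matrix. The mask is drawn at random rather than learned, so this illustrates only the masked-CERF structure, not the teacher-learner CVEM training; all names are hypothetical.

```python
import numpy as np

def masked_rff(X, n_feat=256, gamma=1.0, mask_keep=0.5, seed=0):
    # Random Fourier features for the RBF kernel exp(-gamma * ||x - y||^2),
    # with an optional random binary mask sparsifying the projection matrix.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_feat))
    if mask_keep < 1.0:
        W = W * (rng.random((d, n_feat)) < mask_keep)  # random binary mask
    b = rng.uniform(0, 2 * np.pi, size=n_feat)
    return np.sqrt(2.0 / n_feat) * np.cos(X @ W + b)
```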
Not-So-Random Features
We propose a principled method for kernel learning, which relies on a
Fourier-analytic characterization of translation-invariant or
rotation-invariant kernels. Our method produces a sequence of feature maps,
iteratively refining the SVM margin. We provide rigorous guarantees for
optimality and generalization, interpreting our algorithm as online
equilibrium-finding dynamics in a certain two-player min-max game. Evaluations
on synthetic and real-world datasets demonstrate scalability and consistent
improvements over related random features-based methods.
Comment: Published as a conference paper at ICLR 2018
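
One way to picture feature maps that are "not so random" is a greedy loop that screens random candidate frequencies and keeps only those that correlate with the labels. The sketch below is that crude analogue, assuming labels in {-1, +1}; it is not the paper's equilibrium-finding algorithm and carries none of its guarantees.

```python
import numpy as np

def greedy_fourier_features(X, y, n_feat=64, n_cand=500, gamma=1.0, seed=0):
    # Boosting-style screening: each round, draw candidate frequencies and
    # keep the one whose cosine feature best correlates with the labels.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    kept_W, kept_b = [], []
    for _ in range(n_feat):
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(n_cand, d))
        b = rng.uniform(0, 2 * np.pi, n_cand)
        Z = np.cos(X @ W.T + b)              # (n, n_cand) candidate features
        k = int(np.argmax(np.abs(Z.T @ y)))  # best label correlation
        kept_W.append(W[k])
        kept_b.append(b[k])
    Ws, bs = np.stack(kept_W), np.array(kept_b)
    return np.sqrt(2.0 / n_feat) * np.cos(X @ Ws.T + bs)
```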
An Efficient Primal-Dual Prox Method for Non-Smooth Optimization
We study non-smooth optimization problems in machine learning, where both
the loss function and the regularizer are non-smooth. Previous
studies on efficient empirical loss minimization assume either a smooth loss
function or a strongly convex regularizer, making them unsuitable for
non-smooth optimization. We develop a simple yet efficient method for a family
of non-smooth optimization problems where the dual form of the loss function is
bilinear in primal and dual variables. We cast a non-smooth optimization
problem into a minimax optimization problem, and develop a primal-dual prox
method that solves the minimax optimization problem at a rate of $O(1/T)$,
assuming that the proximal step can be efficiently solved, significantly
faster than a standard subgradient descent method that has an $O(1/\sqrt{T})$
convergence rate. Our empirical study verifies the efficiency of the proposed
method for various non-smooth optimization problems that arise ubiquitously in
machine learning by comparing it to state-of-the-art first-order methods.
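
As a concrete instance of the setting, L1-regularized hinge loss has a bilinear saddle-point form, and a primal-dual prox method alternates a projected dual step with a soft-thresholding primal step. The sketch below is a minimal version with a single fixed step size; the paper's method prescribes step sizes and averaging more carefully.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pd_prox_svm_l1(X, y, lam=0.01, step=0.1, n_iter=1000):
    # Saddle-point form of L1-regularized hinge loss (labels in {-1, +1}):
    #   min_w max_{a in [0,1]^n} (1/n) sum_i a_i (1 - y_i x_i.w) + lam*||w||_1
    n, d = X.shape
    w, a = np.zeros(d), np.full(n, 0.5)
    w_avg = np.zeros(d)
    for _ in range(n_iter):
        # Dual ascent step, then projection onto the box [0, 1]^n
        a = np.clip(a + step * (1.0 - y * (X @ w)) / n, 0.0, 1.0)
        # Primal descent on the bilinear term, then the L1 prox
        w = soft_threshold(w + step * (X.T @ (a * y)) / n, step * lam)
        w_avg += w
    return w_avg / n_iter
```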
Doubly Stochastic Primal-Dual Coordinate Method for Bilinear Saddle-Point Problem
We propose a doubly stochastic primal-dual coordinate optimization algorithm
for empirical risk minimization, which can be formulated as a bilinear
saddle-point problem. In each iteration, our method randomly samples a block of
coordinates of the primal and dual solutions to update. The linear convergence
of our method can be established in terms of 1) the distance from the current
iterate to the optimal solution and 2) the primal-dual objective gap. We show
that the proposed method has a lower overall complexity than existing
coordinate methods when either the data matrix has a factorized structure or
the proximal mapping on each block is computationally expensive, e.g.,
involving an eigenvalue decomposition. The efficiency of the proposed method is
confirmed by empirical studies on several real applications, such as the
multi-task large margin nearest neighbor problem.
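
A toy instance helps fix ideas: ridge regression can be written as a bilinear saddle-point problem, and a doubly stochastic method touches only a sampled block of primal and of dual coordinates per iteration. The sketch below follows that pattern with crude step sizes; the block sizes, sampling probabilities, and step-size rules in the paper are chosen to obtain the stated linear convergence.

```python
import numpy as np

def dspdc_ridge(A, b, lam=0.1, blk=8, step=0.5, n_iter=5000, seed=0):
    # Bilinear saddle point: min_x max_y  y.Ax - ||y||^2/2 - b.y + lam/2*||x||^2,
    # whose solution is ridge regression on (A, b). Each iteration updates
    # only a random block of dual and of primal coordinates, via the
    # closed-form prox of the quadratic terms.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x, y = np.zeros(d), np.zeros(n)
    for _ in range(n_iter):
        i = rng.choice(n, size=min(blk, n), replace=False)  # dual block
        j = rng.choice(d, size=min(blk, d), replace=False)  # primal block
        y[i] = (y[i] + step * (A[i] @ x - b[i])) / (1.0 + step)
        x[j] = (x[j] - step * (y @ A[:, j])) / (1.0 + step * lam)
    return x
```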
Learning Data-adaptive Nonparametric Kernels
Traditional kernels or their combinations are often not sufficiently flexible
to fit the data in complicated practical tasks. In this paper, we present a
Data-Adaptive Nonparametric Kernel (DANK) learning framework by imposing an
adaptive matrix on the kernel/Gram matrix in an entry-wise manner. Since we
do not specify the formulation of the adaptive matrix, each entry in it can be
directly and flexibly learned from the data. Therefore, the solution space of
the learned kernel is largely expanded, which makes DANK flexible to adapt to
the data. Specifically, the proposed kernel learning framework can be
seamlessly embedded into support vector machines (SVM) and support vector
regression (SVR), enlarging the margin between classes and reducing the model
generalization error. Theoretically, we
demonstrate that the objective function of our devised model is
gradient-Lipschitz continuous, so the training process for kernel and
parameter learning in SVM/SVR can be efficiently optimized in a unified
framework. Further, to address the scalability issue in DANK, a
decomposition-based scalable approach is developed, whose effectiveness
is demonstrated by both empirical studies and theoretical guarantees.
Experimentally, our method outperforms other representative kernel learning
algorithms on various classification and regression benchmark datasets.
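
To illustrate the entry-wise idea in isolation, the sketch below learns an adaptive matrix F by gradient ascent on normalized kernel-target alignment and returns the Hadamard product F * K as the adapted Gram matrix. The alignment objective is a hypothetical stand-in: the actual DANK formulation couples F with the SVM/SVR dual and constrains F so the learned kernel stays well behaved.

```python
import numpy as np

def dank_alignment(K, y, lr=0.05, n_steps=100):
    # Learn an entry-wise adaptive matrix F and use F * K (Hadamard product)
    # as the adapted Gram matrix, by gradient ascent on the normalized
    # kernel-target alignment  <F*K, y y^T> / ||F*K||_F.
    Y = np.outer(y, y)
    F = np.ones_like(K, dtype=float)
    for _ in range(n_steps):
        G = F * K                                    # current adapted Gram
        nrm = np.linalg.norm(G)
        align = np.sum(G * Y) / nrm
        grad = (Y / nrm - align * G / nrm**2) * K    # chain rule through G = F*K
        F += lr * grad
    return F * K
```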
MaxiMin Active Learning in Overparameterized Model Classes
Generating labeled training datasets has become a major bottleneck in Machine
Learning (ML) pipelines. Active ML aims to address this issue by designing
learning algorithms that automatically and adaptively select the most
informative examples for labeling so that human time is not wasted labeling
irrelevant, redundant, or trivial examples. This paper proposes a new approach
to active ML with nonparametric or overparameterized models such as kernel
methods and neural networks. In the context of binary classification, the new
approach is shown to possess a variety of desirable properties that allow
active learning algorithms to automatically and efficiently identify decision
boundaries and data clusters.
Comment: 43 pages, 12 figures
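
One simplified reading of a maximin query rule for kernel interpolation: for each unlabeled point and each tentative label, fit the minimum-norm interpolant and score the point by the smaller of the two RKHS norms, then query the highest-scoring point (if both labelings are "hard", the point likely sits near the decision boundary). The brute-force sketch below uses hypothetical names and ignores the paper's regularization and efficiency considerations.

```python
import numpy as np

def maximin_query(K, labeled, unlabeled, y):
    # For each unlabeled index u and tentative label s, compute the RKHS
    # norm ||f||^2 = y' K^{-1} y of the minimum-norm interpolant of the
    # labeled data plus (u, s); query the point whose smaller-of-the-two
    # norms is largest. One linear solve per (point, label) pair.
    best_u, best_score = None, -np.inf
    for u in unlabeled:
        idx = list(labeled) + [u]
        Ks = K[np.ix_(idx, idx)] + 1e-8 * np.eye(len(idx))  # jitter for stability
        norms = []
        for s in (-1.0, 1.0):
            ys = np.append(y, s)
            alpha = np.linalg.solve(Ks, ys)
            norms.append(float(ys @ alpha))
        if min(norms) > best_score:
            best_u, best_score = u, min(norms)
    return best_u
```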
Towards Ultrahigh Dimensional Feature Selection for Big Data
In this paper, we present a new adaptive feature scaling scheme for
ultrahigh-dimensional feature selection on Big Data. To solve this problem
effectively, we first reformulate it as a convex semi-infinite programming
(SIP) problem and then propose an efficient \emph{feature generating paradigm}.
In contrast with traditional gradient-based approaches that conduct
optimization on all input features, the proposed method iteratively activates a
group of features and solves a sequence of multiple kernel learning (MKL)
subproblems of much reduced scale. To further speed up the training, we propose
to solve the MKL subproblems in their primal forms through a modified
accelerated proximal gradient approach. Due to such an optimization scheme,
some efficient cache techniques are also developed. The feature generating
paradigm guarantees that the solution converges globally under mild
conditions and achieves a lower feature selection bias. Moreover, the proposed
method can tackle two challenging tasks in feature selection: 1) group-based
feature selection with complex structures and 2) nonlinear feature selection
with explicit feature mappings. Comprehensive experiments on a wide range of
synthetic and real-world datasets containing tens of millions of data points
with ultrahigh-dimensional features demonstrate the competitive performance
of the proposed method over state-of-the-art feature selection methods in
terms of generalization performance and training efficiency.
Comment: 61 pages
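
The active-set spirit of the feature generating paradigm can be imitated in a few lines: score all features by a gradient-style statistic at the current model, activate the top-k, and retrain on the small active set only. The sketch below does this with logistic regression and labels in {0, 1}; it stands in for, but does not implement, the paper's SIP reformulation, MKL subproblems, or cache techniques.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def feature_generating_loop(X, y, k=10, n_rounds=5):
    # Score all features by a gradient-style statistic at the current
    # model, activate the top-k new ones, and retrain on the (small)
    # active set only. Labels y are assumed to be in {0, 1}.
    active = []
    resid = y - 0.5                          # logistic-loss gradient signal at w = 0
    model = None
    for _ in range(n_rounds):
        scores = np.abs(X.T @ resid)
        if active:
            scores[active] = -np.inf         # skip already-active features
        active += list(np.argsort(scores)[-k:])
        model = LogisticRegression(max_iter=1000).fit(X[:, active], y)
        resid = y - model.predict_proba(X[:, active])[:, 1]
    return active, model
```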
A Survey on Learning to Hash
Nearest neighbor search is the problem of finding the data points in a
database whose distances to the query point are smallest.
Learning to hash is one of the major solutions to this problem and has been
widely studied recently. In this paper, we present a comprehensive survey of
learning to hash algorithms, categorizing them by how they preserve
similarities into pairwise similarity preserving, multiwise similarity
preserving, implicit similarity preserving, and quantization, and we discuss
their relations. We treat quantization separately from pairwise similarity
preserving because its objective function is very different, even though, as
we show, quantization can be derived from preserving pairwise similarities.
In addition, we present the evaluation protocols and a general performance
analysis, and point out that quantization algorithms perform the best in terms of
search accuracy, search time cost, and space cost. Finally, we introduce a few
emerging topics.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
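
As a baseline for the families the survey covers, the sketch below implements random-hyperplane hashing: signs of random projections as binary codes, followed by Hamming ranking. Learned hashing methods replace the random projections with data-dependent ones; the names here are our own.

```python
import numpy as np

def hamming_search(X_db, X_q, n_bits=32, seed=0):
    # Random-hyperplane hashing: sign of random projections as binary
    # codes, then rank database items by Hamming distance per query.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X_db.shape[1], n_bits))
    codes_db = X_db @ W > 0
    codes_q = X_q @ W > 0
    dists = (codes_q[:, None, :] ^ codes_db[None, :, :]).sum(axis=-1)
    return np.argsort(dists, axis=1)          # nearest first, per query
```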
A Convex Relaxation for Weakly Supervised Classifiers
This paper introduces a general multi-class approach to weakly supervised
classification. Inferring the labels and learning the parameters of the model
are usually done jointly through a block-coordinate descent algorithm such as
expectation-maximization (EM), which may lead to local minima. To avoid this
problem, we propose a cost function based on a convex relaxation of the
soft-max loss. We then propose an algorithm specifically designed to
efficiently solve the corresponding semidefinite program (SDP). Empirically,
our method compares favorably to standard ones on different datasets for
multiple instance learning and semi-supervised learning as well as on
clustering tasks.
Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)
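
The relax-then-round pattern behind such SDPs can be shown on the classic max-cut-style relaxation: replace the combinatorial label matrix M = y y^T with a PSD matrix of unit diagonal, maximize agreement with a similarity matrix, then round via the top eigenvector. The cvxpy sketch below illustrates only this pattern; the paper's SDP arises from relaxing a soft-max loss and comes with a dedicated solver.

```python
import cvxpy as cp
import numpy as np

def sdp_label_relaxation(K):
    # Relax the combinatorial label matrix M = y y^T (y in {-1, 1}^n) to a
    # PSD matrix with unit diagonal, maximize agreement with the similarity
    # matrix K, then round via the top eigenvector of the solution.
    n = K.shape[0]
    M = cp.Variable((n, n), PSD=True)
    prob = cp.Problem(cp.Maximize(cp.trace(K @ M)), [cp.diag(M) == 1])
    prob.solve()
    w, V = np.linalg.eigh(M.value)
    return np.sign(V[:, -1])                  # rounded binary labels
```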