3,780 research outputs found

    Fast Parallel SVM using Data Augmentation

    Linear SVMs, among the most popular classifiers, remain difficult to train on very large-scale problems, even though linear and sub-linear algorithms have recently been developed for single machines. Parallel computing methods have been developed for learning large-scale SVMs, but existing methods rely on solving local sub-optimization problems. In this paper, we develop a novel parallel algorithm for learning large-scale linear SVMs. Our approach is based on an equivalent data augmentation formulation, which casts SVM learning as a Bayesian inference problem for which we can develop very efficient parallel sampling methods. We provide empirical results for this parallel sampling SVM, along with extensions to SVR and non-linear kernels and a parallel implementation of the Crammer and Singer model. This approach is promising in its own right, and it is also a useful technique for parallelizing a broader family of general maximum-margin models.

    Learning Random Fourier Features by Hybrid Constrained Optimization

    The kernel embedding algorithm is an important component for adapting kernel methods to large datasets. Since the algorithm accounts for a major share of the computation cost in the testing phase, we propose a novel teacher-learner framework for learning computation-efficient kernel embeddings from specific data. In this framework, the high-precision embeddings (teacher) transfer the data information to the computation-efficient kernel embeddings (learner). We jointly select informative embedding functions and pursue an orthogonal transformation between the two embeddings. We propose a novel approach, constrained variational expectation maximization (CVEM), in which the alternating direction method of multipliers (ADMM) is applied over a nonconvex domain in the maximization step. We also propose two specific formulations based on the prevalent Random Fourier Feature (RFF), the masked and blocked versions of Computation-Efficient RFF (CERF), obtained by imposing a random binary mask or a block structure on the transformation matrix. In empirical studies of several applications on different real-world datasets, we demonstrate that CERF significantly improves the performance of kernel methods over RFF under certain arithmetic operation requirements, and that it is suitable for structured matrix multiplication in Fastfood-type algorithms.
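    For readers unfamiliar with the Random Fourier Feature baseline that the abstract above builds on, here is a minimal sketch of plain RFF for a Gaussian kernel. It is not the CERF method and not the authors' code; the function names `rff_features` and `rbf_kernel`, the kernel parameterization exp(-gamma * ||x - y||^2), and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def rff_features(X, n_features, gamma, rng):
    """Map X (n, d) to random Fourier features whose inner products
    approximate the Gaussian kernel exp(-gamma * ||x - y||^2)."""
    d = X.shape[1]
    # Frequencies are drawn from the kernel's spectral density, N(0, 2*gamma*I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    # Random phases make the cosine features unbiased for the kernel.
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def rbf_kernel(X, gamma):
    """Exact Gaussian kernel matrix, used here only as a reference."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))      # synthetic data (assumption)
gamma = 0.1                       # kernel width (assumption)

Z = rff_features(X, 4000, gamma, rng)
K_exact = rbf_kernel(X, gamma)
K_approx = Z @ Z.T                # explicit features replace the kernel matrix
max_err = np.max(np.abs(K_exact - K_approx))
print(f"max abs kernel approximation error: {max_err:.4f}")
```

    The approximation error shrinks roughly as one over the square root of the number of features, which is why the cost of computing the embedding at test time becomes the concern the abstract describes.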

    Not-So-Random Features

    We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random features-based methods. Comment: Published as a conference paper at ICLR 201

    An Efficient Primal-Dual Prox Method for Non-Smooth Optimization

    We study non-smooth optimization problems in machine learning, where both the loss function and the regularizer are non-smooth functions. Previous studies on efficient empirical loss minimization assume either a smooth loss function or a strongly convex regularizer, making them unsuitable for non-smooth optimization. We develop a simple yet efficient method for a family of non-smooth optimization problems where the dual form of the loss function is bilinear in primal and dual variables. We cast a non-smooth optimization problem into a minimax optimization problem, and develop a primal-dual prox method that solves the minimax optimization problem at a rate of $O(1/T)$ (assuming that the proximal step can be efficiently solved), significantly faster than a standard subgradient descent method, which has an $O(1/\sqrt{T})$ convergence rate. Our empirical study verifies the efficiency of the proposed method for various non-smooth optimization problems that arise ubiquitously in machine learning by comparing it to state-of-the-art first-order methods.
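    To make the gap between the two rates quoted above concrete, the following short calculation compares the iteration counts needed for a target accuracy. The constants $C_1$, $C_2$ and the target $\epsilon$ are illustrative assumptions, not values from the paper.

```latex
% Iterations T needed to reach accuracy \epsilon under each rate
% (C_1, C_2 are unspecified problem-dependent constants).
\begin{align*}
  \frac{C_1}{T} \le \epsilon
    &\iff T \ge \frac{C_1}{\epsilon}
    && \text{(primal-dual prox rate, } O(1/T)\text{)} \\
  \frac{C_2}{\sqrt{T}} \le \epsilon
    &\iff T \ge \frac{C_2^2}{\epsilon^2}
    && \text{(subgradient rate, } O(1/\sqrt{T})\text{)}
\end{align*}
% Example: with C_1 = C_2 = 1 and \epsilon = 10^{-3}, the first bound
% needs about 10^3 iterations and the second about 10^6.
```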

    Doubly Stochastic Primal-Dual Coordinate Method for Bilinear Saddle-Point Problem

    We propose a doubly stochastic primal-dual coordinate optimization algorithm for empirical risk minimization, which can be formulated as a bilinear saddle-point problem. In each iteration, our method randomly samples a block of coordinates of the primal and dual solutions to update. The linear convergence of our method can be established in terms of 1) the distance from the current iterate to the optimal solution and 2) the primal-dual objective gap. We show that the proposed method has a lower overall complexity than existing coordinate methods when either the data matrix has a factorized structure or the proximal mapping on each block is computationally expensive, e.g., involving an eigenvalue decomposition. The efficiency of the proposed method is confirmed by empirical studies on several real applications, such as the multi-task large margin nearest neighbor problem.
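    As background for the abstract above, one standard way to write regularized empirical risk minimization as a bilinear saddle-point problem is sketched below. The notation ($a_i$ for data rows, $A$ for the data matrix, $\phi_i$ for convex losses, $g$ for the regularizer) is assumed here for illustration and may differ from the paper's own.

```latex
% ERM with closed convex losses \phi_i and regularizer g; A has rows a_i^\top.
\begin{align*}
  \min_{x \in \mathbb{R}^d} \;
    \frac{1}{n} \sum_{i=1}^{n} \phi_i(a_i^\top x) + g(x)
  \;=\;
  \min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^n} \;
    g(x) + \frac{1}{n}\, y^\top A x
    - \frac{1}{n} \sum_{i=1}^{n} \phi_i^*(y_i),
\end{align*}
% where \phi_i^*(y_i) = \sup_{u} \{ y_i u - \phi_i(u) \} is the convex conjugate.
% The only term coupling x and y is y^\top A x, which is bilinear.
```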

    Learning Data-adaptive Nonparametric Kernels

    Traditional kernels or their combinations are often not sufficiently flexible to fit the data in complicated practical tasks. In this paper, we present a Data-Adaptive Nonparametric Kernel (DANK) learning framework that imposes an adaptive matrix on the kernel/Gram matrix in an entry-wise manner. Since we do not specify the form of the adaptive matrix, each of its entries can be directly and flexibly learned from the data. The solution space of the learned kernel is therefore greatly expanded, which makes DANK flexible enough to adapt to the data. Specifically, the proposed kernel learning framework can be seamlessly embedded into support vector machines (SVM) and support vector regression (SVR), where it can enlarge the margin between classes and reduce the model generalization error. Theoretically, we demonstrate that the objective function of our model is gradient-Lipschitz continuous. Thus, the training process for kernel and parameter learning in SVM/SVR can be efficiently optimized in a unified framework. Further, to address the scalability issue in DANK, we develop a decomposition-based scalable approach, whose effectiveness is demonstrated by both empirical studies and theoretical guarantees. Experimentally, our method outperforms other representative kernel learning algorithms on various classification and regression benchmark datasets.

    MaxiMin Active Learning in Overparameterized Model Classes

    Generating labeled training datasets has become a major bottleneck in Machine Learning (ML) pipelines. Active ML aims to address this issue by designing learning algorithms that automatically and adaptively select the most informative examples for labeling so that human time is not wasted labeling irrelevant, redundant, or trivial examples. This paper proposes a new approach to active ML with nonparametric or overparameterized models such as kernel methods and neural networks. In the context of binary classification, the new approach is shown to possess a variety of desirable properties that allow active learning algorithms to automatically and efficiently identify decision boundaries and data clusters. Comment: 43 pages, 12 figures

    Towards Ultrahigh Dimensional Feature Selection for Big Data

    In this paper, we present a new adaptive feature scaling scheme for ultrahigh-dimensional feature selection on Big Data. To solve this problem effectively, we first reformulate it as a convex semi-infinite programming (SIP) problem and then propose an efficient feature generating paradigm. In contrast with traditional gradient-based approaches that conduct optimization on all input features, the proposed method iteratively activates a group of features and solves a sequence of multiple kernel learning (MKL) subproblems of much reduced scale. To further speed up the training, we propose to solve the MKL subproblems in their primal forms through a modified accelerated proximal gradient approach. Some efficient cache techniques are also developed for this optimization scheme. The feature generating paradigm guarantees that the solution converges globally under mild conditions and achieves lower feature selection bias. Moreover, the proposed method can tackle two challenging tasks in feature selection: 1) group-based feature selection with complex structures and 2) nonlinear feature selection with explicit feature mappings. Comprehensive experiments on a wide range of synthetic and real-world datasets containing tens of millions of data points with $O(10^{14})$ features demonstrate the competitive performance of the proposed method over state-of-the-art feature selection methods in terms of generalization performance and training efficiency. Comment: 61 pages

    A Survey on Learning to Hash

    Nearest neighbor search is the problem of finding the data points in a database whose distances to the query point are the smallest. Learning to hash is one of the major solutions to this problem and has been widely studied recently. In this paper, we present a comprehensive survey of learning to hash algorithms, categorize them by the way they preserve similarities (pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, and quantization), and discuss their relations. We separate quantization from pairwise similarity preserving because the objective function is very different, although quantization, as we show, can be derived from preserving the pairwise similarities. In addition, we present the evaluation protocols and a general performance analysis, and point out that the quantization algorithms achieve superior performance in terms of search accuracy, search time cost, and space cost. Finally, we introduce a few emerging topics. Comment: To appear in IEEE Transactions On Pattern Analysis and Machine Intelligence (TPAMI)

    A Convex Relaxation for Weakly Supervised Classifiers

    This paper introduces a general multi-class approach to weakly supervised classification. Inferring the labels and learning the parameters of the model is usually done jointly through a block-coordinate descent algorithm such as expectation-maximization (EM), which may lead to local minima. To avoid this problem, we propose a cost function based on a convex relaxation of the soft-max loss. We then propose an algorithm specifically designed to efficiently solve the corresponding semidefinite program (SDP). Empirically, our method compares favorably to standard ones on different datasets for multiple instance learning and semi-supervised learning as well as on clustering tasks. Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)