4,385 research outputs found

    Boosting with Structural Sparsity: A Differential Inclusion Approach

    Boosting, viewed as a family of gradient descent algorithms, is a popular method in machine learning. In this paper a novel Boosting-type algorithm is proposed based on restricted gradient descent with structural sparsity control, whose underlying dynamics are governed by differential inclusions. In particular, we present an iterative regularization path with structural sparsity, where the parameter is sparse under some linear transform, based on variable splitting and the Linearized Bregman Iteration; hence it is called \emph{Split LBI}. Despite its simplicity, Split LBI outperforms the popular generalized Lasso in both theory and experiments. A theory of path consistency is presented showing that, equipped with proper early stopping, Split LBI may achieve model selection consistency under a family of Irrepresentable Conditions which can be weaker than the necessary and sufficient condition for the generalized Lasso. Furthermore, $\ell_2$ error bounds are also given at the minimax optimal rates. The utility and benefit of the algorithm are illustrated by several applications, including image denoising, partial order ranking of sports teams, and world university grouping with crowdsourced ranking data.
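    As a rough illustration of the iterations described in this abstract, here is a minimal NumPy sketch of Split LBI for a squared loss and a linear sparsifying transform D; the step sizes alpha and kappa and the splitting parameter nu are illustrative defaults of this sketch, not the paper's tuned choices.

```python
import numpy as np

def soft_threshold(z, t=1.0):
    """Component-wise soft-thresholding (shrinkage) operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def split_lbi(X, y, D, nu=1.0, kappa=10.0, alpha=None, n_iter=500):
    """Sketch of Split LBI: beta stays dense, gamma (approximating D @ beta) is driven to be sparse.

    Loss: (1/2n)||y - X beta||^2 + (1/2 nu)||D beta - gamma||^2,
    with Linearized Bregman Iteration dynamics on gamma.
    """
    n, p = X.shape
    m = D.shape[0]
    beta, gamma, z = np.zeros(p), np.zeros(m), np.zeros(m)
    if alpha is None:
        # a conservative default step size; in practice it is tied to kappa, nu and the data scale
        alpha = nu / (kappa * (1.0 + np.linalg.norm(X, 2) ** 2 * nu / n))
    path = []
    for _ in range(n_iter):
        grad_beta = X.T @ (X @ beta - y) / n + D.T @ (D @ beta - gamma) / nu
        grad_gamma = (gamma - D @ beta) / nu
        beta = beta - kappa * alpha * grad_beta
        z = z - alpha * grad_gamma
        gamma = kappa * soft_threshold(z, 1.0)   # sparse iterate along the path
        path.append((beta.copy(), gamma.copy()))
    return path   # iterative regularization path; early stopping selects a point on it
```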

    A Direct Approach to Multi-class Boosting and Extensions

    Boosting methods combine a set of moderately accurate weak learners to form a highly accurate predictor. Despite the practical importance of multi-class boosting, it has received far less attention than its binary counterpart. In this work, we propose a fully-corrective multi-class boosting formulation which directly solves the multi-class problem without dividing it into multiple binary classification problems. In contrast, most previous multi-class boosting algorithms decompose the multi-class problem into multiple binary boosting problems. By explicitly deriving the Lagrange dual of the primal optimization problem, we are able to construct a column generation-based fully-corrective approach to boosting which directly optimizes multi-class classification performance. The new approach not only updates all weak learners' coefficients at every iteration, but does so in a manner flexible enough to accommodate various loss functions and regularizations. For example, it enables us to introduce structural sparsity through mixed-norm regularization to promote group sparsity and feature sharing. Boosting with shared features is particularly beneficial in complex prediction problems where features can be expensive to compute. Our experiments on various data sets demonstrate that our direct multi-class boosting generalizes as well as, or better than, a range of competing multi-class boosting methods. The end result is a highly effective and compact ensemble classifier which can be trained in a distributed fashion. Comment: 34 pages
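    The column-generation structure described here can be sketched in a simplified binary setting: each round fits a weak learner against the current sample weights (the most violated dual constraint) and then re-optimizes all coefficients (the fully-corrective step). The sketch below uses a logistic loss, an L2 penalty, and scikit-learn decision stumps as stand-ins; the paper's formulation is multi-class with mixed-norm regularization, which this toy version does not reproduce.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.tree import DecisionTreeClassifier

def column_generation_boost(X, y, n_rounds=20, reg=1e-2):
    """Simplified fully-corrective boosting via column generation (binary labels in {-1, +1})."""
    n = len(y)
    learners, H = [], np.empty((n, 0))   # H holds weak-learner outputs as columns
    w = np.full(n, 1.0 / n)              # sample weights, playing the role of dual variables

    def margins(coef):
        return y * (H @ coef)

    coef = np.zeros(0)
    for _ in range(n_rounds):
        # 1) column generation: weak learner maximizing the weighted edge
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        h = stump.predict(X).astype(float)
        learners.append(stump)
        H = np.column_stack([H, h])

        # 2) fully-corrective step: re-fit every coefficient jointly
        def objective(coef):
            return np.mean(np.logaddexp(0.0, -margins(coef))) + reg * np.sum(coef ** 2)

        coef = minimize(objective, np.zeros(H.shape[1]), method="L-BFGS-B").x

        # 3) refresh sample weights from the loss gradient (hard examples count more)
        w = 1.0 / (1.0 + np.exp(margins(coef)))
        w = w / w.sum()

    return learners, coef
```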

    $S^2$-LBI: Stochastic Split Linearized Bregman Iterations for Parsimonious Deep Learning

    This paper proposes a novel Stochastic Split Linearized Bregman Iteration ($S^2$-LBI) algorithm to efficiently train deep networks. The $S^2$-LBI introduces an iterative regularization path with structural sparsity. Our $S^2$-LBI combines the computational efficiency of the LBI with model selection consistency in learning the structural sparsity. The computed solution path intrinsically enables us to enlarge or simplify a network, which theoretically benefits from the dynamical properties of our $S^2$-LBI algorithm. The experimental results validate our $S^2$-LBI on the MNIST and CIFAR-10 datasets. For example, on MNIST we can either boost a network with only 1.5K parameters (1 convolutional layer of 5 filters, and 1 FC layer) to achieve 98.40\% recognition accuracy, or simplify away 82.5\% of the parameters in the LeNet-5 network while still achieving 98.47\% recognition accuracy. In addition, we also have learning results on ImageNet, which will be added in the next version of our report. Comment: technical report
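    One side of the "enlarge or simplify a network" claim can be illustrated with a toy pruning helper that keeps only the weights whose sparse path variable gamma is active; the keep_ratio heuristic and the one-to-one weight/gamma correspondence are assumptions of this sketch, not the paper's procedure.

```python
import numpy as np

def simplify_by_support(weights, gamma, keep_ratio=None):
    """Prune a weight array using the sparse path variable gamma (same number of entries):
    entries whose gamma is (near) zero are removed from the model.
    If keep_ratio is given, keep that fraction of largest-|gamma| entries instead."""
    g = np.abs(gamma).ravel()
    if keep_ratio is None:
        mask = (g > 0).astype(float)
    else:
        k = max(1, int(round(keep_ratio * g.size)))
        thresh = np.partition(g, -k)[-k]
        mask = (g >= thresh).astype(float)
    pruned = weights * mask.reshape(weights.shape)
    return pruned, mask.mean()   # pruned weights and the fraction of parameters kept
```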

    Parsimonious Deep Learning: A Differential Inclusion Approach with Global Convergence

    Over-parameterization is ubiquitous nowadays in training neural networks, benefiting both optimization, in seeking global optima, and generalization, in reducing prediction error. However, compressive networks are desired in many real-world applications, and direct training of small networks may be trapped in local optima. In this paper, instead of pruning or distilling an over-parameterized model to a compressive one, we propose a parsimonious learning approach based on differential inclusions of inverse scale spaces, which generates a family of models from simple to complex ones with better efficiency and interpretability than stochastic gradient descent in exploring the model space. It enjoys a simple discretization, the Split Linearized Bregman Iterations, with provable global convergence: from any initialization, the algorithmic iterations converge to a critical point of the empirical risk. One may exploit the proposed method to boost the complexity of neural networks progressively. Numerical experiments with MNIST, CIFAR-10/100, and ImageNet are conducted to show that the method is promising in training large-scale models with favorable interpretability. Comment: 25 pages, 7 figures

    CAM: Causal additive models, high-dimensional order search and penalized regression

    We develop estimation for potentially high-dimensional additive structural equation models. A key component of our approach is to decouple order search among the variables from feature or edge selection in a directed acyclic graph encoding the causal structure. We show that the former can be done with nonregularized (restricted) maximum likelihood estimation, while the latter can be efficiently addressed using sparse regression techniques. Thus, we substantially simplify the problem of structure search and estimation for an important class of causal models. We establish consistency of the (restricted) maximum likelihood estimator for low- and high-dimensional scenarios, and we also allow for misspecification of the error distribution. Furthermore, we develop an efficient computational algorithm which can deal with many variables, and the new method's accuracy and performance are illustrated on simulated and real data. Comment: Published at http://dx.doi.org/10.1214/14-AOS1260 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
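    A much-simplified, linear stand-in for the two decoupled stages described here: a greedy (nonregularized) order search followed by sparse regression for edge selection among each node's predecessors. CAM itself uses additive nonparametric regressions and a different greedy scheme, so this sketch is illustrative only.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def greedy_order_search(X):
    """Greedy causal order: repeatedly append the variable whose residual variance,
    after unregularized regression on the variables already ordered, is smallest."""
    n, p = X.shape
    order, remaining = [], list(range(p))
    while remaining:
        best_j, best_score = None, np.inf
        for j in remaining:
            if order:
                resid = X[:, j] - LinearRegression().fit(X[:, order], X[:, j]).predict(X[:, order])
            else:
                resid = X[:, j] - X[:, j].mean()
            score = np.log(resid.var() + 1e-12)   # Gaussian log-likelihood contribution
            if score < best_score:
                best_j, best_score = j, score
        order.append(best_j)
        remaining.remove(best_j)
    return order

def sparse_edge_selection(X, order):
    """Given the order, select parents of each node by sparse regression on its predecessors."""
    parents = {order[0]: []}
    for k, j in enumerate(order[1:], start=1):
        preds = order[:k]
        coef = LassoCV(cv=5).fit(X[:, preds], X[:, j]).coef_
        parents[j] = [preds[i] for i in np.flatnonzero(np.abs(coef) > 1e-8)]
    return parents
```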

    Totally Corrective Boosting with Cardinality Penalization

    We propose a totally corrective boosting algorithm with explicit cardinality regularization. The resulting combinatorial optimization problems are not known to be efficiently solvable with existing classical methods, but emerging quantum optimization technology gives hope for achieving sparser models in practice. In order to demonstrate the utility of our algorithm, we use a distributed classical heuristic optimizer as a stand-in for quantum hardware. Even though this evaluation methodology incurs large time and resource costs on classical computing machinery, it allows us to gauge the potential gains in generalization performance and sparsity of the resulting boosted ensembles. Our experimental results on public data sets commonly used for benchmarking of boosting algorithms decidedly demonstrate the existence of such advantages. If actual quantum optimization were to be used with this algorithm in the future, we would expect equivalent or superior results at much smaller time and energy costs during training. Moreover, studying cardinality-penalized boosting also sheds light on why unregularized boosting algorithms with early stopping often yield better results than their counterparts with explicit convex regularization: early stopping performs suboptimal cardinality regularization. The results that we present here indicate that it is beneficial to explicitly solve the combinatorial problem still left open at early termination.
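    To make the cardinality penalty concrete, a brute-force toy version of the underlying combinatorial problem can be written down: choose the subset of weak learners minimizing a fit term plus lambda times the subset size. The paper targets a quantum or heuristic optimizer and weighted ensembles; this exhaustive sketch only illustrates the objective, not the proposed solver.

```python
import numpy as np
from itertools import combinations

def cardinality_penalized_ensemble(H, y, lam=0.05, max_size=4):
    """Brute-force stand-in for the combinatorial solver.

    H: (n, T) array of weak-learner outputs in {-1, +1}; y: labels in {-1, +1}.
    Objective: squared loss of the unweighted vote + lam * (number of learners used).
    Exhaustive over subsets up to max_size, which only scales to toy problems.
    """
    n, T = H.shape
    best_subset, best_obj = (), np.inf
    for k in range(1, max_size + 1):
        for subset in combinations(range(T), k):
            vote = H[:, list(subset)].mean(axis=1)
            obj = np.mean((y - vote) ** 2) + lam * k   # fit term + cardinality penalty
            if obj < best_obj:
                best_subset, best_obj = subset, obj
    return list(best_subset), best_obj
```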

    A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives

    In this paper we analyze boosting algorithms in linear regression from a new perspective: that of modern first-order methods in convex optimization. We show that classic boosting algorithms in linear regression, namely the incremental forward stagewise algorithm (FS$_\varepsilon$) and least squares boosting (LS-Boost($\varepsilon$)), can be viewed as subgradient descent to minimize the loss function defined as the maximum absolute correlation between the features and residuals. We also propose a modification of FS$_\varepsilon$ that yields an algorithm for the Lasso, and that may be easily extended to an algorithm that computes the Lasso path for different values of the regularization parameter. Furthermore, we show that these new algorithms for the Lasso may also be interpreted as the same master algorithm (subgradient descent), applied to a regularized version of the maximum absolute correlation loss function. We derive novel, comprehensive computational guarantees for several boosting algorithms in linear regression (including LS-Boost($\varepsilon$) and FS$_\varepsilon$) by using techniques of modern first-order methods in convex optimization. Our computational guarantees inform us about the statistical properties of boosting algorithms. In particular they provide, for the first time, a precise theoretical description of the amount of data-fidelity and regularization imparted by running a boosting algorithm with a prespecified learning rate for a fixed but arbitrary number of iterations, for any dataset.
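    The two classic algorithms named here are simple enough to sketch directly: both pick the feature most correlated with the current residual, with FS$_\varepsilon$ taking a fixed-size signed step and LS-Boost($\varepsilon$) taking a shrunken least-squares step. The sketch below assumes standardized feature columns with unit $\ell_2$ norm.

```python
import numpy as np

def boosting_linear_regression(X, y, eps=0.01, n_iter=1000, variant="FS"):
    """Incremental forward stagewise (FS_eps) and LS-Boost(eps) on unit-norm features.

    At each step, pick the feature most correlated (in absolute value) with the
    current residual; FS_eps moves its coefficient by eps * sign(correlation),
    LS-Boost(eps) moves it by eps times the univariate least-squares coefficient.
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                       # current residual
    for _ in range(n_iter):
        corr = X.T @ r                 # correlations, assuming unit-norm columns
        j = np.argmax(np.abs(corr))
        if variant == "FS":
            step = eps * np.sign(corr[j])
        else:                          # "LS-Boost"
            step = eps * corr[j]       # univariate LS coefficient for a unit-norm column
        beta[j] += step
        r -= step * X[:, j]
    return beta
```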

    Boosted Sparse and Low-Rank Tensor Regression

    We propose a sparse and low-rank tensor regression model to relate a univariate outcome to a feature tensor, in which each unit-rank tensor from the CP decomposition of the coefficient tensor is assumed to be sparse. This structure is both parsimonious and highly interpretable, as it implies that the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions. We take a divide-and-conquer strategy to simplify the task into a set of sparse unit-rank tensor regression problems. To make the computation efficient and scalable, we propose, for the unit-rank tensor regression, a stagewise estimation procedure that efficiently traces out its entire solution path. We show that as the step size goes to zero, the stagewise solution paths converge exactly to those of the corresponding regularized regression. The superior performance of our approach is demonstrated on various real-world and synthetic examples. Comment: 10 pages, 5 figures, NIPS 201
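    The divide-and-conquer unit-rank idea can be illustrated, for order-2 feature tensors (matrices), with an alternating Lasso over the two sparse factors; the paper instead traces a stagewise path for each unit-rank problem, so this is a rough stand-in rather than the proposed procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_unit_rank_regression(Xs, y, p, q, lam=0.05, n_alt=10):
    """Toy alternating-Lasso fit of y_i ~ <a b^T, X_i> with sparse factors a, b.

    Xs: array of shape (n, p, q) holding matrix-valued features (order-2 tensors).
    """
    n = len(y)
    a, b = np.ones(p) / np.sqrt(p), np.ones(q) / np.sqrt(q)
    for _ in range(n_alt):
        # fix b: y_i ~ a^T (X_i b)  -> Lasso in a
        Za = np.stack([Xs[i] @ b for i in range(n)])          # shape (n, p)
        a = Lasso(alpha=lam, fit_intercept=False).fit(Za, y).coef_
        if not np.any(a):
            break
        # fix a: y_i ~ b^T (X_i^T a) -> Lasso in b
        Zb = np.stack([Xs[i].T @ a for i in range(n)])        # shape (n, q)
        b = Lasso(alpha=lam, fit_intercept=False).fit(Zb, y).coef_
        if not np.any(b):
            break
    return a, b   # sparse factors of one unit-rank term
```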

    On the benefits of output sparsity for multi-label classification

    The multi-label classification framework, where each observation can be associated with a set of labels, has generated a tremendous amount of attention over recent years. Modern multi-label problems are typically large-scale in terms of the number of observations, features, and labels, and the number of labels can even be comparable with the number of observations. In this context, different remedies have been proposed to overcome the curse of dimensionality. In this work, we aim to exploit output sparsity by introducing a new loss, called the sparse weighted Hamming loss. This proposed loss can be seen as a weighted version of classical ones, where active and inactive labels are weighted separately. Leveraging the influence of sparsity in the loss function, we provide improved generalization bounds for the empirical risk minimizer, a suitable property for large-scale problems. For this new loss, we derive rates of convergence linear in the underlying output sparsity rather than linear in the number of labels. In practice, minimizing the associated risk can be performed efficiently by using convex surrogates and modern convex optimization algorithms. We provide experiments on various real-world datasets demonstrating the pertinence of our approach when compared to non-weighted techniques.
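    A minimal sketch of a weighted Hamming loss that penalizes errors on active and inactive labels separately, in the spirit of the loss described above; the specific weight values are illustrative, not the paper's calibration.

```python
import numpy as np

def weighted_hamming_loss(y_true, y_pred, w_active=1.0, w_inactive=0.05):
    """Weighted Hamming loss with separate weights for the two error types.

    y_true, y_pred: binary arrays of shape (n, L), one column per label.
    Errors on active labels (true 1 predicted 0) and on inactive labels
    (true 0 predicted 1) are weighted by w_active and w_inactive respectively.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    miss_active = (y_true == 1) & (y_pred == 0)
    false_active = (y_true == 0) & (y_pred == 1)
    per_label = w_active * miss_active + w_inactive * false_active
    return per_label.mean()
```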

    Complexities of convex combinations and bounding the generalization error in classification

    We introduce and study several measures of complexity of functions from the convex hull of a given base class. These complexity measures take into account the sparsity of the weights of a convex combination as well as certain clustering properties of the base functions involved in it. We prove new upper confidence bounds on the generalization error of ensemble (voting) classification algorithms that utilize the new complexity measures along with the empirical distributions of classification margins, providing a better explanation of generalization performance of large margin classification methods. Comment: Published at http://dx.doi.org/10.1214/009053605000000228 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
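    The empirical margin distribution these bounds are built on is straightforward to compute for a voting classifier; the sketch below assumes base-classifier outputs in {-1, +1} and convex-combination weights.

```python
import numpy as np

def margin_distribution(F, y, weights=None):
    """Empirical classification margins of a convex combination (voting) classifier.

    F: (n, T) array of base-classifier outputs in {-1, +1}; y: labels in {-1, +1};
    weights: convex-combination weights over the T base classifiers (uniform if None).
    The margin of example i is y_i * sum_t w_t F[i, t].
    """
    F, y = np.asarray(F, float), np.asarray(y, float)
    T = F.shape[1]
    w = np.full(T, 1.0 / T) if weights is None else np.asarray(weights, float)
    w = w / w.sum()                       # enforce the convex-hull constraint
    margins = y * (F @ w)
    return np.sort(margins)               # np.mean(margins <= theta) estimates P(margin <= theta)
```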