4,385 research outputs found
Boosting with Structural Sparsity: A Differential Inclusion Approach
Boosting, viewed as a gradient descent algorithm, is a popular method in machine
learning. In this paper, a novel boosting-type algorithm is proposed based on
restricted gradient descent with structural sparsity control whose underlying
dynamics are governed by differential inclusions. In particular, we present an
iterative regularization path with structural sparsity where the parameter is
sparse under some linear transforms, based on variable splitting and the
Linearized Bregman Iteration. Hence it is called \emph{Split LBI}. Despite its
simplicity, Split LBI outperforms the popular generalized Lasso in both theory
and experiments. A theory of path consistency is presented, showing that, equipped with
proper early stopping, Split LBI may achieve model selection consistency under
a family of Irrepresentable Conditions that can be weaker than the necessary
and sufficient condition for the generalized Lasso. Furthermore, some
error bounds are also given at the minimax optimal rates. The utility and
benefit of the algorithm are illustrated by several applications including
image denoising, partial order ranking of sport teams, and world university
grouping with crowdsourced ranking data.
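To make the dynamics concrete, here is a minimal Python sketch of the Split LBI iterations for a squared-error loss. The splitting term (1/2nu)||D b - g||^2, the step size, the damping parameter, and the iteration count are illustrative placeholders rather than the paper's tuned choices; in practice the returned path would be cut by early stopping.

```python
import numpy as np

def soft_threshold(z, lam=1.0):
    """Elementwise soft-thresholding S_lam(z) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def split_lbi(X, y, D, nu=1.0, alpha=1e-2, kappa=10.0, n_iter=2000):
    """Sketch of Split LBI for (1/2n)||y - X b||^2 + (1/2nu)||D b - g||^2.

    The dense parameter b follows gradient descent, while the augmented
    variable g (which carries the structural sparsity of D b) follows a
    Linearized Bregman Iteration: a gradient step on its dual z followed
    by shrinkage. The sequence of iterates forms the regularization path.
    """
    n, p = X.shape
    m = D.shape[0]
    b, g, z = np.zeros(p), np.zeros(m), np.zeros(m)
    path = []
    for _ in range(n_iter):
        grad_b = X.T @ (X @ b - y) / n + D.T @ (D @ b - g) / nu
        grad_g = (g - D @ b) / nu
        b = b - kappa * alpha * grad_b      # dense variable: plain gradient step
        z = z - alpha * grad_g              # sparse variable: Bregman (dual) step
        g = kappa * soft_threshold(z)       # shrinkage yields structural sparsity
        path.append((b.copy(), g.copy()))
    return path
```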
A Direct Approach to Multi-class Boosting and Extensions
Boosting methods combine a set of moderately accurate weak learners to form a
highly accurate predictor. Despite the practical importance of multi-class
boosting, it has received far less attention than its binary counterpart. In
this work, we propose a fully-corrective multi-class boosting formulation which
directly solves the multi-class problem without dividing it into multiple
binary classification problems. In contrast, most previous multi-class boosting
algorithms decompose the multi-class problem into multiple binary boosting
problems. By explicitly deriving the Lagrange dual of the primal optimization
problem, we are able to construct a column generation-based fully-corrective
approach to boosting which directly optimizes multi-class classification
performance. The new approach not only updates all weak learners' coefficients
at every iteration, but does so in a manner flexible enough to accommodate
various loss functions and regularizations. For example, it enables us to
introduce structural sparsity through mixed-norm regularization to promote
group sparsity and feature sharing. Boosting with shared features is
particularly beneficial in complex prediction problems where features can be
expensive to compute. Our experiments on various data sets demonstrate that our
direct multi-class boosting generalizes as well as, or better than, a range of
competing multi-class boosting methods. The end result is a highly effective
and compact ensemble classifier which can be trained in a distributed fashion. Comment: 34 pages
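As a rough, hedged illustration of the fully-corrective idea (not the paper's column-generation dual solver), the sketch below adds one weak learner per round and then jointly re-fits the coefficients of all weak learners collected so far; the decision-stump learners, the multinomial logistic re-fit, and the crude example re-weighting are stand-in choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def fully_corrective_multiclass_boost(X, y, n_rounds=20, max_depth=2):
    """Simplified fully-corrective multi-class boosting loop (illustrative).

    Each round generates a new "column" (weak learner) on re-weighted data,
    then re-optimizes the coefficients of ALL weak learners jointly, rather
    than fixing past coefficients as stagewise boosting would.
    """
    n = len(y)
    weights = np.full(n, 1.0 / n)
    learners, scores, head = [], np.empty((n, 0)), None
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=max_depth)
        stump.fit(X, y, sample_weight=weights)        # new column from hard examples
        learners.append(stump)
        scores = np.column_stack([scores, stump.predict_proba(X)])
        head = LogisticRegression(max_iter=1000)      # fully corrective step:
        head.fit(scores, y)                           # joint re-fit of all coefficients
        wrong = head.predict(scores) != y             # emphasize misclassified examples
        weights = np.where(wrong, 2.0, 1.0)
        weights /= weights.sum()
    return learners, head
```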
S$^2$-LBI: Stochastic Split Linearized Bregman Iterations for Parsimonious Deep Learning
This paper proposes a novel Stochastic Split Linearized Bregman Iteration
(S$^2$-LBI) algorithm to efficiently train deep networks. S$^2$-LBI
introduces an iterative regularization path with structural sparsity. Our
S$^2$-LBI combines the computational efficiency of the LBI with model
selection consistency in learning the structural sparsity. The computed
solution path intrinsically enables us to enlarge or simplify a network,
which, theoretically, benefits from the dynamical properties of our S$^2$-LBI
algorithm. The experimental results validate our S$^2$-LBI on the MNIST and
CIFAR-10 datasets. For example, on MNIST, we can either boost a network with
only 1.5K parameters (1 convolutional layer of 5 filters and 1 FC layer) to
achieve 98.40\% recognition accuracy, or we can simplify a large fraction of
the parameters in the LeNet-5 network and still achieve 98.47\% recognition
accuracy. In addition, we also have learning results on ImageNet, which will
be added in the next version of our report. Comment: technical report
Parsimonious Deep Learning: A Differential Inclusion Approach with Global Convergence
Over-parameterization is ubiquitous nowadays in training neural networks to
benefit both optimization in seeking global optima and generalization in
reducing prediction error. However, compressive networks are desired in many
real world applications and direct training of small networks may be trapped in
local optima. In this paper, instead of pruning or distilling an
over-parameterized model to compressive ones, we propose a parsimonious
learning approach based on differential inclusions of inverse scale spaces,
that generates a family of models from simple to complex ones with a better
efficiency and interpretability than stochastic gradient descent in exploring
the model space. It enjoys a simple discretization, the Split Linearized
Bregman Iterations, with provable global convergence: from any
initialization, the algorithmic iterates converge to a critical point of the
empirical risk. One may exploit the proposed method to boost the complexity of
neural networks progressively. Numerical experiments with MNIST, CIFAR-10/100,
and ImageNet are conducted to show that the method is promising in training
large-scale models with favorable interpretability. Comment: 25 pages, 7 figures
CAM: Causal additive models, high-dimensional order search and penalized regression
We develop estimation for potentially high-dimensional additive structural
equation models. A key component of our approach is to decouple order search
among the variables from feature or edge selection in a directed acyclic graph
encoding the causal structure. We show that the former can be done with
nonregularized (restricted) maximum likelihood estimation while the latter can
be efficiently addressed using sparse regression techniques. Thus, we
substantially simplify the problem of structure search and estimation for an
important class of causal models. We establish consistency of the (restricted)
maximum likelihood estimator for low- and high-dimensional scenarios, and we
also allow for misspecification of the error distribution. Furthermore, we
develop an efficient computational algorithm which can deal with many
variables, and the new method's accuracy and performance are illustrated on
simulated and real data. Comment: Published at http://dx.doi.org/10.1214/14-AOS1260
in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
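A toy sketch of the two-stage strategy may help: linear least-squares fits stand in for the paper's additive (spline-based) regressions, a simple greedy heuristic stands in for its order search, and a Lasso stands in for the sparse edge-selection step; all of these are illustrative simplifications.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def cam_style_search(X):
    """Two-stage structure search in the spirit of CAM (linear toy version).

    Stage 1 (order search): greedily append the variable with the smallest
    residual variance given the variables already ordered, using plain,
    unregularized fits.
    Stage 2 (edge selection): for each node, run a sparse regression on its
    predecessors in the order to prune the fully-connected DAG.
    """
    n, p = X.shape
    order, remaining = [], list(range(p))
    while remaining:
        best_j, best_var = None, np.inf
        for j in remaining:
            if order:
                fit = LinearRegression().fit(X[:, order], X[:, j])
                resid = X[:, j] - fit.predict(X[:, order])
            else:
                resid = X[:, j] - X[:, j].mean()
            if resid.var() < best_var:
                best_j, best_var = j, resid.var()
        order.append(best_j)
        remaining.remove(best_j)
    parents = {order[0]: []}
    for k, j in enumerate(order[1:], start=1):
        preds = order[:k]
        lasso = LassoCV(cv=5).fit(X[:, preds], X[:, j])
        parents[j] = [preds[i] for i, c in enumerate(lasso.coef_) if abs(c) > 1e-8]
    return order, parents
```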
Totally Corrective Boosting with Cardinality Penalization
We propose a totally corrective boosting algorithm with explicit cardinality
regularization. The resulting combinatorial optimization problems are not known
to be efficiently solvable with existing classical methods, but emerging
quantum optimization technology gives hope for achieving sparser models in
practice. In order to demonstrate the utility of our algorithm, we use a
distributed classical heuristic optimizer as a stand-in for quantum hardware.
Even though this evaluation methodology incurs large time and resource costs on
classical computing machinery, it allows us to gauge the potential gains in
generalization performance and sparsity of the resulting boosted ensembles. Our
experimental results on public data sets commonly used for benchmarking of
boosting algorithms decidedly demonstrate the existence of such advantages. If
actual quantum optimization were to be used with this algorithm in the future,
we would expect equivalent or superior results at much smaller time and energy
costs during training. Moreover, studying cardinality-penalized boosting also
sheds light on why unregularized boosting algorithms with early stopping often
yield better results than their counterparts with explicit convex
regularization: Early stopping performs suboptimal cardinality regularization.
The results that we present here indicate that it is beneficial to explicitly
solve the combinatorial problem still left open at early termination.
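To make the combinatorial formulation concrete, one standard encoding is a QUBO over binary weak-learner weights; the squared-loss data term, the penalty weight lam, and the tiny brute-force stand-in solver below are illustrative assumptions, not necessarily the paper's exact objective or optimizer.

```python
import numpy as np

def cardinality_boost_qubo(H, y, lam=0.1):
    """Build a QUBO for cardinality-penalized ensemble selection.

    H: (n_samples, T) matrix of weak-learner outputs in {-1, +1}.
    y: (n_samples,) labels in {-1, +1}.
    Binary weights w in {0,1}^T minimize ||H @ w / T - y||^2 + lam * sum(w),
    which expands to w^T Q w (up to a constant) using w_i^2 = w_i.
    """
    n, T = H.shape
    Q = (H.T @ H) / T**2                     # pairwise weak-learner correlations
    linear = -2.0 * (H.T @ y) / T + lam      # data fit plus explicit cardinality penalty
    Q[np.diag_indices(T)] += linear          # fold linear terms into the diagonal
    return Q

def brute_force_qubo(Q):
    """Exhaustive stand-in for the heuristic/quantum optimizer (small T only)."""
    T = Q.shape[0]
    best_w, best_val = None, np.inf
    for mask in range(1 << T):
        w = np.array([(mask >> i) & 1 for i in range(T)], dtype=float)
        val = w @ Q @ w
        if val < best_val:
            best_w, best_val = w, val
    return best_w
```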
A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives
In this paper we analyze boosting algorithms in linear regression from a new
perspective: that of modern first-order methods in convex optimization. We show
that classic boosting algorithms in linear regression, namely the incremental
forward stagewise algorithm (FS$_\varepsilon$) and least squares boosting
(LS-Boost$(\varepsilon)$), can be viewed as subgradient descent to minimize the
loss function defined as the maximum absolute correlation between the features
and the residuals. We also propose a modification of FS$_\varepsilon$ that yields
an algorithm for the Lasso, and that may be easily extended to an algorithm
that computes the Lasso path for different values of the regularization
parameter. Furthermore, we show that these new algorithms for the Lasso may
also be interpreted as the same master algorithm (subgradient descent), applied
to a regularized version of the maximum absolute correlation loss function. We
derive novel, comprehensive computational guarantees for several boosting
algorithms in linear regression (including LS-Boost$(\varepsilon)$ and
FS$_\varepsilon$) by using techniques of modern first-order methods in convex
optimization. Our computational guarantees inform us about the statistical
properties of boosting algorithms. In particular they provide, for the first
time, a precise theoretical description of the amount of data-fidelity and
regularization imparted by running a boosting algorithm with a prespecified
learning rate for a fixed but arbitrary number of iterations, for any dataset.
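A minimal sketch of the two updates discussed above: a single flag switches between the LS-Boost(eps) step and the incremental forward stagewise FS_eps step. The step size and iteration budget are illustrative, and the columns are assumed comparably scaled so that selection by absolute correlation is meaningful.

```python
import numpy as np

def boost_linear_regression(X, y, eps=0.1, n_iter=500, stagewise=False):
    """LS-Boost(eps) (default) or FS_eps (stagewise=True) for linear regression.

    Each iteration picks the column most correlated with the current residual.
    LS-Boost moves that coefficient by eps times its univariate least-squares
    fit to the residual; FS_eps moves it by eps times the sign of the correlation.
    """
    n, p = X.shape
    beta, r = np.zeros(p), y.astype(float).copy()
    for _ in range(n_iter):
        corr = X.T @ r                                   # column-residual correlations
        j = np.argmax(np.abs(corr))                      # most correlated feature
        if stagewise:
            step = eps * np.sign(corr[j])                # FS_eps update
        else:
            step = eps * corr[j] / (X[:, j] @ X[:, j])   # LS-Boost(eps) update
        beta[j] += step
        r -= step * X[:, j]
    return beta
```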
Boosted Sparse and Low-Rank Tensor Regression
We propose a sparse and low-rank tensor regression model to relate a
univariate outcome to a feature tensor, in which each unit-rank tensor from the
CP decomposition of the coefficient tensor is assumed to be sparse. This
structure is both parsimonious and highly interpretable, as it implies that the
outcome is related to the features through a few distinct pathways, each of
which may only involve subsets of feature dimensions. We take a
divide-and-conquer strategy to simplify the task into a set of sparse unit-rank
tensor regression problems. To make the computation efficient and scalable, for
the unit-rank tensor regression, we propose a stagewise estimation procedure to
efficiently trace out its entire solution path. We show that as the step size
goes to zero, the stagewise solution paths converge exactly to those of the
corresponding regularized regression. The superior performance of our approach
is demonstrated on various real-world and synthetic examples. Comment: 10 pages, 5 figures, NIPS 2018
On the benefits of output sparsity for multi-label classification
The multi-label classification framework, where each observation can be
associated with a set of labels, has generated a tremendous amount of attention
over recent years. Modern multi-label problems are typically large-scale in
terms of the number of observations, features, and labels, and the number of
labels can even be comparable to the number of observations. In this context,
different remedies have been proposed to overcome the curse of dimensionality.
In this work, we aim at exploiting the output sparsity by introducing a new
loss, called the sparse weighted Hamming loss. This proposed loss can be seen
as a weighted version of classical ones, where active and inactive labels are
weighted separately. Leveraging the influence of sparsity in the loss function,
we provide improved generalization bounds for the empirical risk minimizer, a
suitable property for large-scale problems. For this new loss, we derive rates
of convergence linear in the underlying output-sparsity rather than linear in
the number of labels. In practice, minimizing the associated risk can be
performed efficiently by using convex surrogates and modern convex optimization
algorithms. We provide experiments on various real-world datasets demonstrating
the pertinence of our approach when compared to non-weighted techniques.
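A small sketch of a separately weighted Hamming loss in the spirit described above; the particular weights for active and inactive labels are placeholders rather than the paper's calibrated choice.

```python
import numpy as np

def sparse_weighted_hamming(y_true, y_pred, w_active=1.0, w_inactive=0.1):
    """Weighted Hamming loss over (n_samples, n_labels) binary indicator arrays.

    Mistakes on active labels (true 1, predicted 0) and on inactive labels
    (true 0, predicted 1) are weighted separately, so with few active labels
    per example the loss tracks the output sparsity rather than the total
    number of labels.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    missed_active = (y_true & ~y_pred).mean(axis=1)   # false negatives per example
    spurious = (~y_true & y_pred).mean(axis=1)        # false positives per example
    return float((w_active * missed_active + w_inactive * spurious).mean())
```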
Complexities of convex combinations and bounding the generalization error in classification
We introduce and study several measures of complexity of functions from the
convex hull of a given base class. These complexity measures take into account
the sparsity of the weights of a convex combination as well as certain
clustering properties of the base functions involved in it. We prove new upper
confidence bounds on the generalization error of ensemble (voting)
classification algorithms that utilize the new complexity measures along with
the empirical distributions of classification margins, providing a better
explanation of generalization performance of large margin classification
methods. Comment: Published at http://dx.doi.org/10.1214/009053605000000228 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org)