A Tight Bound of Hard Thresholding
This paper is concerned with the hard thresholding operator which sets all
but the largest absolute elements of a vector to zero. We establish a {\em
tight} bound to quantitatively characterize the deviation of the thresholded
solution from a given signal. Our theoretical result is universal in the sense
that it holds for all choices of parameters, and the underlying analysis
depends only on fundamental arguments in mathematical optimization. We discuss
the implications for two domains:
Compressed Sensing. On account of the crucial estimate, we bridge the
connection between the restricted isometry property (RIP) and the sparsity
parameter for a vast volume of hard thresholding based algorithms, which
renders an improvement on the RIP condition especially when the true sparsity
is unknown. This suggests that in essence, many more kinds of sensing matrices
or fewer measurements are admissible for the data acquisition procedure.
Machine Learning. In terms of large-scale machine learning, a significant yet
challenging problem is learning accurate sparse models in an efficient manner.
In stark contrast to prior work that attempted the $\ell_1$-relaxation for
promoting sparsity, we present a novel stochastic algorithm which performs hard
thresholding in each iteration, hence ensuring such parsimonious solutions.
Equipped with the developed bound, we prove the {\em global linear convergence}
for a number of prevalent statistical models under mild assumptions, even
though the problem turns out to be non-convex.
Comment: V1 was submitted to COLT 2016. V2 fixes minor flaws, adds extra
experiments and discusses time complexity. V3 has been accepted to JMLR.
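For concreteness, the operator in question has a one-line definition: given a
sparsity level k, keep the k largest-magnitude entries of a vector and zero out
the rest. The following minimal NumPy sketch (ours, not code from the paper)
illustrates it.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    x = np.asarray(x, dtype=float)
    if k <= 0:
        return np.zeros_like(x)
    if k >= x.size:
        return x.copy()
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

# Example: keep the 2 largest-magnitude entries of a 5-dimensional vector.
print(hard_threshold([0.1, -3.0, 0.5, 2.0, -0.2], k=2))  # [ 0. -3.  0.  2.  0.]
```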
Stochastic Iterative Hard Thresholding for Graph-structured Sparsity Optimization
Stochastic optimization algorithms update models with cheap per-iteration
costs sequentially, which makes them amenable for large-scale data analysis.
Such algorithms have been widely studied for structured sparse models where the
sparsity information is very specific, e.g., convex sparsity-inducing norms or
the $\ell_0$-norm. However, these norms cannot be directly applied to the problem
of complex (non-convex) graph-structured sparsity models, which have important
applications in disease outbreak detection, social networks, etc. In this paper, we
propose a stochastic gradient-based method for solving graph-structured
sparsity constraint problems, not restricted to the least square loss. We prove
that our algorithm enjoys a linear convergence up to a constant error, which is
competitive with the counterparts in the batch learning setting. We conduct
extensive experiments to show the efficiency and effectiveness of the proposed
algorithms.
Comment: published in ICML-2019.
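The basic template behind methods of this kind is a stochastic gradient step
followed by a projection back onto the sparse constraint set. The sketch below
shows a generic stochastic IHT loop for least squares under a plain cardinality
constraint; it only illustrates the template, since the paper's graph-structured
constraint requires a model-specific (approximate) projection rather than simple
hard thresholding.

```python
import numpy as np

def hard_threshold(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def stochastic_iht(X, y, k, step=0.01, epochs=20, batch=32, seed=0):
    """Generic stochastic IHT for least squares under a plain sparsity constraint."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for _ in range(n // batch):
            idx = rng.choice(n, size=batch, replace=False)
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / batch     # mini-batch gradient of 0.5*||Xw - y||^2
            w = hard_threshold(w - step * grad, k)  # project back onto the sparsity constraint
    return w
```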
Dual Iterative Hard Thresholding: From Non-convex Sparse Minimization to Non-smooth Concave Maximization
Iterative Hard Thresholding (IHT) is a class of projected gradient descent
methods for optimizing sparsity-constrained minimization models, with the best
known efficiency and scalability in practice. As far as we know, the existing
IHT-style methods are designed for sparse minimization in primal form. It
remains open to explore duality theory and algorithms in such a non-convex and
NP-hard problem setting. In this paper, we bridge this gap by establishing a
duality theory for sparsity-constrained minimization with $\ell_2$-regularized
loss function and proposing an IHT-style algorithm for dual maximization. Our
sparse duality theory provides a set of sufficient and necessary conditions
under which the original NP-hard/non-convex problem can be equivalently solved
in a dual formulation. The proposed dual IHT algorithm is a super-gradient
method for maximizing the non-smooth dual objective. An interesting finding is
that the sparse recovery performance of dual IHT is invariant to the Restricted
Isometry Property (RIP), which is required by virtually all the existing primal
IHT algorithms without sparsity relaxation. Moreover, a stochastic variant of
dual IHT is proposed for large-scale stochastic optimization. Numerical results
demonstrate the superiority of dual IHT algorithms to the state-of-the-art
primal IHT-style algorithms in model estimation accuracy and computational
efficiency.
Nonconvex Sparse Learning via Stochastic Optimization with Progressive Variance Reduction
We propose a stochastic variance reduced optimization algorithm for solving
sparse learning problems with cardinality constraints. Sufficient conditions
are provided, under which the proposed algorithm enjoys strong linear
convergence guarantees and optimal estimation accuracy in high dimensions. We
further extend the proposed algorithm to an asynchronous parallel variant with
a near linear speedup. Numerical experiments demonstrate the efficiency of our
algorithm in terms of both parameter estimation and computational performance.
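A variance-reduced variant in this spirit replaces the plain stochastic gradient
with an SVRG-style estimator anchored at a periodically refreshed snapshot. The
sketch below, written for a least-squares loss and a plain cardinality
constraint, is an assumed illustration of that combination; the paper's precise
conditions, step sizes, and asynchronous parallel variant are not reproduced.

```python
import numpy as np

def hard_threshold(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def svrg_ht(X, y, k, step=0.05, outer=10, inner=200, seed=0):
    """Stochastic variance-reduced gradient steps combined with hard thresholding."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(outer):
        w_snap = w.copy()
        full_grad = X.T @ (X @ w_snap - y) / n      # full gradient at the snapshot
        for _ in range(inner):
            i = rng.integers(n)
            g_i = X[i] * (X[i] @ w - y[i])          # stochastic gradient at the current iterate
            g_snap = X[i] * (X[i] @ w_snap - y[i])  # same sample, evaluated at the snapshot
            v = g_i - g_snap + full_grad            # variance-reduced gradient estimator
            w = hard_threshold(w - step * v, k)     # keep the iterate k-sparse
    return w
```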
Sample Efficient Stochastic Gradient Iterative Hard Thresholding Method for Stochastic Sparse Linear Regression with Limited Attribute Observation
We develop new stochastic gradient methods for efficiently solving sparse
linear regression in a partial attribute observation setting, where learners
are only allowed to observe a fixed number of actively chosen attributes per
example at training and prediction times. It is shown that the methods achieve
essentially a sample complexity of $O(1/\varepsilon)$ to attain an error of
$\varepsilon$ under a variant of the restricted eigenvalue condition, and the rate
has better dependency on the problem dimension than existing methods.
Particularly, if the smallest magnitude of the non-zero components of the
optimal solution is not too small, the rate of our proposed {\it Hybrid}
algorithm can be boosted to near the minimax optimal sample complexity of {\it
full information} algorithms. The core ideas are (i) efficient construction of
an unbiased gradient estimator by the iterative usage of the hard thresholding
operator for configuring an exploration algorithm; and (ii) an adaptive
combination of the exploration and exploitation algorithms for quickly
identifying the support of the optimum and efficiently searching the optimal
parameter in its support. Experimental results are presented to validate our
theoretical findings and the superiority of our proposed methods.
Comment: 23 pages, 2 figures.
CNNs are Globally Optimal Given Multi-Layer Support
Stochastic Gradient Descent (SGD) is the central workhorse for training
modern CNNs. Although it delivers impressive empirical performance, it can be slow to
converge. In this paper we explore a novel strategy for training a CNN using an
alternation strategy that offers substantial speedups during training. We make
the following contributions: (i) replace the ReLU non-linearity within a CNN
with positive hard-thresholding, (ii) reinterpret this non-linearity as a
binary state vector making the entire CNN linear if the multi-layer support is
known, and (iii) demonstrate that under certain conditions a global optimum of
the CNN can be found through local descent. We then employ a novel alternation
strategy (between weights and support) for CNN training that leads to
substantially faster convergence rates, nice theoretical properties, and
state-of-the-art results across large-scale datasets (e.g., ImageNet)
as well as other standard benchmarks.
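As a rough illustration of contribution (i), the sketch below assumes that
"positive hard-thresholding" means keeping activations strictly above a
threshold b and zeroing the rest (b = 0 recovers ReLU on the surviving
entries); the binary mask is then the per-layer support/state vector referred
to in (ii). This is our reading of the abstract, not code from the paper.

```python
import numpy as np

def positive_hard_threshold(x, b=0.0):
    """Zero every activation at or below threshold b; keep the rest unchanged."""
    mask = x > b                 # binary state vector: the layer's 'support'
    return x * mask, mask

# With the mask fixed, the non-linearity reduces to a diagonal (linear) masking,
# which is what makes the whole network linear given the multi-layer support.
a = np.array([-1.0, 0.3, 2.5, 0.05])
out, support = positive_hard_threshold(a, b=0.1)
print(out)       # [0.  0.3 2.5 0. ]
print(support)   # [False  True  True False]
```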
Convergence of a Relaxed Variable Splitting Method for Learning Sparse Neural Networks via $\ell_1$, $\ell_0$, and transformed-$\ell_1$ Penalties
Sparsification of neural networks is one of the effective complexity
reduction methods to improve efficiency and generalizability. We consider the
problem of learning a one hidden layer convolutional neural network with ReLU
activation function via gradient descent under sparsity promoting penalties. It
is known that when the input data is Gaussian distributed, no-overlap networks
(without penalties) in regression problems with ground truth can be learned in
polynomial time with high probability. We propose a relaxed variable splitting
method integrating thresholding and gradient descent to overcome the
non-smoothness of the loss function. The sparsity in the network weights is realized
during the optimization (training) process. We prove that under the $\ell_1$, $\ell_0$,
and transformed-$\ell_1$ penalties, no-overlap networks can be learned
with high probability, and the iterative weights converge to a global limit
which is a transformation of the true weight under a novel thresholding
operation. Numerical experiments confirm theoretical findings, and compare the
accuracy and sparsity trade-off among the penalties.
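For reference, the three penalties named in the title can be written down
directly. The sketch below evaluates them on a weight vector, using the common
one-parameter form of the transformed-$\ell_1$ penalty,
$\rho_a(t) = (a+1)|t|/(a+|t|)$; whether this matches the exact variant and
parameterization used by the authors is an assumption on our part.

```python
import numpy as np

def l1_penalty(w):
    return np.sum(np.abs(w))

def l0_penalty(w):
    return np.count_nonzero(w)

def transformed_l1_penalty(w, a=1.0):
    # Assumed form: rho_a(t) = (a + 1)|t| / (a + |t|), interpolating between l0 and l1.
    t = np.abs(w)
    return np.sum((a + 1.0) * t / (a + t))

w = np.array([0.0, 0.5, -2.0, 0.0, 1.0])
print(l1_penalty(w), l0_penalty(w), transformed_l1_penalty(w, a=1.0))  # 3.5 3 3.0
```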
Trainlets: Dictionary Learning in High Dimensions
Sparse representation has been shown to be a very powerful model for real-world
signals, and has enabled the development of applications with notable
performance. Combined with the ability to learn a dictionary from signal
examples, sparsity-inspired algorithms often achieve state-of-the-art
results in a wide variety of tasks. Yet, these methods have traditionally been
restricted to small dimensions mainly due to the computational constraints that
the dictionary learning problem entails. In the context of image processing,
this implies handling small image patches. In this work we show how to
efficiently handle bigger dimensions and go beyond the small patches in
sparsity-based signal and image processing methods. We build our approach based
on a new cropped wavelet decomposition, which enables a multi-scale analysis
with virtually no border effects. We then employ this as the base dictionary
within a double sparsity model to enable the training of adaptive dictionaries.
To cope with the increase of training data, while at the same time improving
the training performance, we present an Online Sparse Dictionary Learning
(OSDL) algorithm to train this model effectively, enabling it to handle
millions of examples. This work shows that dictionary learning can be up-scaled
to tackle a new level of signal dimensions, obtaining large adaptable atoms
that we call trainlets.
On The Projection Operator to A Three-view Cardinality Constrained Set
The cardinality constraint is an intrinsic way to restrict the solution
structure in many domains, for example, sparse learning, feature selection, and
compressed sensing. To solve a cardinality constrained problem, the key
challenge is to solve the projection onto the cardinality constraint set, which
is NP-hard in general when there exist multiple overlapped cardinality
constraints. In this paper, we consider the scenario where the overlapped
cardinality constraints satisfy a Three-view Cardinality Structure (TVCS),
which reflects the natural restriction in many applications, such as
identification of gene regulatory networks and task-worker assignment problem.
We cast the projection as a linear program and show that, for TVCS, the
vertex solution of this linear program solves the original
projection problem. We further prove that such a solution can be found with
complexity proportional to the number of variables and constraints. We finally
use synthetic experiments and two interesting applications in bioinformatics
and crowdsourcing to validate the proposed TVCS model and method.
A Block Decomposition Algorithm for Sparse Optimization
Sparse optimization is a central problem in machine learning and computer
vision. However, this problem is inherently NP-hard and thus difficult to solve
in general. Combinatorial search methods find the global optimal solution but
are confined to small-sized problems, while coordinate descent methods are
efficient but often suffer from poor local minima. This paper considers a new
block decomposition algorithm that combines the effectiveness of combinatorial
search methods and the efficiency of coordinate descent methods. Specifically,
we consider a random strategy and/or a greedy strategy to select a subset of
coordinates as the working set, and then perform a global combinatorial search
over the working set based on the original objective function. We show that our
method finds stronger stationary points than Amir Beck et al.'s coordinate-wise
optimization method. In addition, we establish the convergence rate of our
algorithm. Our experiments on solving sparse regularized and sparsity
constrained least squares optimization problems demonstrate that our method
achieves state-of-the-art performance in terms of accuracy. For example, our
method generally outperforms the well-known greedy pursuit method.
Comment: to appear in SIGKDD 2020.
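To make the working-set idea concrete, here is a rough sketch for
sparsity-constrained least squares: repeatedly draw a small random block of
coordinates and exhaustively search the support patterns inside that block
while holding the remaining coordinates fixed. The block size, selection rule,
and sub-problem solver are our simplifying assumptions, not the paper's exact
procedure.

```python
import itertools
import numpy as np

def block_decomposition_ls(X, y, k, block_size=6, iters=50, seed=0):
    """Sparsity-constrained least squares via combinatorial search on random working sets."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        ws = rng.choice(d, size=block_size, replace=False)   # random working set
        outside = np.setdiff1d(np.arange(d), ws)
        budget = k - np.count_nonzero(w[outside])            # sparsity budget left for the block
        resid = y - X[:, outside] @ w[outside]               # fixed contribution of the rest
        best_w = w
        best_obj = 0.5 * np.sum((X @ w - y) ** 2)
        for r in range(max(budget, 0) + 1):                  # enumerate supports inside the block
            for support in itertools.combinations(ws.tolist(), r):
                cand = w.copy()
                cand[ws] = 0.0
                if support:
                    S = list(support)
                    cand[S] = np.linalg.lstsq(X[:, S], resid, rcond=None)[0]
                obj = 0.5 * np.sum((X @ cand - y) ** 2)
                if obj < best_obj:
                    best_w, best_obj = cand, obj
        w = best_w
    return w
```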