9,242 research outputs found

    A Tight Bound of Hard Thresholding

    This paper is concerned with the hard thresholding operator, which sets all but the $k$ largest absolute elements of a vector to zero. We establish a {\em tight} bound that quantitatively characterizes the deviation of the thresholded solution from a given signal. Our theoretical result is universal in the sense that it holds for all choices of parameters, and the underlying analysis depends only on fundamental arguments in mathematical optimization. We discuss the implications for two domains. Compressed sensing: building on the crucial estimate, we establish the connection between the restricted isometry property (RIP) and the sparsity parameter for a large class of hard thresholding based algorithms, which improves the RIP condition especially when the true sparsity is unknown. This suggests that, in essence, many more kinds of sensing matrices or fewer measurements are admissible for the data acquisition procedure. Machine learning: for large-scale machine learning, a significant yet challenging problem is learning accurate sparse models in an efficient manner. In stark contrast to prior work that relied on the $\ell_1$-relaxation to promote sparsity, we present a novel stochastic algorithm that performs hard thresholding in each iteration, hence ensuring such parsimonious solutions. Equipped with the developed bound, we prove {\em global linear convergence} for a number of prevalent statistical models under mild assumptions, even though the problem turns out to be non-convex.
    Comment: V1 was submitted to COLT 2016. V2 fixes minor flaws, adds extra experiments and discusses time complexity. V3 has been accepted to JML
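    For orientation, the following is a minimal NumPy sketch of the $k$-sparse hard thresholding operator the abstract describes; the function name and the tie-breaking rule are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest.

    Ties are broken arbitrarily by argpartition; this matches the usual
    convention but is an assumption rather than the paper's specification.
    """
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    if k <= 0:
        return out
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

# Example: keep the 2 largest-magnitude entries.
print(hard_threshold([0.5, -3.0, 1.2, 0.1], 2))  # [ 0.  -3.   1.2  0. ]
```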

    Stochastic Iterative Hard Thresholding for Graph-structured Sparsity Optimization

    Stochastic optimization algorithms update models sequentially with cheap per-iteration costs, which makes them amenable to large-scale data analysis. Such algorithms have been widely studied for structured sparse models where the sparsity information is very specific, e.g., convex sparsity-inducing norms or the $\ell^0$-norm. However, these norms cannot be directly applied to the problem of complex (non-convex) graph-structured sparsity models, which have important applications in disease outbreak detection, social networks, etc. In this paper, we propose a stochastic gradient-based method for solving graph-structured sparsity constraint problems, not restricted to the least squares loss. We prove that our algorithm enjoys linear convergence up to a constant error, which is competitive with its counterparts in the batch learning setting. We conduct extensive experiments to show the efficiency and effectiveness of the proposed algorithms.
    Comment: published in ICML-201
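    As a rough point of reference, a stochastic iterative hard thresholding loop for a plain $k$-sparsity constraint might look like the sketch below; the graph-structured (head/tail) projections that the paper actually handles are the hard part and are not shown here, and the step size and batching scheme are assumptions.

```python
import numpy as np

def stochastic_iht(A, y, k, step=0.01, batch=32, iters=500, rng=None):
    """Minimize 0.5*||Ax - y||^2 s.t. ||x||_0 <= k using minibatch gradients
    followed by hard thresholding (plain k-sparsity, not graph-structured)."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=min(batch, n), replace=False)
        g = A[idx].T @ (A[idx] @ x - y[idx]) / len(idx)  # minibatch gradient
        x = x - step * g
        # project back onto the k-sparse set
        keep = np.argpartition(np.abs(x), -k)[-k:]
        mask = np.zeros(d, dtype=bool)
        mask[keep] = True
        x[~mask] = 0.0
    return x
```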

    Dual Iterative Hard Thresholding: From Non-convex Sparse Minimization to Non-smooth Concave Maximization

    Iterative Hard Thresholding (IHT) is a class of projected gradient descent methods for optimizing sparsity-constrained minimization models, with the best known efficiency and scalability in practice. As far as we know, the existing IHT-style methods are designed for sparse minimization in primal form. It remains open to explore duality theory and algorithms in such a non-convex and NP-hard problem setting. In this paper, we bridge this gap by establishing a duality theory for sparsity-constrained minimization with an $\ell_2$-regularized loss function and proposing an IHT-style algorithm for dual maximization. Our sparse duality theory provides a set of necessary and sufficient conditions under which the original NP-hard/non-convex problem can be equivalently solved in a dual formulation. The proposed dual IHT algorithm is a super-gradient method for maximizing the non-smooth dual objective. An interesting finding is that the sparse recovery performance of dual IHT is invariant to the Restricted Isometry Property (RIP), which is required by virtually all existing primal IHT algorithms without sparsity relaxation. Moreover, a stochastic variant of dual IHT is proposed for large-scale stochastic optimization. Numerical results demonstrate the superiority of dual IHT algorithms over the state-of-the-art primal IHT-style algorithms in model estimation accuracy and computational efficiency.
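    For concreteness, the primal problem class referred to above can be written as follows; the symbols $f_i$, $n$, $\lambda$, and $k$ are notational assumptions for illustration, and the dual maximization problem derived in the paper is not reproduced here.

```latex
% A generic statement of the primal problem class discussed above:
% k-sparsity-constrained minimization of an \ell_2-regularized empirical loss.
% The symbols f_i, n, \lambda, and k are assumptions for illustration only.
\[
  \min_{w \in \mathbb{R}^d,\; \|w\|_0 \le k}
  \quad \frac{1}{n}\sum_{i=1}^{n} f_i(w) \;+\; \frac{\lambda}{2}\,\|w\|_2^2 .
\]
% The paper constructs an equivalent dual maximization problem for this class
% whose objective is concave but non-smooth, and ascends it with super-gradients.
```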

    Nonconvex Sparse Learning via Stochastic Optimization with Progressive Variance Reduction

    We propose a stochastic variance-reduced optimization algorithm for solving sparse learning problems with cardinality constraints. Sufficient conditions are provided under which the proposed algorithm enjoys strong linear convergence guarantees and optimal estimation accuracy in high dimensions. We further extend the proposed algorithm to an asynchronous parallel variant with near-linear speedup. Numerical experiments demonstrate the efficiency of our algorithm in terms of both parameter estimation and computational performance.
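    A minimal sketch of the general pattern, assuming an SVRG-style variance-reduced gradient combined with hard thresholding onto the cardinality constraint, is given below; the paper's precise update, step sizes, and conditions differ, so this is illustrative only.

```python
import numpy as np

def svrg_ht(A, y, k, step=0.05, epochs=20, inner=None, rng=None):
    """SVRG-style variance-reduced gradients plus hard thresholding for
    0.5/n * ||Ax - y||^2 s.t. ||x||_0 <= k (an illustrative sketch only)."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    inner = inner or n
    x = np.zeros(d)

    def ht(v):
        keep = np.argpartition(np.abs(v), -k)[-k:]
        out = np.zeros_like(v)
        out[keep] = v[keep]
        return out

    for _ in range(epochs):
        x_ref = x.copy()
        full_grad = A.T @ (A @ x_ref - y) / n      # snapshot full gradient
        for _ in range(inner):
            i = rng.integers(n)
            gi = A[i] * (A[i] @ x - y[i])          # stochastic gradient at x
            gi_ref = A[i] * (A[i] @ x_ref - y[i])  # same sample at the snapshot
            x = ht(x - step * (gi - gi_ref + full_grad))
    return x
```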

    Sample Efficient Stochastic Gradient Iterative Hard Thresholding Method for Stochastic Sparse Linear Regression with Limited Attribute Observation

    We develop new stochastic gradient methods for efficiently solving sparse linear regression in a partial attribute observation setting, where learners are only allowed to observe a fixed number of actively chosen attributes per example at training and prediction time. It is shown that the methods achieve essentially a sample complexity of $O(1/\varepsilon)$ to attain an error of $\varepsilon$ under a variant of the restricted eigenvalue condition, and the rate has a better dependency on the problem dimension than existing methods. In particular, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the rate of our proposed {\it Hybrid} algorithm can be boosted to near the minimax optimal sample complexity of {\it full information} algorithms. The core ideas are (i) efficient construction of an unbiased gradient estimator by iterative use of the hard thresholding operator for configuring an exploration algorithm, and (ii) an adaptive combination of the exploration and exploitation algorithms for quickly identifying the support of the optimum and efficiently searching for the optimal parameter within its support. Experimental results are presented to validate our theoretical findings and the superiority of our proposed methods.
    Comment: 23 pages, 2 figure
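    To illustrate only the unbiasedness idea behind limited-attribute gradient estimation, the sketch below splits the observation budget across two independently drawn attribute subsets (one for the residual, one for the gradient direction) so their product stays unbiased; the function name and scheme are assumptions and do not reproduce the paper's estimator or its exploration/exploitation strategy.

```python
import numpy as np

def limited_attribute_grad(a, y, x, budget, rng=None):
    """Unbiased estimate of the gradient of 0.5*(a.x - y)^2 when only `budget`
    (>= 2) attributes of the example a may be observed.

    Two independent attribute subsets are drawn: one estimates the residual,
    the other forms the gradient direction, so the product remains unbiased.
    This is an illustrative assumption, not the paper's actual estimator.
    """
    rng = np.random.default_rng(rng)
    d = a.shape[0]
    s1 = rng.choice(d, size=budget // 2, replace=False)           # residual attributes
    s2 = rng.choice(d, size=budget - budget // 2, replace=False)  # direction attributes
    resid_hat = (d / len(s1)) * (a[s1] @ x[s1]) - y               # E[resid_hat] = a.x - y
    g = np.zeros(d)
    g[s2] = (d / len(s2)) * a[s2] * resid_hat                     # E[g] = a * (a.x - y)
    return g
```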

    CNNs are Globally Optimal Given Multi-Layer Support

    Stochastic Gradient Descent (SGD) is the central workhorse for training modern CNNs. Although it gives impressive empirical performance, it can be slow to converge. In this paper we explore a novel strategy for training a CNN using an alternation scheme that offers substantial speedups during training. We make the following contributions: (i) we replace the ReLU non-linearity within a CNN with positive hard thresholding, (ii) we reinterpret this non-linearity as a binary state vector, making the entire CNN linear if the multi-layer support is known, and (iii) we demonstrate that under certain conditions a global optimum of the CNN can be found through local descent. We then employ a novel alternation strategy (between weights and support) for CNN training that leads to substantially faster convergence, has nice theoretical properties, and achieves state-of-the-art results across large-scale datasets (e.g., ImageNet) as well as other standard benchmarks.
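    One plausible reading of "positive hard thresholding" is keeping pre-activations above a positive threshold and zeroing the rest; the threshold rule below (a fixed tau rather than, say, a top-k selection) is an assumption. The sketch also shows the binary state vector that makes the layer linear once the support is fixed.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def positive_hard_threshold(z, tau=0.0):
    """Zero out entries not exceeding the positive threshold tau, keep the rest.
    With tau = 0 this coincides with ReLU; the exact rule used in the paper
    (e.g. a top-k variant) is an assumption here."""
    return np.where(z > tau, z, 0.0)

def support_mask(z, tau=0.0):
    """Binary state vector: once this mask is fixed, the map z -> mask * z is
    linear, which is what enables alternating over (weights, support)."""
    return (z > tau).astype(float)

z = np.array([-1.0, 0.2, 3.5])
print(positive_hard_threshold(z, tau=0.5))                              # [0.  0.  3.5]
print(np.allclose(positive_hard_threshold(z, 0.5), support_mask(z, 0.5) * z))  # True
```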

    Convergence of a Relaxed Variable Splitting Method for Learning Sparse Neural Networks via $\ell_1$, $\ell_0$, and transformed-$\ell_1$ Penalties

    Sparsification of neural networks is an effective complexity reduction method for improving efficiency and generalizability. We consider the problem of learning a one-hidden-layer convolutional neural network with ReLU activation function via gradient descent under sparsity-promoting penalties. It is known that when the input data is Gaussian distributed, no-overlap networks (without penalties) in regression problems with ground truth can be learned in polynomial time with high probability. We propose a relaxed variable splitting method integrating thresholding and gradient descent to overcome the non-smoothness of the loss function. The sparsity of the network weights is realized during the optimization (training) process. We prove that under the $\ell_1$, $\ell_0$, and transformed-$\ell_1$ penalties, no-overlap networks can be learned with high probability, and the iterative weights converge to a global limit which is a transformation of the true weights under a novel thresholding operation. Numerical experiments confirm the theoretical findings and compare the accuracy and sparsity trade-off among the penalties.
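    For reference, the standard thresholding (proximal) maps associated with the $\ell_1$ and $\ell_0$ penalties are soft and hard thresholding, sketched below; the transformed-$\ell_1$ proximal map also has a closed form but is longer and is omitted here. This is background, not the paper's relaxed variable splitting iteration itself.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal map of lam*||x||_1: shrink each entry toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def hard_threshold_l0(x, lam):
    """Proximal map of lam*||x||_0: zero out entries with x_i^2 <= 2*lam."""
    return np.where(x ** 2 > 2.0 * lam, x, 0.0)

x = np.array([-0.3, 0.05, 1.4])
print(soft_threshold(x, 0.1))     # [-0.2  0.   1.3]
print(hard_threshold_l0(x, 0.1))  # [ 0.   0.   1.4]
```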

    Trainlets: Dictionary Learning in High Dimensions

    Sparse representations have been shown to be a very powerful model for real-world signals and have enabled the development of applications with notable performance. Combined with the ability to learn a dictionary from signal examples, sparsity-inspired algorithms often achieve state-of-the-art results in a wide variety of tasks. Yet, these methods have traditionally been restricted to small dimensions, mainly due to the computational constraints that the dictionary learning problem entails. In the context of image processing, this implies handling small image patches. In this work we show how to efficiently handle larger dimensions and go beyond small patches in sparsity-based signal and image processing methods. We build our approach on a new cropped wavelet decomposition, which enables a multi-scale analysis with virtually no border effects. We then employ this as the base dictionary within a double sparsity model to enable the training of adaptive dictionaries. To cope with the increase in training data while also improving the training performance, we present an Online Sparse Dictionary Learning (OSDL) algorithm to train this model effectively, enabling it to handle millions of examples. This work shows that dictionary learning can be up-scaled to tackle a new level of signal dimensions, obtaining large adaptable atoms that we call trainlets.
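    The double sparsity structure mentioned above can be pictured as an effective dictionary that is the product of a fixed base dictionary and a sparse coefficient matrix; the sketch below only illustrates that structure (with a random stand-in for the cropped-wavelet base dictionary and made-up sizes), not the OSDL training algorithm.

```python
import numpy as np

# Double-sparsity structure (illustration only, not the OSDL algorithm):
# the trained dictionary is D = Phi @ A, where Phi is a fixed base dictionary
# (a cropped-wavelet dictionary in the paper; here a random stand-in) and A is
# sparse, so D never has to be stored or applied as a dense full-scale matrix.
rng = np.random.default_rng(0)
signal_dim, base_atoms, learned_atoms = 1024, 2048, 512   # assumed sizes

Phi = rng.standard_normal((signal_dim, base_atoms)) / np.sqrt(signal_dim)
A = np.zeros((base_atoms, learned_atoms))
for j in range(learned_atoms):           # each learned atom combines a few base atoms
    rows = rng.choice(base_atoms, size=8, replace=False)
    A[rows, j] = rng.standard_normal(8)

x = rng.standard_normal(learned_atoms)   # a code over the learned atoms
y = Phi @ (A @ x)                        # apply D = Phi @ A without forming D
```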

    On The Projection Operator to A Three-view Cardinality Constrained Set

    The cardinality constraint is an intrinsic way to restrict the solution structure in many domains, for example, sparse learning, feature selection, and compressed sensing. To solve a cardinality-constrained problem, the key challenge is computing the projection onto the cardinality constraint set, which is NP-hard in general when there exist multiple overlapping cardinality constraints. In this paper, we consider the scenario where the overlapping cardinality constraints satisfy a Three-view Cardinality Structure (TVCS), which reflects the natural restriction in many applications, such as identification of gene regulatory networks and the task-worker assignment problem. We cast the projection as a linear program and show that for TVCS, the vertex solution of this linear program is the solution of the original projection problem. We further prove that such a solution can be found with complexity proportional to the number of variables and constraints. We finally use synthetic experiments and two interesting applications in bioinformatics and crowdsourcing to validate the proposed TVCS model and method.
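    One natural LP encoding of such a projection selects coordinates to maximize the retained energy subject to the cardinality bounds; the sketch below does this for a simple case with one global bound and two disjoint group bounds (for which LP vertices happen to be integral). This is an assumption for illustration, not necessarily the paper's exact TVCS formulation.

```python
import numpy as np
from scipy.optimize import linprog

def project_grouped(z, k_global, k_groups, groups):
    """Project z onto vectors whose support obeys a global cardinality bound and
    per-group bounds, via an LP over selection variables s in [0, 1]."""
    d = len(z)
    c = -(np.asarray(z, dtype=float) ** 2)      # maximize retained energy sum s_i z_i^2
    A_ub = [np.ones(d)]                         # global bound: sum_i s_i <= k_global
    b_ub = [k_global]
    for g, kg in zip(groups, k_groups):         # group bounds: sum_{i in g} s_i <= kg
        row = np.zeros(d)
        row[g] = 1.0
        A_ub.append(row)
        b_ub.append(kg)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0.0, 1.0)] * d, method="highs")
    s = np.round(res.x)                         # vertex solutions are integral here;
    return np.where(s > 0.5, np.asarray(z), 0.0)  # rounding only guards against noise

z = np.array([3.0, -1.0, 0.5, 2.0, -2.5])
print(project_grouped(z, k_global=3, k_groups=[2, 2], groups=[[0, 1, 2], [3, 4]]))
# [ 3.   0.   0.   2.  -2.5]
```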

    A Block Decomposition Algorithm for Sparse Optimization

    Sparse optimization is a central problem in machine learning and computer vision. However, this problem is inherently NP-hard and thus difficult to solve in general. Combinatorial search methods find the globally optimal solution but are confined to small problems, while coordinate descent methods are efficient but often suffer from poor local minima. This paper considers a new block decomposition algorithm that combines the effectiveness of combinatorial search methods with the efficiency of coordinate descent methods. Specifically, we consider a random strategy and/or a greedy strategy to select a subset of coordinates as the working set, and then perform a global combinatorial search over the working set based on the original objective function. We show that our method finds stronger stationary points than the coordinate-wise optimization method of Amir Beck et al. In addition, we establish the convergence rate of our algorithm. Our experiments on sparse-regularized and sparsity-constrained least squares optimization problems demonstrate that our method achieves state-of-the-art accuracy; for example, it generally outperforms the well-known greedy pursuit method.
    Comment: to appear in SIGKDD 202
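    A minimal sketch of one such step, assuming a random working set and a sparsity-constrained least squares objective, is given below: coordinates outside the working set stay fixed while the working set is solved exactly by enumeration. The working-set size and selection rule are assumptions, not the paper's exact algorithm.

```python
import itertools
import numpy as np

def block_decomposition_step(A, y, x, k, work_size=4, rng=None):
    """One step for min 0.5*||Ax - y||^2 s.t. ||x||_0 <= k (illustrative sketch):
    pick a random working set, then exactly search over which working-set
    coordinates are nonzero, keeping all other coordinates of x fixed."""
    rng = np.random.default_rng(rng)
    d = A.shape[1]
    W = rng.choice(d, size=min(work_size, d), replace=False)
    outside = np.setdiff1d(np.arange(d), W)
    budget = k - np.count_nonzero(x[outside])     # sparsity left for the working set
    r_fixed = y - A[:, outside] @ x[outside]      # residual from the fixed block
    best_x, best_val = x.copy(), 0.5 * np.sum((A @ x - y) ** 2)
    for size in range(0, min(budget, len(W)) + 1):
        for S in itertools.combinations(W, size): # choose which coordinates are nonzero
            cand = x.copy()
            cand[W] = 0.0
            if size:
                S = list(S)
                sol, *_ = np.linalg.lstsq(A[:, S], r_fixed, rcond=None)
                cand[S] = sol
            val = 0.5 * np.sum((A @ cand - y) ** 2)
            if val < best_val:
                best_x, best_val = cand, val
    return best_x
```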