
    Exclusive Sparsity Norm Minimization with Random Groups via Cone Projection

    Many practical applications, such as gene expression analysis, multi-task learning, image recognition, signal processing, and medical data analysis, pursue a sparse solution for feature selection and particularly favor nonzeros that are \emph{evenly} distributed across different groups. The exclusive sparsity norm has been widely used to serve this purpose. However, systematic studies of exclusive sparsity norm optimization are still lacking. This paper offers two main contributions from the optimization perspective: 1) we provide several efficient algorithms to solve exclusive sparsity norm minimization with either a smooth loss or the hinge loss (a non-smooth loss); all algorithms achieve the optimal convergence rate $O(1/k^2)$ ($k$ is the iteration number), and to the best of our knowledge this is the first work to guarantee such a convergence rate for general exclusive sparsity norm minimization; 2) when group information is unavailable to define the exclusive sparsity norm, we propose a random grouping scheme to construct groups and prove that, if the number of groups is chosen appropriately, the nonzeros (true features) are grouped in the ideal way with high probability. Empirical studies validate the efficiency of the proposed algorithms and the effectiveness of the random grouping scheme on the proposed exclusive SVM formulation.
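
    A minimal NumPy sketch of the two ingredients named in the abstract: the exclusive sparsity penalty, usually written as the sum over groups of the squared $\ell_1$ norm of each group's coefficients, and a uniformly random assignment of features to groups. The group count, seed, and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def random_groups(n_features, n_groups, seed=0):
    """Assign each feature to one of n_groups groups uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_groups, size=n_features)

def exclusive_sparsity_norm(x, groups):
    """Exclusive sparsity penalty: sum over groups of the squared l1 norm
    of the coefficients belonging to that group."""
    total = 0.0
    for g in np.unique(groups):
        total += np.abs(x[groups == g]).sum() ** 2
    return total

x = np.array([0.0, 1.5, -0.3, 0.0, 2.0, 0.0])
groups = random_groups(len(x), n_groups=3)
print(exclusive_sparsity_norm(x, groups))
```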

    Fast Sparse Least-Squares Regression with Non-Asymptotic Guarantees

    In this paper, we study a fast approximation method for the {\it large-scale high-dimensional} sparse least-squares regression problem by exploiting Johnson-Lindenstrauss (JL) transforms, which embed a set of high-dimensional vectors into a low-dimensional space. In particular, we propose to apply JL transforms to the data matrix and the target vector and then solve a sparse least-squares problem on the compressed data with a {\it slightly larger regularization parameter}. Theoretically, we establish the optimization error bound of the learned model for two different sparsity-inducing regularizers, i.e., the elastic net and the $\ell_1$ norm. Compared with previous relevant work, our analysis is {\it non-asymptotic and exhibits more insights} into the bound, the sample complexity, and the regularization. As an illustration, we also provide an error bound for the {\it Dantzig selector} under JL transforms.
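
    A hedged sketch of the compression step described above: a Gaussian JL-style sketch is applied to the data matrix and target vector, and a lasso is then solved on the compressed pair with a somewhat larger regularization parameter. The problem sizes, the choice of sketch, the regularization value, and the use of scikit-learn's Lasso solver are illustrative assumptions, not the paper's prescription.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, m = 2000, 500, 200               # samples, features, sketch size (m << n)
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:10] = 1.0                      # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(n)

# Gaussian JL-style sketch: i.i.d. N(0, 1/m) entries so that E[S^T S] = I,
# and (S A, S b) approximately preserves the least-squares geometry.
S = rng.standard_normal((m, n)) / np.sqrt(m)
A_s, b_s = S @ A, S @ b

# Solve the lasso on the compressed data with a slightly larger
# regularization parameter (value chosen for illustration only).
model = Lasso(alpha=0.05, max_iter=10000).fit(A_s, b_s)
print("recovered support:", np.flatnonzero(model.coef_)[:10])
```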

    Fast and Scalable Lasso via Stochastic Frank-Wolfe Methods with a Convergence Guarantee

    Frank-Wolfe (FW) algorithms have often been proposed over the last few years as efficient solvers for a variety of optimization problems arising in the field of Machine Learning. The ability to work with cheap projection-free iterations and the incremental nature of the method make FW a very effective choice for many large-scale problems where computing a sparse model is desirable. In this paper, we present a high-performance implementation of the FW method tailored to solve large-scale Lasso regression problems, based on a randomized iteration, and prove that the convergence guarantees of the standard FW method are preserved in the stochastic setting. We show experimentally that our algorithm outperforms several existing state-of-the-art methods, including the Coordinate Descent algorithm by Friedman et al. (one of the fastest known Lasso solvers), on several benchmark datasets with a very large number of features, without sacrificing the accuracy of the model. Our results illustrate that the algorithm is able to generate the complete regularization path on problems with up to four million variables in less than one minute.
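
    A minimal sketch of a randomized Frank-Wolfe iteration for the $\ell_1$-constrained least-squares (Lasso) problem: each step evaluates the linear minimization oracle only over a random block of coordinates and then takes the standard convex-combination update. The block size, step rule, and function name are assumptions for illustration; the paper's randomized scheme may differ in its details.

```python
import numpy as np

def stochastic_fw_lasso(A, b, tau, n_iters=500, block=50, seed=0):
    """Frank-Wolfe for min 0.5*||Ax - b||^2 subject to ||x||_1 <= tau,
    with the linear minimization oracle restricted to a random coordinate
    block at each iteration (a simple randomized variant)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(n_iters):
        grad = A.T @ (A @ x - b)
        idx = rng.choice(d, size=min(block, d), replace=False)
        i = idx[np.argmax(np.abs(grad[idx]))]   # best atom within the block
        s = np.zeros(d)
        s[i] = -tau * np.sign(grad[i])          # vertex of the l1 ball
        gamma = 2.0 / (k + 2.0)                 # standard FW step size
        x = (1 - gamma) * x + gamma * s         # stays inside the l1 ball
    return x
```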

    Randomized sketch descent methods for non-separable linearly constrained optimization

    In this paper we consider large-scale smooth optimization problems with multiple linear coupled constraints. Due to the non-separability of the constraints, arbitrary random sketching is not guaranteed to work. Thus, we first investigate necessary and sufficient conditions on the sketch sampling for the algorithms to be well defined. Based on these sampling conditions, we develop new sketch descent methods for solving general smooth linearly constrained problems, in particular random sketch descent and accelerated random sketch descent methods. To our knowledge, this is the first convergence analysis of random sketch descent algorithms for optimization problems with multiple non-separable linear constraints. In the general case, when the objective function is smooth and non-convex, we prove a sublinear rate in expectation for the non-accelerated variant with respect to an appropriate optimality measure. In the smooth convex case, we derive sublinear convergence rates in the expected objective value for both the non-accelerated and accelerated random sketch descent methods. Additionally, if the objective function satisfies a strong-convexity-type condition, both algorithms converge linearly in expectation. In special cases where complexity bounds are known for particular sketching algorithms, such as coordinate descent methods for optimization problems with a single linear coupled constraint, our theory recovers the best-known bounds. We also show that sketching the coordinate directions randomly produces better results than a fixed selection rule. Finally, we present some numerical examples to illustrate the performance of our new algorithms. Comment: 28 pages.
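
    A minimal sketch of the special case the abstract mentions: two-coordinate descent for a problem with a single coupled constraint sum(x) = const, where moving along e_i - e_j keeps every iterate feasible. The paper generalizes this kind of feasibility-preserving "sketching" to multiple non-separable constraints and adds an accelerated variant; the function name, step rule, and Lipschitz bound L below are assumptions for illustration.

```python
import numpy as np

def pairwise_cd(grad, x0, L, n_iters=2000, seed=0):
    """Random two-coordinate descent for min f(x) s.t. sum(x) = sum(x0).
    A gradient step along the direction e_i - e_j preserves the coupled
    constraint; L is an assumed coordinate-wise Lipschitz bound on grad f."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    d = x.size
    for _ in range(n_iters):
        i, j = rng.choice(d, size=2, replace=False)
        g = grad(x)
        t = -(g[i] - g[j]) / (2.0 * L)   # step along e_i - e_j
        x[i] += t
        x[j] -= t
    return x

# Example: project y onto the hyperplane {x : sum(x) = 1}
# by minimizing 0.5*||x - y||^2 over that constraint.
y = np.array([0.9, 0.3, -0.2, 0.5])
x = pairwise_cd(lambda x: x - y, np.full(4, 0.25), L=1.0)
print(x)
```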

    A Field Guide to Forward-Backward Splitting with a FASTA Implementation

    Non-differentiable and constrained optimization play a key role in machine learning, signal and image processing, communications, and beyond. For high-dimensional minimization problems involving large datasets or many unknowns, the forward-backward splitting method provides a simple, practical solver. Despite its apparent simplicity, the performance of forward-backward splitting is highly sensitive to implementation details. This article is an introductory review of forward-backward splitting with a special emphasis on practical implementation concerns. Issues like stepsize selection, acceleration, stopping conditions, and initialization are considered. Numerical experiments are used to compare the effectiveness of different approaches. Many variations of forward-backward splitting are implemented in the solver FASTA (short for Fast Adaptive Shrinkage/Thresholding Algorithm). FASTA provides a simple interface for applying forward-backward splitting to a broad range of problems.
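
    A minimal forward-backward splitting loop for the lasso, the kind of composite problem FASTA targets: a gradient ("forward") step on the smooth term followed by a proximal ("backward") step on the nonsmooth term. The fixed stepsize and iteration count are simplifying assumptions; FASTA's value lies precisely in the adaptive stepsizes, acceleration, and stopping rules omitted here.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1 (the 'backward' step)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def forward_backward_lasso(A, b, lam, n_iters=500):
    """Forward-backward splitting for min 0.5*||Ax - b||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    tau = 1.0 / np.linalg.norm(A, 2) ** 2        # 1/L, L = Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                 # forward (explicit gradient) step
        x = soft_threshold(x - tau * grad, tau * lam)  # backward (proximal) step
    return x
```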

    Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces

    In this paper, we set forth a new vision of reinforcement learning developed by us over the past few years, one that yields mathematically rigorous solutions to longstanding important questions that have remained unresolved: (i) how to design reliable, convergent, and robust reinforcement learning algorithms; (ii) how to guarantee that reinforcement learning satisfies pre-specified "safety" guarantees and remains in a stable region of the parameter space; (iii) how to design "off-policy" temporal difference learning algorithms in a reliable and stable manner; and finally (iv) how to integrate the study of reinforcement learning into the rich theory of stochastic optimization. In this paper, we provide detailed answers to all these questions using the powerful framework of proximal operators. The key idea that emerges is the use of primal-dual spaces connected through a Legendre transform. This allows temporal difference updates to occur in dual spaces, yielding a variety of important technical advantages. The Legendre transform elegantly generalizes past algorithms for solving reinforcement learning problems, such as natural gradient methods, which we show relate closely to the previously unconnected framework of mirror descent methods. Equally importantly, proximal operator theory enables the systematic development of operator splitting methods that show how to safely and reliably decompose complex products of gradients that occur in recent variants of gradient-based temporal difference learning. This key technical innovation makes it possible to finally design "true" stochastic gradient methods for reinforcement learning. Finally, Legendre transforms enable a variety of other benefits, including modeling sparsity and domain geometry. Our work builds extensively on recent work on the convergence of saddle-point algorithms, and on the theory of monotone operators. Comment: 121 pages.
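
    For concreteness, a sketch of a gradient-TD style update with an auxiliary dual weight vector, in the spirit of the primal-dual structure discussed above. The specific rule shown is the standard two-timescale GTD2 update, not necessarily the mirror-descent or operator-splitting variant developed in the paper; all names and stepsizes are illustrative.

```python
import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One GTD2 update: theta holds the primal (value-function) weights,
    w is an auxiliary dual vector; phi and phi_next are feature vectors
    of the current and next state."""
    delta = reward + gamma * theta @ phi_next - theta @ phi          # TD error
    w = w + beta * (delta - w @ phi) * phi                           # dual update
    theta = theta + alpha * (phi - gamma * phi_next) * (w @ phi)     # primal update
    return theta, w
```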

    Proximal Distance Algorithms: Theory and Examples

    Proximal distance algorithms combine the classical penalty method of constrained minimization with distance majorization. If $f(\boldsymbol{x})$ is the loss function and $C$ is the constraint set in a constrained minimization problem, then the proximal distance principle mandates minimizing the penalized loss $f(\boldsymbol{x})+\frac{\rho}{2}\mathop{dist}(\boldsymbol{x},C)^2$ and following the solution $\boldsymbol{x}_{\rho}$ to its limit as $\rho$ tends to $\infty$. At each iteration the squared Euclidean distance $\mathop{dist}(\boldsymbol{x},C)^2$ is majorized by the spherical quadratic $\|\boldsymbol{x}-P_C(\boldsymbol{x}_k)\|^2$, where $P_C(\boldsymbol{x}_k)$ denotes the projection of the current iterate $\boldsymbol{x}_k$ onto $C$. The minimum of the surrogate function $f(\boldsymbol{x})+\frac{\rho}{2}\|\boldsymbol{x}-P_C(\boldsymbol{x}_k)\|^2$ is given by the proximal map $\mathop{prox}_{\rho^{-1}f}[P_C(\boldsymbol{x}_k)]$. The next iterate $\boldsymbol{x}_{k+1}$ automatically decreases the original penalized loss for fixed $\rho$. Since many explicit projections and proximal maps are known, it is straightforward to derive and implement novel optimization algorithms in this setting. These algorithms can take hundreds if not thousands of iterations to converge, but the stereotyped nature of each iteration makes proximal distance algorithms competitive with traditional algorithms. For convex problems, we prove global convergence. Our numerical examples include a) linear programming, b) nonnegative quadratic programming, c) projection to the closest kinship matrix, d) projection onto a second-order cone constraint, e) calculation of Horn's copositive matrix index, f) linear complementarity programming, and g) sparse principal components analysis. The proximal distance algorithm in each case is competitive or superior in speed to traditional methods. Comment: 23 pages, 2 figures, 7 tables.
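
    A minimal sketch of the iteration described above, specialized to nonnegative least squares with $f(\boldsymbol{x}) = \frac{1}{2}\|A\boldsymbol{x}-\boldsymbol{b}\|^2$ and $C$ the nonnegative orthant, where both the projection and the proximal map have simple closed forms. The annealing schedule for $\rho$ and the iteration budget are illustrative assumptions, not the paper's tuning.

```python
import numpy as np

def proximal_distance_nnls(A, b, rho=1.0, rho_growth=1.2, n_iters=200):
    """Proximal distance sketch for nonnegative least squares.
    Each iteration projects the iterate onto C (the nonnegative orthant),
    then minimizes f(x) + (rho/2)*||x - P_C(x_k)||^2, i.e. applies
    prox_{f/rho} at the projected point, while rho grows toward infinity."""
    d = A.shape[1]
    AtA, Atb = A.T @ A, A.T @ b
    x = np.zeros(d)
    for _ in range(n_iters):
        p = np.maximum(x, 0.0)                                   # P_C(x_k)
        x = np.linalg.solve(AtA + rho * np.eye(d), Atb + rho * p)  # prox step
        rho *= rho_growth                                        # anneal the penalty
    return np.maximum(x, 0.0)
```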

    On the Suboptimality of Proximal Gradient Descent for $\ell^0$ Sparse Approximation

    We study the proximal gradient descent (PGD) method for the $\ell^0$ sparse approximation problem, as well as its acceleration with randomized algorithms. We first offer a theoretical analysis of PGD showing a bounded gap between the sub-optimal solution produced by PGD and the globally optimal solution of the $\ell^0$ sparse approximation problem, under conditions weaker than the Restricted Isometry Property widely used in the compressive sensing literature. Moreover, we propose randomized algorithms to accelerate the optimization by PGD using randomized low-rank matrix approximation (PGD-RMA) and randomized dimension reduction (PGD-RDR). Our randomized algorithms substantially reduce the computation cost of the original PGD for the $\ell^0$ sparse approximation problem, and the resultant sub-optimal solution still enjoys provable suboptimality: the sub-optimal solution to the reduced problem still has a bounded gap to the globally optimal solution of the original problem.
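
    A minimal sketch of plain PGD for the $\ell^0$-penalized least-squares objective, whose proximal map is elementwise hard thresholding. The randomized accelerations (PGD-RMA and PGD-RDR) are omitted, and the stepsize and iteration count are illustrative assumptions.

```python
import numpy as np

def hard_threshold(z, t):
    """Proximal map of t*||.||_0: zero out entries with |z_i| <= sqrt(2t)."""
    out = z.copy()
    out[np.abs(z) <= np.sqrt(2.0 * t)] = 0.0
    return out

def pgd_l0(A, b, lam, n_iters=300):
    """Proximal gradient descent for min 0.5*||Ax - b||^2 + lam*||x||_0."""
    x = np.zeros(A.shape[1])
    tau = 1.0 / np.linalg.norm(A, 2) ** 2       # stepsize 1/L
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                # gradient step on the smooth term
        x = hard_threshold(x - tau * grad, tau * lam)  # prox of the l0 penalty
    return x
```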

    Efficient numerical algorithms for regularized regression problem with applications to traffic matrix estimations

    In this work we collect and compare many different numerical methods for the regularized regression problem and for the problem of projection onto a hyperplane. Such problems arise, for example, as a subproblem of demand matrix estimation in IP networks. In this special case the matrix of affine constraints has a special structure: all its elements are 0 or 1, and the matrix is quite sparse. We thus have to deal with a huge-scale convex optimization problem of a special type. Using the properties of the problem, we try "to look inside the black box" and to see how the best modern methods behave when applied to this problem. Comment: 16 pages; Information Technologies and Systems. Sochi: September, 201
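
    For the hyperplane-projection subproblem mentioned above there is a simple closed form, $P(\boldsymbol{x}) = \boldsymbol{x} - \frac{\boldsymbol{a}^\top\boldsymbol{x} - c}{\|\boldsymbol{a}\|^2}\,\boldsymbol{a}$, sketched below with a 0/1 incidence-style row as in the traffic-matrix setting. The example vectors and names are purely illustrative.

```python
import numpy as np

def project_onto_hyperplane(x, a, c):
    """Euclidean projection of x onto the hyperplane {y : a^T y = c}."""
    return x - (a @ x - c) / (a @ a) * a

# a is a sparse 0/1 routing-style row; values chosen for illustration.
a = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
x = np.array([2.0, 1.0, 0.5, -1.0, 3.0])
print(project_onto_hyperplane(x, a, c=4.0))
```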

    Sparse Trace Norm Regularization

    We study the problem of estimating multiple predictive functions from a dictionary of basis functions in the nonparametric regression setting. Our estimation scheme assumes that each predictive function can be estimated as a linear combination of the basis functions. By assuming that the coefficient matrix admits a sparse low-rank structure, we formulate the function estimation problem as a convex program regularized by the trace norm and the $\ell_1$-norm simultaneously. We propose to solve the convex program using the accelerated gradient (AG) method and the alternating direction method of multipliers (ADMM), respectively; we also develop efficient algorithms to solve the key components in both AG and ADMM. In addition, we conduct a theoretical analysis of the proposed function estimation scheme: we derive a key property of the optimal solution to the convex program and, based on an assumption on the basis functions, we establish a performance bound of the proposed function estimation scheme (via the composite regularization). Simulation studies demonstrate the effectiveness and efficiency of the proposed algorithms.
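
    The key components inside both AG and ADMM for such a composite penalty are the proximal maps of the two regularizers; the composite penalty itself needs a joint prox that the paper develops and that is not reproduced here. Below is a sketch of the two standard building blocks, singular-value soft-thresholding for the trace norm and elementwise soft-thresholding for the $\ell_1$-norm, with illustrative function names.

```python
import numpy as np

def prox_l1(W, t):
    """Elementwise soft-thresholding: prox of t*||W||_1."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def prox_trace_norm(W, t):
    """Singular-value soft-thresholding: prox of t*||W||_* (trace norm)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt
```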