49,080 research outputs found
Let's Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence
Block coordinate descent (BCD) methods are widely-used for large-scale
numerical optimization because of their cheap iteration costs, low memory
requirements, amenability to parallelization, and ability to exploit problem
structure. Three main algorithmic choices influence the performance of BCD
methods: the block partitioning strategy, the block selection rule, and the
block update rule. In this paper we explore all three of these building blocks
and propose variations for each that can lead to significantly faster BCD
methods. We (i) propose new greedy block-selection strategies that guarantee
more progress per iteration than the Gauss-Southwell rule; (ii) explore
practical issues like how to implement the new rules when using "variable"
blocks; (iii) explore the use of message-passing to compute matrix or Newton
updates efficiently on huge blocks for problems with a sparse dependency
between variables; and (iv) consider optimal active manifold identification,
which leads to bounds on the "active set complexity" of BCD methods and leads
to superlinear convergence for certain problems with sparse solutions (and in
some cases finite termination at an optimal solution). We support all of our
findings with numerical results for the classic machine learning problems of
least squares, logistic regression, multi-class logistic regression, label
propagation, and L1-regularization
Super-Linear Convergence of Dual Augmented-Lagrangian Algorithm for Sparsity Regularized Estimation
We analyze the convergence behaviour of a recently proposed algorithm for
regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is
based on a new interpretation of DAL as a proximal minimization algorithm. We
theoretically show under some conditions that DAL converges super-linearly in a
non-asymptotic and global sense. Due to a special modelling of sparse
estimation problems in the context of machine learning, the assumptions we make
are milder and more natural than those made in conventional analysis of
augmented Lagrangian algorithms. In addition, the new interpretation enables us
to generalize DAL to wide varieties of sparse estimation problems. We
experimentally confirm our analysis in a large scale -regularized
logistic regression problem and extensively compare the efficiency of DAL
algorithm to previously proposed algorithms on both synthetic and benchmark
datasets.Comment: 51 pages, 9 figure
Naive Feature Selection: Sparsity in Naive Bayes
Due to its linear complexity, naive Bayes classification remains an
attractive supervised learning method, especially in very large-scale settings.
We propose a sparse version of naive Bayes, which can be used for feature
selection. This leads to a combinatorial maximum-likelihood problem, for which
we provide an exact solution in the case of binary data, or a bound in the
multinomial case. We prove that our bound becomes tight as the marginal
contribution of additional features decreases. Both binary and multinomial
sparse models are solvable in time almost linear in problem size, representing
a very small extra relative cost compared to the classical naive Bayes.
Numerical experiments on text data show that the naive Bayes feature selection
method is as statistically effective as state-of-the-art feature selection
methods such as recursive feature elimination, -penalized logistic
regression and LASSO, while being orders of magnitude faster. For a large data
set, having more than with million training points and about million
features, and with a non-optimized CPU implementation, our sparse naive Bayes
model can be trained in less than 15 seconds
L1-Regularized Distributed Optimization: A Communication-Efficient Primal-Dual Framework
Despite the importance of sparsity in many large-scale applications, there
are few methods for distributed optimization of sparsity-inducing objectives.
In this paper, we present a communication-efficient framework for
L1-regularized optimization in the distributed environment. By viewing
classical objectives in a more general primal-dual setting, we develop a new
class of methods that can be efficiently distributed and applied to common
sparsity-inducing models, such as Lasso, sparse logistic regression, and
elastic net-regularized problems. We provide theoretical convergence guarantees
for our framework, and demonstrate its efficiency and flexibility with a
thorough experimental comparison on Amazon EC2. Our proposed framework yields
speedups of up to 50x as compared to current state-of-the-art methods for
distributed L1-regularized optimization
A General Framework of Large-Scale Convex Optimization Using Jensen Surrogates and Acceleration Techniques
In a world where data rates are growing faster than computing power, algorithmic acceleration based on developments in mathematical optimization plays a crucial role in narrowing the gap between the two. As the scale of optimization problems in many fields is getting larger, we need faster optimization methods that not only work well in theory, but also work well in practice by exploiting underlying state-of-the-art computing technology.
In this document, we introduce a unified framework of large-scale convex optimization using Jensen surrogates, an iterative optimization method that has been used in different fields since the 1970s. After this general treatment, we present non-asymptotic convergence analysis of this family of methods and the motivation behind developing accelerated variants. Moreover, we discuss widely used acceleration techniques for convex optimization and then investigate acceleration techniques that can be used within the Jensen surrogate framework while proposing several novel acceleration methods. Furthermore, we show that proposed methods perform competitively with or better than state-of-the-art algorithms for several applications including Sparse Linear Regression (Image Deblurring), Positron Emission Tomography, X-Ray Transmission Tomography, Logistic Regression, Sparse Logistic Regression and Automatic Relevance Determination for X-Ray Transmission Tomography
Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
Majorization-minimization algorithms consist of iteratively minimizing a
majorizing surrogate of an objective function. Because of its simplicity and
its wide applicability, this principle has been very popular in statistics and
in signal processing. In this paper, we intend to make this principle scalable.
We introduce a stochastic majorization-minimization scheme which is able to
deal with large-scale or possibly infinite data sets. When applied to convex
optimization problems under suitable assumptions, we show that it achieves an
expected convergence rate of after iterations, and of
for strongly convex functions. Equally important, our scheme almost
surely converges to stationary points for a large class of non-convex problems.
We develop several efficient algorithms based on our framework. First, we
propose a new stochastic proximal gradient method, which experimentally matches
state-of-the-art solvers for large-scale -logistic regression. Second,
we develop an online DC programming algorithm for non-convex sparse estimation.
Finally, we demonstrate the effectiveness of our approach for solving
large-scale structured matrix factorization problems.Comment: accepted for publication for Neural Information Processing Systems
(NIPS) 2013. This is the 9-pages version followed by 16 pages of appendices.
The title has changed compared to the first technical repor
Sparse Bilinear Logistic Regression
In this paper, we introduce the concept of sparse bilinear logistic
regression for decision problems involving explanatory variables that are
two-dimensional matrices. Such problems are common in computer vision,
brain-computer interfaces, style/content factorization, and parallel factor
analysis. The underlying optimization problem is bi-convex; we study its
solution and develop an efficient algorithm based on block coordinate descent.
We provide a theoretical guarantee for global convergence and estimate the
asymptotical convergence rate using the Kurdyka-{\L}ojasiewicz inequality. A
range of experiments with simulated and real data demonstrate that sparse
bilinear logistic regression outperforms current techniques in several
important applications.Comment: 27 pages, 5 figure
- …